Package: docx2txt
Version: 1.4-1
Severity: normal
Subject: docx2txt fails on docx files which contain legitimate whitespace in
their XML
A .docx file is a PKZip container, and each of its members is an XML file.
XML generally allows whitespace in its markup (for example, before the
closing ">" of a tag), though most .docx generators do not emit any.
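To make the whitespace point concrete, here is a minimal sketch of how a literal string match can miss a tag that a tolerant match still finds. It is an illustration only: the function names count_literal_end_tags and count_tolerant_end_tags are hypothetical, and nothing here is taken from docx2txt's source, so it should not be read as a diagnosis of the actual bug.

```shell
#!/bin/bash
# Two serializations of the same element. XML permits whitespace
# (including newlines) before the closing '>' of a start or end tag.
compact='<w:t>Hello world</w:t>'
pretty=$'<w:t\n>Hello world</w:t\n>'

# Hypothetical naive matcher: counts lines containing the literal "</w:t>".
count_literal_end_tags() {
    grep -c -F '</w:t>' <<<"$1"
}

# Hypothetical tolerant matcher: allows optional whitespace before '>'.
# Newlines become spaces first, because grep matches one line at a time.
count_tolerant_end_tags() {
    tr '\n' ' ' <<<"$1" | grep -c -E '</w:t[[:space:]]*>'
}

echo "compact, literal:  $(count_literal_end_tags "$compact")"
echo "pretty,  literal:  $(count_literal_end_tags "$pretty")"
echo "pretty,  tolerant: $(count_tolerant_end_tags "$pretty")"
```

The literal matcher sees the end tag only in the compact form, while the tolerant one sees it in both. A real XML parser would avoid the question entirely.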
Consider the attached hello-world.docx. It converts correctly with docx2txt:
$ docx2txt < hello-world.docx
Hello world
$
But a simple script can pretty-print every file in the container, and after
that conversion docx2txt dumps something XML-ish instead of text:
$ cat pretty-print-docx
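#!/bin/bash
# Pretty-print every member of a .docx and repack it as NAME.pretty.docx.
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT
mkdir -p "$workdir/in" "$workdir/out"
# Read one member name per line so names containing spaces survive.
declare -a files
mapfile -t files < <(zipinfo -1 "$1")
unzip -q "$1" -d "$workdir/in"
for x in "${files[@]}"; do
    mkdir -p "$workdir/out/$(dirname "$x")"
    # --pretty 2 adds whitespace inside the tags, preserving content.
    xmllint --pretty 2 --output "$workdir/out/$x" "$workdir/in/$x"
done
(cd "$workdir/out" && zip -q - "${files[@]}") > "$(basename "$1" .docx).pretty.docx"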
$ ./pretty-print-docx hello-world.docx
$ docx2txt < hello-world.pretty.docx
<?<w:document
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
mc:Ignorable="w14 wp14"
><w:body
><w:p
><w:pPr
><w:pStyle
w:val="Normal"
/><w:rPr
/></w:pPr
><w:r
><w:rPr
/><w:t
>Hello world</w:t
></w:r
></w:p
><w:sectPr
><w:type
w:val="nextPage"
/><w:pgSz
w:w="12240"
w:h="15840"
/><w:pgMar
w:left="1134"
w:right="1134"
w:header="0"
w:top="1134"
w:footer="0"
w:bottom="1134"
w:gutter="0"
/><w:pgNumType
w:fmt="decimal"
/><w:formProt
w:val="false"
/><w:textDirection
w:val="lrTb"
/></w:sectPr
></w:body
></w:document
>
$
Note that LibreOffice has no problem with this pretty-printed version.
--dkg
-- System Information:
Debian Release: buster/sid
APT prefers testing-debug
APT policy: (500, 'testing-debug'), (500, 'testing'), (500, 'oldstable'),
(200, 'unstable-debug'), (200, 'unstable'), (1, 'experimental-debug'), (1,
'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386
Kernel: Linux 4.17.0-1-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8),
LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
Versions of packages docx2txt depends on:
ii unzip 6.0-21
docx2txt recommends no packages.
docx2txt suggests no packages.
-- no debconf information