Package: docx2txt
Version: 1.4-1
Severity: normal

Subject: docx2txt fails on docx files which contain legitimate whitespace in their XML

a .docx file is a PKZip container.  each of its members is an XML file.

XML generally allows whitespace between elements and inside tags (for
example, before the closing ">"), though most .docx generators do not
emit any.

consider the attached hello-world.docx.  it works with docx2txt:

    $ docx2txt < hello-world.docx
    Hello world
    $

but a simple script can pretty-print all the files in there, and after that 
conversion, docx2txt dumps something XML-ish instead of text:

    $ cat pretty-print-docx
    #!/bin/bash
    workdir="$(mktemp -d)"
    mkdir -p "$workdir/in" "$workdir/out"
    declare -a files
    files=($( zipinfo -1 "$1" ))
    unzip -q "$1" -d "$workdir/in"
    for x in "${files[@]}"; do
      mkdir -p "$workdir/out/$(dirname "$x")"
      xmllint --pretty 2 --output "$workdir/out/$x" "$workdir/in/$x"
    done
    (cd "$workdir/out" && zip -q - "${files[@]}") > "$(basename "$1" .docx).pretty.docx"
    $ ./pretty-print-docx hello-world.docx
    $ docx2txt < hello-world.pretty.docx 
    <?<w:document
        xmlns:o="urn:schemas-microsoft-com:office:office"
        xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
        xmlns:v="urn:schemas-microsoft-com:vml"
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        xmlns:w10="urn:schemas-microsoft-com:office:word"
        xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
        xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
        xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
        xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
        mc:Ignorable="w14 wp14"
      ><w:body
        ><w:p
          ><w:pPr
            ><w:pStyle
                w:val="Normal"
            /><w:rPr
            /></w:pPr
          ><w:r
            ><w:rPr
            /><w:t
              >Hello world</w:t
            ></w:r
          ></w:p
        ><w:sectPr
          ><w:type
              w:val="nextPage"
          /><w:pgSz
              w:w="12240"
              w:h="15840"
          /><w:pgMar
              w:left="1134"
              w:right="1134"
              w:header="0"
              w:top="1134"
              w:footer="0"
              w:bottom="1134"
              w:gutter="0"
          /><w:pgNumType
              w:fmt="decimal"
          /><w:formProt
              w:val="false"
          /><w:textDirection
              w:val="lrTb"
          /></w:sectPr
        ></w:body
      ></w:document
    >
    $ 

Note that LibreOffice has no problem with this pretty-printed version.
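
one plausible cause (an assumption, not checked against the docx2txt
source) is that text extraction matches tags with patterns that expect
">" to follow the tag name immediately.  the sketch below illustrates
that failure mode on a fragment shaped like the output above.  the
naive and tolerant functions are hypothetical helpers written for this
illustration, not docx2txt code, and they ignore attributes on <w:t>:

```shell
#!/bin/sh
# Two serializations of the same element. The second one has
# whitespace inside the tags, as "xmllint --pretty 2" produces.
compact='<w:r><w:t>Hello world</w:t></w:r>'
pretty='<w:r
  ><w:t
    >Hello world</w:t
  ></w:r
>'

# Hypothetical naive extractor: assumes ">" directly follows the tag name.
naive() {
  printf '%s' "$1" | tr '\n' ' ' |
    sed -n 's/.*<w:t>\([^<]*\)<\/w:t>.*/\1/p'
}

# Hypothetical tolerant extractor: allows optional whitespace before ">".
tolerant() {
  printf '%s' "$1" | tr '\n' ' ' |
    sed -n 's/.*<w:t[[:space:]]*>\([^<]*\)<\/w:t[[:space:]]*>.*/\1/p'
}

printf 'naive,    compact: [%s]\n' "$(naive "$compact")"
printf 'naive,    pretty:  [%s]\n' "$(naive "$pretty")"
printf 'tolerant, compact: [%s]\n' "$(tolerant "$compact")"
printf 'tolerant, pretty:  [%s]\n' "$(tolerant "$pretty")"
```

the naive pattern finds the text only in the compact form, while the
tolerant one finds it in both.  a real XML parser would avoid the
problem entirely, since both forms parse to the same document.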

     --dkg


    

-- System Information:
Debian Release: buster/sid
  APT prefers testing-debug
  APT policy: (500, 'testing-debug'), (500, 'testing'), (500, 'oldstable'), (200, 'unstable-debug'), (200, 'unstable'), (1, 'experimental-debug'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 4.17.0-1-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages docx2txt depends on:
ii  unzip  6.0-21

docx2txt recommends no packages.

docx2txt suggests no packages.

-- no debconf information

