Package: html2text
Version: 1.3.2a-14
Severity: normal

Hi,

Trying to create the Spanish documentation of aptitude (from its repository,
revision 3228:c354bd7ae8c7) using
$ make -C debug/doc/es
which calls

rm -fr output-txt
xsltproc -o output-txt/index.html ../../../doc/es/../aptitude-txt.xsl aptitude.xml
Error: no ID for constraint linkend: configAptInstallRecommends.
html2text -width 80 -ascii -nobs -rcfile ../../../doc/es/../aptitude-txt.style output-txt/index.html | ../../../doc/es/../fixup-text > README.es

results in a bogus text file README.es:

First of all, a lot of UTF-8 characters are used despite -ascii (in a UTF-8
environment). Examples from the first lines of the file:

<quote>
Versi????n 0.5.3.1
Copyright ???? 2004-2008 Daniel Burrows
</quote>

After removing the -ascii option (which doesn't work as expected), one still
doesn't get a proper UTF-8 file:

aptitude/debug/doc/es$ html2text -width 80 -rcfile ../../../doc/es/../aptitude-txt.style output-txt/index.html | \
  grep "la mitad inferior de la pantalla"
|                                                 |??rea de informaci??n
(la mitad inferior de la pantalla). El ??rea de informaci???|n

As you can see, the problem is the vertical column separator |, which
probably falls between the two bytes of the last multibyte character and
thus makes the file invalid UTF-8.
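
The suspected effect can be reproduced in isolation. The following is a
minimal sketch, not html2text's actual code: the word and the helper
is_valid_utf8 are chosen here for illustration only. It inserts a separator
at a byte offset inside a two-byte sequence and compares that with inserting
it on a character boundary.

```python
# Hypothetical illustration; this is not how html2text is implemented.
text = "información"
encoded = text.encode("utf-8")  # 'ó' is encoded as the two bytes 0xC3 0xB3


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Insert the separator at a byte offset that lands between 0xC3 and 0xB3.
split_at = encoded.index(b"\xc3") + 1
broken = encoded[:split_at] + b"|" + encoded[split_at:]

# Insert the separator on a character boundary instead.
char_split = text.index("ó") + 1
intact = (text[:char_split] + "|" + text[char_split:]).encode("utf-8")

print(is_valid_utf8(broken))  # separator splits the multibyte character
print(is_valid_utf8(intact))  # separator follows the complete character
```

If the column separator is placed by counting bytes rather than characters,
the first case is what a UTF-8 validator would see.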

I assumed this would be easy to reproduce, but my attempt failed with a
different error:

$ html2text -width 10 test.html
Input recoding failed due to invalid input sequence. Unconverted part of text follows.
#|??
|?????? ??#|??
|?????? ??#|??
|?????? ??#|??
|?????? ??#|??
|?????? ??#|??
|??????____|

This error message is wrong: test.html is a proper HTML file in latin1 encoding!
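
One possible explanation, stated as an assumption and not as a diagnosis of
html2text: if the input were decoded as UTF-8 (for example because the locale
charmap is UTF-8), valid latin1 bytes would look like an invalid input
sequence. The sketch below uses a sample string and a helper, decodes_as,
that are both made up for illustration.

```python
# Hypothetical illustration; the sample text and helper name are assumptions.
sample = "öäü"
latin1_bytes = sample.encode("latin-1")  # one byte per character


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly with the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as(latin1_bytes, "latin-1"))  # the bytes are valid latin1
print(decodes_as(latin1_bytes, "utf-8"))    # the same bytes read as UTF-8

# Recoding with the correct source encoding gives well-formed UTF-8.
recoded = latin1_bytes.decode("latin-1").encode("utf-8")
print(decodes_as(recoded, "utf-8"))
```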

So many errors ...

Jens

-- System Information:
Debian Release: squeeze/sid
  APT prefers testing
  APT policy: (900, 'testing'), (800, 'unstable'), (500, 'stable')
Architecture: i386 (i686)

Kernel: Linux 2.6.26 (SMP w/1 CPU core)
Locale: LANG=de_DE.utf8, LC_CTYPE=de_DE.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages html2text depends on:
ii  libc6                         2.9-25     GNU C Library: Shared libraries
ii  libgcc1                       1:4.4.1-1  GCC support library
ii  libstdc++6                    4.4.1-1    The GNU Standard C++ Library v3

html2text recommends no packages.

Versions of packages html2text suggests:
ii  curl                          7.19.5-1   Get a file from an HTTP, HTTPS or 
ii  wget                          1.11.4-4   retrieves files from the web

-- no debconf information
öäü öäü öäü öäü öäü öäü öäü öäü öäü öäü öäü öäü öäü
