[poppler] pdftohtml produces invalid XML

Piotr Findeisen Tue, 03 Nov 2009 04:06:18 -0800

Hi!

I started using pdftohtml from Debian's poppler-utils package for
document analysis and ran across a problem: `pdftohtml -xml' can
produce invalid XML on output (at least, invalid for Python's XML tools).


Test case:

    # wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
        pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
        python -c 'from xml.parsers.expat import ParserCreate; 
ParserCreate().ParseFile(open("x.xml"))'

    Page-1
    Traceback (most recent call last):
      File "<string>", line 2, in <module>
    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 45, column 63

The problematic character is \x11.

I'm running version 0.12 of pdftohtml, installed from the Debian
poppler-utils_0.12.0-2_i386 package.

    pdftohtml -v
    pdftohtml version 0.12.0
    Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
    Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
    Copyright 1996-2004 Glyph & Cog, LLC
      


How can I work around this problem?

Best regards,
Piotr Findeisen
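
One possible stopgap for the question above, sketched under assumptions rather than offered as a fix in pdftohtml itself: XML 1.0 only permits tab, newline, carriage return and code points from U+0020 upwards (minus surrogates, U+FFFE and U+FFFF), so \x11 is rejected by any conforming parser. The helper name `strip_invalid_xml_chars` is hypothetical, and the code is written for Python 3. It simply drops the offending characters before the text reaches expat, which loses whatever the PDF meant by them.

```python
import re
from xml.parsers.expat import ParserCreate, ExpatError

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 forbids removed (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub("", text)


def is_well_formed(text):
    """Return True if expat accepts text as a complete document."""
    try:
        ParserCreate().Parse(text, True)
        return True
    except ExpatError:
        return False


# A stand-in for one line of pdftohtml output containing the bad byte.
sample = "<text>a\x11b</text>"
cleaned = strip_invalid_xml_chars(sample)
```

To apply this to the test case, one would read x.xml as text, pass it through `strip_invalid_xml_chars`, and parse the result; that assumes the file is UTF-8, which I have not verified for this version of pdftohtml.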
