On Tuesday, 3 November 2009, Piotr Findeisen wrote:
> Hi!
>
> I started using pdftohtml from Debian's poppler-utils package for
> document analysis and ran across a problem: `pdftohtml -xml' can
> produce invalid XML on output (at least invalid for Python's XML tools).
>
> Test case:
>
> # wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
>   pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
>   python -c 'from xml.parsers.expat import ParserCreate;
>   ParserCreate().ParseFile(open("x.xml"))'
>
> Page-1
> Traceback (most recent call last):
>   File "<string>", line 2, in <module>
> xml.parsers.expat.ExpatError: not well-formed (invalid token): line 45,
> column 63
>
> The problematic character is \x11.
>
> I'm running version 0.12 of pdftohtml, installed from the Debian
> poppler-utils_0.12.0-2_i386 package.
>
> pdftohtml -v
> pdftohtml version 0.12.0
> Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
> Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
> Copyright 1996-2004 Glyph & Cog, LLC
Can you please post a bug at bugs.freedesktop.org?

> How can I work around this problem?

You can code a patch or wait until someone fixes it.

Albert

> Best regards,
> Piotr Findeisen
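Until pdftohtml itself is fixed, one possible stopgap on the consumer side is to drop the control characters that XML 1.0 forbids (such as the \x11 reported above) before handing the file to expat. The following is only a minimal sketch, not a fix for pdftohtml: the helper names `strip_invalid_xml_bytes` and `parse_cleaned` are hypothetical, and it assumes the output file uses an ASCII-compatible encoding such as UTF-8, so that bytes below 0x20 never occur inside a multi-byte sequence.

```python
import re
from xml.parsers.expat import ParserCreate

# XML 1.0 allows only tab (0x09), LF (0x0A) and CR (0x0D) among the
# C0 control characters; every other byte below 0x20 is rejected.
_INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_xml_bytes(data: bytes) -> bytes:
    """Return data with XML-illegal C0 control bytes removed.

    Assumption: data is in an ASCII-compatible encoding (e.g. UTF-8).
    """
    return _INVALID_XML_BYTES.sub(b"", data)


def parse_cleaned(path: str) -> None:
    """Read a file, strip illegal control bytes, and parse it with expat."""
    with open(path, "rb") as f:
        data = f.read()
    parser = ParserCreate()
    # Second argument True marks this as the final chunk of input.
    parser.Parse(strip_invalid_xml_bytes(data), True)
```

With the test case above, `parse_cleaned("x.xml")` would replace the `ParseFile` call. Note that this silently discards the offending characters, so any text that depended on them is lost.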
