Hello!

On 03.11.2009 23:38, Reece Dunn wrote:
>> # wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
>> pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
>> python -c 'from xml.parsers.expat import ParserCreate;
>> ParserCreate().ParseFile(open("x.xml"))'
>
> I'm not sure what the fix is, but the line with the error is:
> <text top="632" left="152" width="58" height="0"
> font="7">¥§¦©¨¥§¦¨ ¦</text>
> and firefox gives:
> <text top="632" left="152" width="58" height="0"
> font="7">¥§¦©¨¥§¦¨ ¦</text>
> ---------------------------------------------------------------^
> (that is -- it is choking on the [00|11] character; there are also
> other characters in the latin-1 control character range (c < 0x20)).

Right. 0x11 is the /first/ one to cause a problem with the Python XML parser.
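
To see exactly which bytes the parser objects to, something like the following could be used. It is only a minimal Python 3 sketch, not a fix for pdftohtml: the helper names and the `sample` document are made up for illustration, and it assumes the file really is UTF-8 (so every byte below 0x80 is a standalone character). XML 1.0 allows only TAB, LF and CR below 0x20, which is why expat rejects 0x11.

```python
import re
from xml.parsers.expat import ParserCreate, ExpatError

# XML 1.0 permits only TAB (0x09), LF (0x0A) and CR (0x0D) below 0x20.
# Scanning bytes is safe for UTF-8 input: bytes < 0x80 never occur
# inside a multi-byte sequence.
ILLEGAL_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_bytes(data):
    """Return (offset, byte value) for every control byte XML 1.0 forbids."""
    return [(m.start(), data[m.start()]) for m in ILLEGAL_XML_BYTES.finditer(data)]


def strip_illegal_bytes(data):
    """Drop the forbidden bytes. This loses information; it is a workaround only."""
    return ILLEGAL_XML_BYTES.sub(b"", data)


def parses_ok(data):
    """True if expat accepts the document, False if it raises ExpatError."""
    try:
        ParserCreate().Parse(data, True)
        return True
    except ExpatError:
        return False


# Hypothetical stand-in for one <text> element of x.xml, with a 0x11 byte in it.
sample = b'<?xml version="1.0" encoding="UTF-8"?>\n<text font="7">a\x11b</text>'

print(find_illegal_bytes(sample))
print(parses_ok(sample), parses_ok(strip_illegal_bytes(sample)))
```

Stripping the bytes only makes the output parseable; the glyphs those bytes stood for are still lost, so the real problem remains in how the characters are generated.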

> My initial thought is that the characters are referencing the Unicode
> codepoints (e.g. in the U+2100 range). However, these all appear to be
> in the ASCII range (i.e. not multi-byte UTF-8 as the encoding
> suggests, but I may be wrong as there look to be more characters than
> what is displayed).

These problematic characters are all ASCII control characters.

> Instead, they look like they are codepoints into a special
> mathematical font (e.g. Symbol(?) in Windows (I don't have a Windows
> box to hand at the moment, so can't verify the font name)). This would
> make sense given the font="7" attribute and the seemingly random
> characters. And given the greater number of characters, this looks to
> be using a non-UTF-8 multi-byte encoding.

The font="7" attribute is generated by "pdftohtml -xml"; it is a reference
to the <font id="7" ...... /> element near the top of the produced XML
document.

And yes, there is some font mapping involved. I tried writing the equation
in a new .tex document, but the resulting PDF contained only characters I
recognise and can read, no matter how I produced it (pdflatex, latex &
dvipdf, etc.).

> Someone will need to dig around in the pdftohtml code and the
> rendering of non-ASCII characters.

I agree this is where the problem begins, though I've never seen
pdftohtml's source...

best regards,
Piotr
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
