I'm glad to hear that the situation has improved. At present, I'm unfamiliar with how other characters, such as non-ASCII Latin characters, are embedded by popular PDF production workflows and how they should be handled in poppler. If you have similar trouble in the future, please post to this list!

Regards,
mpsuzuki

On 11/29/2013 11:58 PM, Paweł Leń wrote:
Hello :)

Everything works fine, thank you very much!

Best regards
--
Paweł Leń

2013/11/15 suzuki toshiya <[email protected]>

How about this?

Regards,
mpsuzuki

On 11/15/2013 04:26 PM, suzuki toshiya wrote:

I'm trying to fix this issue by inserting myXmlTokenReplace() into printInfoString().

Regards,
mpsuzuki

On 11/14/2013 10:42 PM, Paweł Leń wrote:

These are the contents of the file output.xml generated by the command

pdftotext -bbox -htmlmeta 'myfile.pdf' 'output.xml'

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Microsoft Word - Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc</title>
<meta name="Author" content="Teodora"/>
<meta name="Creator" content="PScript5.dll Version 5.2.2"/>
<meta name="Producer" content="Acrobat Distiller 8.0.0 (Windows)"/>
<meta name="CreationDate" content=""/>
</head>
<body>
<doc>
<page width="482.000000" height="680.000000">
<word xMin="255.120000" yMin="190.576860" xMax="338.055540" yMax="207.269700">Advances</word>
<word xMin="344.000562" yMin="190.576860" xMax="359.331702" yMax="207.269700">in</word>
<word xMin="365.276724" yMin="190.576860" xMax="425.239584" yMax="207.269700">Lasers</word>
<word xMin="256.260624" yMin="207.256884" xMax="288.954240" yMax="223.949724">and</word>
<word xMin="294.884844" yMin="207.256884" xMax="363.168492" yMax="223.949724">Electro</word>
<word xMin="369.099096" yMin="207.256884" xMax="425.265216" yMax="223.949724">Optics</word>
</page>
</doc>
</body>
</html>

As you can see, in line 3 the <title> tag contains an invalid character sequence with "&". The title is extracted from myfile.pdf. CDATA or some kind of htmlspecialchars is needed.

--
Paweł Leń

2013/11/14 suzuki toshiya <[email protected]>

Hi,

If you could post a sample XML file in which you have modified the output of pdftotext so that it fits the XML parser, it would help some kind person to develop a patch.

Regards,
mpsuzuki

On 11/14/2013 10:04 PM, Paweł Leń wrote:

Hello,

I get an error when running:

pdftotext -bbox -htmlmeta 'myfile.pdf' 'tempFile.xml'

The output XML has a <title> tag at the beginning of the document (meta section); the error appears when the title contains the "&" character. The title field has no CDATA and is not escaped, so it causes an error in my xmllib parser.

Can I (or you :) ) fix it somehow?

Best regards
--
Paweł Leń

_______________________________________________ poppler mailing list [email protected] http://lists.freedesktop.org/mailman/listinfo/poppler
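
For readers of this thread: the failure comes from writing the PDF's Title string into <title> without escaping XML special characters. The sketch below illustrates the idea in Python (the reporter's parser, xmllib, is a Python module). It is not poppler's code: poppler is written in C++, and the thread describes the fix only as inserting myXmlTokenReplace() into printInfoString(). The names render_title and is_well_formed are hypothetical helpers introduced here for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_title(title: str) -> str:
    """Wrap a raw title string in <title>, escaping &, <, >, and quotes."""
    # html.escape turns "&" into "&amp;", "<" into "&lt;", ">" into "&gt;",
    # and (with quote=True) also escapes double and single quotes.
    return "<title>" + escape(title, quote=True) + "</title>"


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    raw = "Microsoft Word - Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc"
    # Unescaped, as in the reported output: the bare "&" breaks the parser.
    print(is_well_formed("<title>" + raw + "</title>"))
    # Escaped: the same title now parses.
    print(is_well_formed(render_title(raw)))
```

The same escaping would apply to attribute values such as the content of the <meta> tags, since an Author or Producer string can also contain "&" or quotes.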
