On Friday, 15 November 2013, at 19:04:11, suzuki toshiya wrote:
> How about this?
Makes sense. Committed.

Cheers,
  Albert

>
> Regards,
> mpsuzuki
>
> On 11/15/2013 04:26 PM, suzuki toshiya wrote:
> > I'm trying to fix this issue by an insertion of myXmlTokenReplace()
> > into printInfoString().
> >
> > Regards,
> > mpsuzuki
> >
> > On 11/14/2013 10:42 PM, Paweł Leń wrote:
> >> This is the contents of the file output.xml generated by the command
> >> pdftotext -bbox -htmlmeta 'myfile.pdf' 'output.xml':
> >>
> >> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> >> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
> >> xmlns="http://www.w3.org/1999/xhtml"> <head>
> >> <title>Microsoft Word -
> >> Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc</title> <meta
> >> name="Author" content="Teodora"/>
> >> <meta name="Creator" content="PScript5.dll Version 5.2.2"/>
> >> <meta name="Producer" content="Acrobat Distiller 8.0.0 (Windows)"/>
> >> <meta name="CreationDate" content=""/>
> >> </head>
> >> <body>
> >> <doc>
> >>
> >> <page width="482.000000" height="680.000000">
> >>
> >> <word xMin="255.120000" yMin="190.576860" xMax="338.055540"
> >> yMax="207.269700">Advances</word> <word xMin="344.000562"
> >> yMin="190.576860" xMax="359.331702" yMax="207.269700">in</word>
> >> <word xMin="365.276724" yMin="190.576860" xMax="425.239584"
> >> yMax="207.269700">Lasers</word> <word xMin="256.260624"
> >> yMin="207.256884" xMax="288.954240" yMax="223.949724">and</word>
> >> <word xMin="294.884844" yMin="207.256884" xMax="363.168492"
> >> yMax="223.949724">Electro</word> <word xMin="369.099096"
> >> yMin="207.256884" xMax="425.265216" yMax="223.949724">Optics</word>
> >>
> >> </page>
> >>
> >> </doc>
> >> </body>
> >> </html>
> >>
> >>
> >> As you can see, in line 3 the <title> tag contains an invalid character
> >> sequence with "&". The title is extracted from myfile.pdf. CDATA or some
> >> kind of htmlspecialchars is needed.
> >> > >> > >> > >> > >> *-- > >> * > >> > >> *Paweł Leń* > >> > >> > >> > >> 2013/11/14 suzuki toshiya <[email protected] > >> <mailto:[email protected]>>>> > >> Hi, > >> > >> If you could post a sample XML file that you modified the > >> output of pdftotext to fit the XML parser, it would be > >> helpful for some kind people to develop a patch. > >> > >> Regards, > >> mpsuzuki > >> > >> On 11/14/2013 10:04 PM, Paweł Leń wrote: > >> Hello, > >> > >> I have error when running: > >> pdftotext -bbox -htmlmeta 'myfile.pdf' 'tempFile.xml' > >> > >> The output xml have <title> tag on the begining of document (meta > >> section), error appears when title contains "&" character. Title > >> field has no CDATA and it is not quoted so it causes error in my > >> xmllib parser. Can I (or You :) ) fix it somehow? > >> > >> Beast regards > >> > >> *-- > >> * > >> > >> *Paweł Leń* > >> > >> > >> > >> _________________________________________________ > >> poppler mailing list > >> [email protected] > >> <mailto:[email protected]> > >> http://lists.freedesktop.org/__mailman/listinfo/poppler > >> <http://lists.freedesktop.org/mailman/listinfo/poppler>> > > _______________________________________________ > > poppler mailing list > > [email protected] > > http://lists.freedesktop.org/mailman/listinfo/poppler _______________________________________________ poppler mailing list [email protected] http://lists.freedesktop.org/mailman/listinfo/poppler
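
For readers following the thread, here is a minimal sketch of the kind of escaping discussed above, i.e. replacing XML-reserved characters in a metadata string before it is written between <title> tags. The function name escapeXml is hypothetical and chosen only for this illustration; it is not poppler's myXmlTokenReplace(), and the actual committed patch may differ.

```cpp
#include <string>

// Hypothetical helper for illustration only (not poppler code):
// returns a copy of `in` with XML-reserved characters replaced by
// entity references, so the result is safe as element text or as
// a double-quoted attribute value.
std::string escapeXml(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '>':  out += "&gt;";   break;
        case '"':  out += "&quot;"; break;
        case '\'': out += "&#39;";  break;
        default:   out += c;        break;
        }
    }
    return out;
}
```

Applied to the title from the report, "Preface&Contents" would become "Preface&amp;Contents", which an XML parser should accept. Escaping is used here rather than a CDATA section because the same routine would also work for attribute values such as the content of the <meta> tags.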
