Re: [poppler] XML syntax error in PdfToText tool

suzuki toshiya Thu, 14 Nov 2013 23:26:58 -0800

I'm trying to fix this issue by an insertion of myXmlTokenReplace()
into printInfoString().
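
For illustration only, the replacement step could look roughly like the
Python sketch below. The name xml_token_replace is hypothetical and this is
not the poppler code (which is C++); it only shows the idea of substituting
the five XML-reserved characters, with "&" handled first so that the
entities produced by the later substitutions are not escaped a second time.

```python
# Hypothetical helper, not poppler's implementation: escape the five
# characters that are reserved in XML text and attribute values.
_REPLACEMENTS = [
    ("&", "&amp;"),   # first, so entities added below are not re-escaped
    ("<", "&lt;"),
    (">", "&gt;"),
    ('"', "&quot;"),
    ("'", "&apos;"),
]


def xml_token_replace(text: str) -> str:
    """Return text with XML-reserved characters replaced by entities."""
    for old, new in _REPLACEMENTS:
        text = text.replace(old, new)
    return text


title = "Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc"
print(xml_token_replace(title))
```

Applying such a helper to every info string (title and meta values) before
it is written should keep the generated header well-formed.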


Regards,
mpsuzuki

On 11/14/2013 10:42 PM, Paweł Leń wrote:

These are the contents of the file output.xml generated by the command
pdftotext -bbox -htmlmeta 'myfile.pdf' 'output.xml':

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Microsoft Word - Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc</title>
<meta name="Author" content="Teodora"/>
<meta name="Creator" content="PScript5.dll Version 5.2.2"/>
<meta name="Producer" content="Acrobat Distiller 8.0.0 (Windows)"/>
<meta name="CreationDate" content=""/>
</head>
<body>
<doc>
   <page width="482.000000" height="680.000000">
     <word xMin="255.120000" yMin="190.576860" xMax="338.055540" 
yMax="207.269700">Advances</word>
     <word xMin="344.000562" yMin="190.576860" xMax="359.331702" 
yMax="207.269700">in</word>
     <word xMin="365.276724" yMin="190.576860" xMax="425.239584" 
yMax="207.269700">Lasers</word>
     <word xMin="256.260624" yMin="207.256884" xMax="288.954240" 
yMax="223.949724">and</word>
     <word xMin="294.884844" yMin="207.256884" xMax="363.168492" 
yMax="223.949724">Electro</word>
     <word xMin="369.099096" yMin="207.256884" xMax="425.265216" 
yMax="223.949724">Optics</word>
   </page>
</doc>
</body>
</html>


As you can see, in line 3 the <title> tag contains an invalid character
sequence with "&". The title is extracted from myfile.pdf. CDATA or some kind
of htmlspecialchars-style escaping is needed.
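
To make the failure concrete, here is a small Python check (a sketch using
only the standard library, with a shortened stand-in for the real title).
The raw title is rejected by an XML parser, while the same title passed
through xml.sax.saxutils.escape parses and round-trips to the original text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_title = "Preface&Contents.doc"  # shortened stand-in for the real title


def parses(fragment: str) -> bool:
    """Return True if the fragment is well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


broken = "<title>" + raw_title + "</title>"
fixed = "<title>" + escape(raw_title) + "</title>"

print(parses(broken))  # the bare "&" makes this fragment not well-formed
print(parses(fixed))   # "&" written as "&amp;" is accepted
print(ET.fromstring(fixed).text == raw_title)  # parser decodes the entity
```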




--
Paweł Leń



2013/11/14 suzuki toshiya <[email protected]>

    Hi,

    If you could post a sample XML file that you modified the
    output of pdftotext to fit the XML parser, it would be
    helpful for some kind people to develop a patch.

    Regards,
    mpsuzuki


    On 11/14/2013 10:04 PM, Paweł Leń wrote:

        Hello,

        I get an error when running:
        pdftotext -bbox -htmlmeta 'myfile.pdf' 'tempFile.xml'

        The output XML has a <title> tag at the beginning of the document
        (meta section), and the error appears when the title contains an "&"
        character. The title field has no CDATA and it is not escaped, so it
        causes an error in my xmllib parser. Can I (or you :) ) fix it somehow?

        Best regards

        --
        Paweł Leń



        _______________________________________________
        poppler mailing list
        [email protected]
        http://lists.freedesktop.org/mailman/listinfo/poppler


