I'm glad to hear that the situation has improved.
At present, I'm unfamiliar with how other characters,
such as non-ASCII Latin characters, are embedded by popular
PDF production workflows and how they should be handled
in poppler. If you run into similar trouble in the future, please
post to this list!

Regards,
mpsuzuki

On 11/29/2013 11:58 PM, Paweł Leń wrote:
Hello :)

Everything works fine, thank You very much!

Best Regards

--
Paweł Leń



2013/11/15 suzuki toshiya <[email protected]>

    How about this?

    Regards,
    mpsuzuki


    On 11/15/2013 04:26 PM, suzuki toshiya wrote:

        I'm trying to fix this issue by inserting myXmlTokenReplace()
        into printInfoString().
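
To illustrate the idea, here is a minimal Python sketch of what a token-replacement helper of this kind might do before a metadata string is written out. The function name xml_token_replace and the entity table are assumptions made for illustration (the five predefined XML entities); this is not poppler's C++ code.

```python
# Hypothetical Python analogue of an XML token-replacement helper.
# The mapping below is an assumption (the five predefined XML entities),
# not a copy of poppler's C++ implementation.
XML_ENTITIES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&apos;",
}


def xml_token_replace(text: str) -> str:
    """Replace each XML-special character with its entity reference."""
    return "".join(XML_ENTITIES.get(ch, ch) for ch in text)


title = "Preface&Contents.doc"  # shortened stand-in for the reported title
print(f"<title>{xml_token_replace(title)}</title>")
# prints: <title>Preface&amp;Contents.doc</title>
```

Because each character is mapped exactly once, an "&" that is already part of the replacement text is never escaped a second time.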

        Regards,
        mpsuzuki

        On 11/14/2013 10:42 PM, Paweł Leń wrote:

            These are the contents of the file output.xml generated by the command
            pdftotext -bbox -htmlmeta 'myfile.pdf' 'output.xml':

            <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
            <head>
            <title>Microsoft Word - Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc</title>
            <meta name="Author" content="Teodora"/>
            <meta name="Creator" content="PScript5.dll Version 5.2.2"/>
            <meta name="Producer" content="Acrobat Distiller 8.0.0 (Windows)"/>
            <meta name="CreationDate" content=""/>
            </head>
            <body>
            <doc>
                <page width="482.000000" height="680.000000">
                  <word xMin="255.120000" yMin="190.576860" xMax="338.055540" yMax="207.269700">Advances</word>
                  <word xMin="344.000562" yMin="190.576860" xMax="359.331702" yMax="207.269700">in</word>
                  <word xMin="365.276724" yMin="190.576860" xMax="425.239584" yMax="207.269700">Lasers</word>
                  <word xMin="256.260624" yMin="207.256884" xMax="288.954240" yMax="223.949724">and</word>
                  <word xMin="294.884844" yMin="207.256884" xMax="363.168492" yMax="223.949724">Electro</word>
                  <word xMin="369.099096" yMin="207.256884" xMax="425.265216" yMax="223.949724">Optics</word>
                </page>
            </doc>
            </body>
            </html>


            As you can see, the <title> tag on line 3 contains an invalid character
            sequence with "&". The title is extracted from myfile.pdf. CDATA or some
            kind of htmlspecialchars-style escaping is needed.
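
The difference between the raw title and the two remedies mentioned here can be checked with a short Python sketch. It uses only the standard library (xml.etree.ElementTree and xml.sax.saxutils.escape); the helper is_well_formed and the shortened title are illustrative assumptions, not part of pdftotext.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_title = "Preface&Contents.doc"  # shortened stand-in for the real title


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


broken = f"<title>{raw_title}</title>"                   # bare "&", as reported
escaped = f"<title>{escape(raw_title)}</title>"          # "&" becomes "&amp;"
cdata = f"<title><![CDATA[{raw_title}]]></title>"        # title wrapped in CDATA

print(is_well_formed(broken))   # False
print(is_well_formed(escaped))  # True
print(is_well_formed(cdata))    # True
```

Both remedies give back the original title text after parsing. Escaping is the more robust of the two, since a CDATA section breaks if the title itself happens to contain "]]>".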




            --
            Paweł Leń



            2013/11/14 suzuki toshiya <[email protected]>

                 Hi,

                 If you could post a sample XML file, that is, pdftotext
                 output that you have modified by hand so that your XML
                 parser accepts it, it would help anyone willing to
                 develop a patch.

                 Regards,
                 mpsuzuki


                 On 11/14/2013 10:04 PM, Paweł Leń wrote:

                     Hello,

                     I get an error when running:
                     pdftotext -bbox -htmlmeta 'myfile.pdf' 'tempFile.xml'

                     The output XML has a <title> tag at the beginning of the document (meta
                     section), and the error appears when the title contains a "&" character.
                     The title field is neither wrapped in CDATA nor escaped, so it causes an
                     error in my xmllib parser. Can I (or you :) ) fix it somehow?
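
Until the output itself is fixed, one possible workaround on the reader's side is to escape stray ampersands before handing the file to the parser. The Python sketch below is only an assumption about how that could be done: the regular expression, the helper escape_bare_ampersands, and the sample string are all illustrative, and the approach does not handle a bare "<" in the title or text that merely looks like an entity (for example "AT&T;").

```python
import re
import xml.etree.ElementTree as ET

# Matches "&" that does not start a named or numeric character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(xml_text: str) -> str:
    """Turn every stray "&" into "&amp;", leaving existing references alone."""
    return BARE_AMP.sub("&amp;", xml_text)


# Illustrative input: one bare "&" in the title, one valid "&amp;" in an attribute.
sample = (
    "<head><title>Preface&Contents.doc</title>"
    '<meta name="Author" content="A &amp; B"/></head>'
)

fixed = escape_bare_ampersands(sample)
root = ET.fromstring(fixed)
print(root.find("title").text)  # Preface&Contents.doc
```

The negative lookahead is what keeps the already-valid "&amp;" in the attribute from being escaped twice.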

                     Best regards

                     --
                     Paweł Leń



                     _______________________________________________
                     poppler mailing list
                     [email protected]
                     http://lists.freedesktop.org/mailman/listinfo/poppler








