On 08/06/2008, at 10:29 AM, Ross Moore wrote:

> With the examples that I have tried, the best results are
> obtained using:   pdftotext -raw
>
> For example, on a slightly extended version of the PDF
> from my previous posting, using  -raw  gives (correctly):
>
>       für Löwen und Agnés
>
> whereas not using -raw  gives either:
>
>      fur Lowen und Agnes
>                        or
>      fur Lowen und Agnes

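The bare accent characters were stripped from the lines above
by the email. Here is a different representation, with each
accent shown as its UTF-8 bytes (<CC><88> is U+0308 COMBINING
DIAERESIS, <CC><81> is U+0301 COMBINING ACUTE ACCENT):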

      fur Lowen und Agnes
       <CC><88> <CC><88>                 <CC><81>

                       or

     fur Lowen und Agnes <CC><88> <CC><88> <CC><81>

>
> according to whether  -layout  is used, or not.
> ( -raw  seems to override  -layout  so there
> is no need to look at 4 separate cases.)

> There could be a switch to tell  pdftotext  to swap the order
> of the accent character and the letter; but this isn't sufficiently
> general to cope with all cases. For example, TeX has traditionally
> placed over-accents before the letter, but under-accents after it.
> And what about having multiple diacritic marks on the same letter?
>
> Also, the "dot under" and "underbar" accents are produced by
> placing the same character as used for "dot above" and "macron"
> diacritics, but positioned below the letter.
>
> Thus there are several issues that need to be handled to get the
> "correct" text extraction from such PDFs.

That is, both the layout and the original stream order
must be considered, perhaps also using extra knowledge
of how the PDF was generated.


Hope this helps,

        Ross

------------------------------------------------------------------------
Ross Moore                                       [EMAIL PROTECTED]
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------



_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler