Re: [poppler] line brakes and layout for multi-column texts ...

Albretch Mueller Thu, 06 Feb 2020 03:20:20 -0800

On 2/5/20, Albert Astals Cid <aa...@kde.org> wrote:
> El dimecres, 5 de febrer de 2020, a les 12:20:10 CET, Albretch Mueller va
> escriure:
>>  pdftotext has the option
>>
>> -layout              : maintain original physical layout
>>
>>  but pdftohtml doesn't
>
> pdftotext and pdftohtml use different code/algorithms


 That explains it, thank you. I thought I was missing something.

> you'd have to see if
> one can be adapted/improved for the other.

 Well, yes, that is definitely the way to go. You will have to "go monkey"
and employ a few heuristics to make the pdfto* tools do what you need. If
you know that most documents will be of the multi-column kind:

 1) run pdftotext both with and without -layout
 2) do a line-by-line analysis of the two results
 3) run pdftohtml
 4) do a line-by-line algorithmic consolidation of all three texts,
based on the results of steps 1) to 3)

 that should do it!
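
 To illustrate steps 1) and 2), here is a minimal Python sketch. It is only
one possible heuristic, not a finished tool. The helper names
(run_pdftotext, split_columns, classify_layout_lines) and the "three or
more spaces means a column gutter" rule are assumptions made for this
example. It relies on pdftotext being on the PATH and on "-" as the output
file name to send the text to stdout.

```python
import re
import subprocess

# Assumption: in -layout output, a run of 3+ spaces separates two columns.
GAP = re.compile(r"\s{3,}")


def run_pdftotext(pdf_path, layout=False):
    """Step 1: return the text of a PDF, with or without -layout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]  # "-" writes the extracted text to stdout
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout


def split_columns(layout_line):
    """Split one -layout line into candidate column segments."""
    return [seg for seg in GAP.split(layout_line.strip()) if seg]


def classify_layout_lines(layout_text, plain_text):
    """Step 2: flag -layout lines that look like side-by-side columns.

    A line is flagged when it splits into several segments and every
    segment also occurs as a line of its own in the plain output.
    """
    plain_lines = {ln.strip() for ln in plain_text.splitlines() if ln.strip()}
    result = []
    for line in layout_text.splitlines():
        segments = split_columns(line)
        if not segments:
            continue  # skip blank lines
        multi = len(segments) > 1 and all(s in plain_lines for s in segments)
        result.append((multi, segments))
    return result


if __name__ == "__main__":
    import sys

    pdf = sys.argv[1]
    rows = classify_layout_lines(run_pdftotext(pdf, layout=True),
                                 run_pdftotext(pdf))
    for multi, segments in rows:
        print("COLS" if multi else "LINE", segments)
```

 The flagged rows are the places where the two outputs disagree about
reading order, so they are where the consolidation in step 4) would have
to decide which one to trust.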

 I will post the link to the code here once I am done with it

 lbrtchx
_______________________________________________
poppler mailing list
poppler@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/poppler
