On 2/5/20, Albert Astals Cid <aa...@kde.org> wrote:
> On Wednesday, 5 February 2020, at 12:20:10 CET, Albretch Mueller
> wrote:
>> pdftotext has the option
>>
>> -layout : maintain original physical layout
>>
>> but pdftohtml doesn't
>
> pdftotext and pdftohtml use different code/algorithms
That explains it. Thank you. I thought I was missing something.

> you'd have to see if
> one can be adapted/improved for the other.

Well, yes. Definitely the way to go. You will have to "go monkey" and
employ a bit of heuristics to make pdfto* dance well for you. If you
know that most documents will be of the multi-column kind:

1) run pdftotext with and without -layout
2) do some line-by-line analysis of both results
3) run pdftohtml
4) do some line-by-line algorithmic consolidation of all three texts,
   based on steps 1, 2 and 3

That should do it!

I will post the link to the code here once I am done with it.

lbrtchx
_______________________________________________
poppler mailing list
poppler@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/poppler
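P.S. A rough sketch of how steps 1 to 4 might be wired together in Python, in case it helps anyone following the thread. This is not the code promised above, only an illustration. The helper names (extract, split_columns, multicolumn_ratio, html_lines, consolidate), the three-space column-gap threshold and the 0.5 cut-off are all made up for the example, and the consolidation rule (take the non-layout text as reading order when the page looks multi-column, then flag lines that pdftohtml did not also produce) is just one possible heuristic.

```python
import re
import subprocess
from html.parser import HTMLParser

GAP = re.compile(r"\s{3,}")  # assumed threshold: 3+ spaces means a column gap


def run(cmd):
    """Run a poppler tool and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def extract(pdf):
    """Steps 1 and 3: collect the three renderings of one PDF."""
    return {
        "plain": run(["pdftotext", pdf, "-"]),
        "layout": run(["pdftotext", "-layout", pdf, "-"]),
        "html": run(["pdftohtml", "-stdout", "-i", "-noframes", pdf]),
    }


def split_columns(line):
    """Split one -layout line into cells at wide runs of whitespace."""
    return [cell for cell in GAP.split(line.strip()) if cell]


def multicolumn_ratio(layout_text):
    """Step 2: fraction of non-blank -layout lines holding 2+ cells."""
    lines = [l for l in layout_text.splitlines() if l.strip()]
    if not lines:
        return 0.0
    return sum(len(split_columns(l)) > 1 for l in lines) / len(lines)


class _TextOnly(HTMLParser):
    """Collect text chunks (this also picks up <style>/<title> content)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(" ".join(data.split()))


def html_lines(html):
    """Return the whitespace-normalised text chunks of an HTML string."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return parser.chunks


def consolidate(plain, layout, html):
    """Step 4 (hypothetical rule): choose a reading order, then mark
    each line with whether pdftohtml produced the same text."""
    multi = multicolumn_ratio(layout) > 0.5  # assumed cut-off
    source = plain if multi else layout
    seen = set(html_lines(html))
    out = []
    for line in source.splitlines():
        text = " ".join(line.split())
        if text:
            out.append((text, text in seen))
    return out
```

Usage would be something like `texts = extract("paper.pdf")` followed by `consolidate(texts["plain"], texts["layout"], texts["html"])`, which needs the poppler tools on the PATH. The lines marked False are the ones worth a second look by hand.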