Hello, first of all many thanks for your excellent work.
I want to extract the text from a document or a PDF page. The text order should match the order in which a reader would read it. This task becomes difficult for multi-column documents and for tables.

Since I want to format the paragraphs, I cannot use makeWordList. Instead, I would go through TextFlow, TextBlock, lines and words. However, I cannot obtain the right order for a complex document such as:

http://doc.rero.ch/lm.php?url=1000,43,2,20101130144841-EO/mue_dmc.pdf

Do you have any strategies for re-ordering the blocks? Does the file contain information about the correct sequence? Since acroread, evince, and Apple Preview each behave differently, I conclude that this is not trivial. Am I right?

Many thanks in advance.

----------------------------------------------------------------------
Johnny Mariéthoz
RERO, Av. de la Gare 45, CH - 1920 MARTIGNY
Phone: +41(0)27 721 8579
Fax : +41(0)27 721 8586
Web : http://www.rero.ch
ReroDoc : http://doc.rero.ch, [email protected]
----------------------------------------------------------------------
