I'm glad to help! Sorry for the delay in responding. At Kevin's request I created a small PDF with LibreOffice's PDF export feature (attached as 'example.pdf'). But when I ran the test code from the original message, something strange happened: the 'Hello, world!' text, printed with Console.WriteLine(), came out shuffled. Debugging showed the following:
1) LibreOffice exports text in small chunks (about 1-3 characters per chunk).

2) The 'Times New Roman' font used by LibreOffice is named 'BAAAAA+TimesNewRomanPSMT', and BuiltinFonts14 does not contain this font name. The font description also specifies no encoding, and the Widths array has only a few values, for special characters. So in the end I get zero widths for all characters. I also tried the 'Embed standard fonts' option in 'File -> Export As Pdf', but that did not help.

3) As a result, PdfContentStreamProcessor.DisplayPdfString() does not translate textMatrix, because renderInfo.GetUnscaledWidth() returns 0. Combined with PdfContentStreamProcessor.ApplyTextAdjust(), which slightly translates textMatrix after each text chunk, this shuffles the small text chunks mentioned in 1).

The bugfix I suggested fixes the vertical shuffling: if one chunk is located below another, they are sorted correctly because of the TextMoveStartNextLine content operator (test 'example2.pdf' with and without the bugfix). But if the real width of the chunks (the difference between chunk startLocations) is less than the absolute value of the ApplyTextAdjust() translation, the chunks are still shuffled.

The PDF document I tested in the original message consists of chunks that are one or a few words long, so there is no horizontal shuffling there. It is also not well structured, so SimpleTextExtractionStrategy does not give an acceptable result (although that strategy works well on 'example.pdf'). Unfortunately I cannot attach it because it is a real document with personal information. I'll try to make a similar document later (I tried to do this with LibreOffice but failed because of the small text chunks).

So here is another suggestion: if a PDF document uses a font whose encoding is unknown and which is not in BuiltinFonts14, DocumentFont.stdEnc is used as the encoding. So why not use some default widths instead of zero widths as well?

Sorry for the long message. Most of these problems are not yours at all, but I wanted to describe my research. Maybe it will be useful.
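
To make the mechanism easier to follow, here is a small library-free model of it in Python. It is not iText code, only a sketch under assumptions: the names place_chunks, extract_by_position and DEFAULT_WIDTH are hypothetical, the TJ-style adjustment values are made up, the value 500 for a fallback width is an arbitrary choice, and the extraction step is assumed to order chunks purely by their start position on the line.

```python
# Simplified model of the text-matrix bookkeeping described above.
# This is NOT iText code; every name here is hypothetical.

DEFAULT_WIDTH = 500  # assumed fallback glyph width, in 1/1000 text-space units


def place_chunks(tj_items, width_of, font_size=12.0):
    """Return a list of (start_x, text) for each string in a TJ-style array."""
    x = 0.0
    placed = []
    for item in tj_items:
        if isinstance(item, str):
            placed.append((x, item))
            # Advance by the chunk width (the role of DisplayPdfString).
            x += sum(width_of(ch) for ch in item) / 1000.0 * font_size
        else:
            # Numeric adjustment (the role of ApplyTextAdjust):
            # a positive value moves the next chunk to the left.
            x -= item / 1000.0 * font_size
    return placed


def extract_by_position(placed):
    """Join chunks ordered by start position (assumed sorting behaviour)."""
    return "".join(text for _, text in sorted(placed, key=lambda p: p[0]))


# Made-up chunking of 'Hello, world!' with small adjustments between chunks.
tj = ["He", 20, "ll", -5, "o,", 15, " w", 10, "or", -8, "ld", 12, "!"]

zero_widths = extract_by_position(place_chunks(tj, lambda ch: 0))
default_widths = extract_by_position(place_chunks(tj, lambda ch: DEFAULT_WIDTH))

print(repr(zero_widths))     # chunks come out in the wrong order
print(repr(default_widths))  # chunks keep their reading order
```

With zero widths the only thing that moves the start position is the adjustment, so the sort order depends on the adjustments alone. With any fallback width much larger than the adjustments, each chunk starts to the right of the previous one and the order is preserved.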
example.pdf
Description: Adobe PDF document
example2.pdf
Description: Adobe PDF document
_______________________________________________ iText-questions mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/itext-questions iText(R) is a registered trademark of 1T3XT BVBA. Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/ Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php
