I'm glad to help! Sorry for the delay in responding.
At Kevin's request I created a small PDF with the LibreOffice PDF Export
feature (attached as 'example.pdf').
But when I ran the test code from the original message, something
strange happened: the 'Hello, world!' text is printed (with
Console.WriteLine()) shuffled. Debugging showed the following:

1) LibreOffice exports text in small chunks (~1-3 characters per chunk).
2) The 'Times New Roman' font used by LibreOffice has the name
'BAAAAA+TimesNewRomanPSMT', and BuiltinFonts14 does not contain this
font name. The font description also specifies no encoding, and the
Widths array has only a few values, for special characters. So in the
end I get zero widths for all characters. I also tried the 'Embed
standard fonts' option in 'File -> Export As Pdf', but that did not help.
3) As a result, PdfContentStreamProcessor.DisplayPdfString() does not
translate textMatrix, because renderInfo.GetUnscaledWidth() returns 0.
Combined with PdfContentStreamProcessor.ApplyTextAdjust(), which
slightly translates textMatrix after each text chunk, this shuffles
the small text chunks mentioned in 1).
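
To make the effect in 1)-3) concrete, here is a small standalone model
of it. It is only a sketch and not iText code: chunk_starts(),
extract(), the chunk split, the per-character width of 0.5 and the
adjust value of -0.02 are all made up for illustration. It tracks just
the x position of a one-line text matrix and then sorts chunks by
their start position, as a location-based extraction would.

```python
def chunk_starts(chunks, width_of, adjust):
    """Return (start_x, text) per chunk for a simplified one-line text matrix."""
    x = 0.0
    placed = []
    for text in chunks:
        placed.append((x, text))
        x += sum(width_of(c) for c in text)  # advance by the chunk width
        x += adjust                          # small shift applied after each chunk
    return placed


def extract(placed):
    """Join chunk texts in order of their start position."""
    return "".join(text for _, text in sorted(placed, key=lambda p: p[0]))


chunks = ["He", "llo", ", w", "orl", "d!"]   # assumed 1-3 character chunks
adjust = -0.02                               # assumed small negative adjustment

good = extract(chunk_starts(chunks, lambda c: 0.5, adjust))  # known widths
bad = extract(chunk_starts(chunks, lambda c: 0.0, adjust))   # zero widths
print(good)
print(bad)
```

With non-zero widths every chunk starts to the right of the previous
one, so the order survives. With zero widths only the adjustment moves
the position, so in this model each chunk starts slightly to the left
of the previous one and the sorted output comes out in the wrong order.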

The bugfix I suggested fixes the vertical shuffling: if one chunk is
located below another, they are sorted correctly because of the
TextMoveStartNextLine content operator (test 'example2.pdf' with and
without the bugfix). But if the real width of a chunk (the difference
between chunk startLocations) is less than the absolute value of the
ApplyTextAdjust() translation, then the chunks are shuffled.
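
The condition in the last sentence can be written as a simple check.
This is a sketch under my own simplification (one line of text, a
negative adjustment after each chunk); out_of_order_chunks() and the
sample numbers are hypothetical and not part of iText.

```python
def out_of_order_chunks(widths, adjusts):
    """Indices of chunks whose successor starts at or before their own start.

    widths[i]  -- real width of chunk i
    adjusts[i] -- translation applied after chunk i (negative moves left)
    """
    return [i for i, (w, a) in enumerate(zip(widths, adjusts)) if w + a <= 0]


# The second chunk is narrower than the adjustment that follows it.
suspects = out_of_order_chunks([1.0, 0.01, 0.5], [-0.02, -0.02, -0.02])
print(suspects)
```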

The PDF document that I tested in the original message consists of
chunks one or a few words long, so there is no horizontal shuffling.
It is also not well structured, so SimpleTextExtractionStrategy does
not give an acceptable result (but that strategy works well on
'example.pdf'). Unfortunately I cannot attach it because it is a real
document with personal information. I'll try to make a similar
document later (I tried to do this with LibreOffice but failed because
of the small text chunks).

So here is another suggestion:
if a PDF document uses a font whose encoding is unknown and the font is
not in BuiltinFonts14, then DocumentFont.stdEnc is used as the
encoding. So why not use some default widths instead of zero widths?
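
A rough sketch of what I mean by default widths, written as a
standalone model rather than a patch. DEFAULT_WIDTH, glyph_width() and
unscaled_width() are hypothetical names, the value 500 is an arbitrary
guess, and treating a zero entry as "missing" is my assumption (a font
may legitimately have zero-width glyphs).

```python
DEFAULT_WIDTH = 500  # hypothetical fallback, in 1/1000 text-space units


def glyph_width(code, first_char, widths, default=DEFAULT_WIDTH):
    """Look up a glyph width, falling back to a default when it is missing or zero."""
    index = code - first_char
    if 0 <= index < len(widths) and widths[index] > 0:
        return widths[index]
    return default


def unscaled_width(text, first_char, widths):
    """Sum of glyph widths for a string, in text-space units."""
    return sum(glyph_width(ord(c), first_char, widths) for c in text) / 1000.0


# A font that only declares a width for the space character (code 32).
print(unscaled_width(" ", 32, [250]))
print(unscaled_width("Hi", 32, [250]))
```

With a fallback like this the text matrix would still advance after
each chunk, so the chunk order would at least stay left to right even
if the positions are not exact.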

Sorry for the long message. Most of these problems are not yours at
all, but I wanted to describe my research. Maybe it will be useful.

Attachment: example.pdf
Description: Adobe PDF document

Attachment: example2.pdf
Description: Adobe PDF document
