Hi, I have only recently started using PDFBox, primarily for extracting text from PDFs, so apologies if I use the wrong terminology or am not too clear on some points.
It works pretty well, giving as close to a perfect text stream as I could expect, with one minor problem: it would occasionally jumble up letters, even with sort by position enabled. I searched the mailing lists and found a few references to what I believe may be similar or related problems. I have been able to fix my problem with some minor patching of the code. It was only done as an experiment to test what I thought might be the cause, but I wanted to let you know what I found, and if there is interest in the changes I can send them on.

It looks like the PDFs I had trouble with include the same font (going by the descriptor) multiple times, but each instance has a different set of characters and different attributes such as the widths array, first char, last char, bounding box, cap height, StemV, etc. Having read the PDF spec, it seems such fonts should carry a prefix on the font name to mark them as subsets (a tag of six uppercase letters followed by a plus sign); these did not.

I first tried a simple fix: making the font lookup use the font resource rather than the name from the font cache. This didn't work out. So I opted instead for merging fonts whenever one was already found in the cache. The merge combines the widths arrays, preserving what was already there and adding any non-zero values to it, while aligning them on the first and last character values to create the superset. It also adjusts the first and last char. I added code to keep the maximum cap height and StemV values (this didn't appear to make any difference to my output).

This change corrected the text output in several places and introduced no additional errors. I figured I'd let you know that in some cases subsets are declared incorrectly but can, it seems, be combined nonetheless to give better results. If this proves useful and you'd like to see the code, let me know; I may get a chance to clean it up before sending it on.

Tony
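
P.S. In case it helps to picture the merge step, here is a rough standalone sketch of the idea. It is not the actual patch and does not use any PDFBox types: the class FontWidthMerger, the Metrics holder, and the merge method are all made-up names for illustration. It also assumes that "preserving what was already there" means a cached non-zero width always wins, and that an incoming width is only used where the cached one is zero or missing.

```java
/** Illustrative only: a hypothetical stand-in for the font metrics, not a PDFBox class. */
final class FontWidthMerger {

    /** Metrics of one font instance; widths[i] is the width of character code firstChar + i. */
    static final class Metrics {
        final int firstChar;
        final float[] widths;
        final float capHeight;
        final float stemV;

        Metrics(int firstChar, float[] widths, float capHeight, float stemV) {
            this.firstChar = firstChar;
            this.widths = widths.clone();
            this.capHeight = capHeight;
            this.stemV = stemV;
        }

        int lastChar() {
            return firstChar + widths.length - 1;
        }
    }

    /** Builds the superset of two instances of the same font without modifying either one. */
    static Metrics merge(Metrics cached, Metrics incoming) {
        // The merged range spans both character ranges.
        int first = Math.min(cached.firstChar, incoming.firstChar);
        int last = Math.max(cached.lastChar(), incoming.lastChar());
        float[] merged = new float[last - first + 1];

        // Start from the cached widths, shifted so they line up with the new first char.
        System.arraycopy(cached.widths, 0, merged,
                cached.firstChar - first, cached.widths.length);

        // Assumption: only fill slots that are still zero, and only with non-zero values.
        for (int i = 0; i < incoming.widths.length; i++) {
            int slot = incoming.firstChar - first + i;
            if (merged[slot] == 0f && incoming.widths[i] != 0f) {
                merged[slot] = incoming.widths[i];
            }
        }

        // Keep the larger cap height and StemV of the two instances.
        return new Metrics(first, merged,
                Math.max(cached.capHeight, incoming.capHeight),
                Math.max(cached.stemV, incoming.stemV));
    }
}
```

In the real code the merged result would replace the cached font entry, so later lookups by name see the combined widths.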