Hi Tony,

Subject: Font Subsets and Merging
Sent: Fri, 14 Aug 2009
From: Tony Scerri <tony.sce...@gmail.com>

> Hi
>
> I have only recently started using PDFBox, primarily for extracting text
> from PDFs, so apologies if I use the wrong terminology or am unclear on
> some points.
>
> It works pretty well, giving as close to a perfect text stream as I could
> expect, with one minor problem: it would occasionally jumble up letters,
> even with sort by position enabled. I searched the mailing lists and found
> a few references to what I believe may be similar or related problems. I
> have been able to fix my problem with some minor patching of the code.
> The patch was only an experiment to test what I thought might be the
> cause, but I wanted to let you know what I found. If there is interest in
> the changes, I can send them on.

Great! We are interested in every bugfix or improvement.

> It looks like the PDFs I had trouble with included the same font, based
> on the descriptor, multiple times, but each instance had a different set
> of characters and different attributes such as the widths array, first
> char, last char, bounding box, cap height, stemv etc. Having read the PDF
> spec, it seems such fonts should be defined with a prefix to the font name
> to indicate they are subsets; these, however, were not. I first tried a
> simple fix: making the font lookup in the font cache use the font resource
> rather than the font name. This didn't work out. So I then opted for
> merging fonts whenever the font was already found in the cache. Using the
> attributes mentioned above, it combines the widths arrays, preserving what
> was already there and adding any non-zero values to it, while aligning
> them on the first and last character values to create the superset. It
> also adjusts the first and last char. I added code to maintain the max
> capheight and stemv values (this didn't appear to make any difference to
> my output).
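
For readers following the thread, the merge described above can be sketched roughly as follows. This is only an illustration of the idea, not Tony's patch and not PDFBox API: the class `FontMergeSketch` and its nested `Metrics` holder are hypothetical stand-ins for the FirstChar, LastChar, Widths, CapHeight and StemV values. It also assumes one reading of "preserving what was already there": a cached width is kept whenever it is non-zero, and an incoming width is used only to fill a slot that is still zero.

```java
import java.util.Arrays;

public class FontMergeSketch {

    /** Hypothetical holder for the attributes named in the mail; not a PDFBox class. */
    static final class Metrics {
        final int firstChar;
        final int lastChar;
        final float[] widths;   // widths[i] belongs to character code firstChar + i
        final float capHeight;
        final float stemV;

        Metrics(int firstChar, int lastChar, float[] widths, float capHeight, float stemV) {
            if (widths.length != lastChar - firstChar + 1) {
                throw new IllegalArgumentException("widths must cover firstChar..lastChar");
            }
            this.firstChar = firstChar;
            this.lastChar = lastChar;
            this.widths = widths;
            this.capHeight = capHeight;
            this.stemV = stemV;
        }
    }

    /** Merges an incoming subset into the cached one, producing the superset. */
    static Metrics merge(Metrics cached, Metrics incoming) {
        // The superset spans the lowest first char to the highest last char.
        int first = Math.min(cached.firstChar, incoming.firstChar);
        int last = Math.max(cached.lastChar, incoming.lastChar);
        float[] widths = new float[last - first + 1];   // slots default to 0

        // Preserve what was already there, aligned on the new first char.
        System.arraycopy(cached.widths, 0, widths,
                cached.firstChar - first, cached.widths.length);

        // Add non-zero incoming widths, but only into slots that are still empty.
        for (int i = 0; i < incoming.widths.length; i++) {
            int slot = incoming.firstChar - first + i;
            if (widths[slot] == 0f && incoming.widths[i] != 0f) {
                widths[slot] = incoming.widths[i];
            }
        }

        // Keep the maximum cap height and stemV of the two instances.
        return new Metrics(first, last, widths,
                Math.max(cached.capHeight, incoming.capHeight),
                Math.max(cached.stemV, incoming.stemV));
    }

    public static void main(String[] args) {
        Metrics cached = new Metrics(65, 67, new float[] {600f, 0f, 700f}, 700f, 80f);
        Metrics incoming = new Metrics(66, 69, new float[] {500f, 999f, 0f, 650f}, 720f, 75f);
        Metrics merged = merge(cached, incoming);
        System.out.println(merged.firstChar + ".." + merged.lastChar
                + " " + Arrays.toString(merged.widths));
    }
}
```

In this example the cached width for code 67 is kept even though the incoming subset reports a different value for it; whether the real patch resolves such conflicts the same way is not stated in the mail.
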
>
> This change resulted in the text output being corrected in several places,
> with no additional errors being introduced. I figured I'd let you know that
> in some cases subsets are declared incorrectly but can, it seems, be
> combined nonetheless to give better results.
>
> So if this proves useful and you'd like to see the code, let me know. I may
> get a chance to clean it up before sending it on.

Please create an issue on JIRA [1] and attach your patch and, if possible, an
example to it.

If you have any further questions, don't hesitate to ask on the list.

Thanks in advance,
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX