Hi Tony,

Subject: Font Subsets and Merging
Sent: Fri, 14 Aug 2009
From: Tony Scerri<tony.sce...@gmail.com>

> Hi
> 
> I have only recently started using PDFBox, primarily for extracting text
> from PDFs, so apologies if I use the wrong terminology or am unclear on
> some points.
> 
> It works pretty well, giving as close to a perfect text stream as I could
> expect, with one minor problem: it would occasionally jumble up letters,
> even with sort by position enabled. I searched the mailing lists and found
> a few references to what I believe may be similar or related problems. I
> have been able to fix my problem with some minor patching of the code.
> The patch was only an experiment to test what I thought might be the
> cause, but I wanted to let you know what I found. If there is interest in
> the changes, I can send them on.
Great! We are interested in every bugfix or improvement.

> It looks like the PDFs I had trouble with include the same font (judging
> by the descriptor) multiple times, but each instance has a different set
> of characters and different attributes such as the widths array, first
> char, last char, bounding box, cap height, stemv etc. Having read the PDF
> spec, it seems such fonts should carry a prefix on the font name to mark
> them as subsets; these, however, did not. I first tried a simple fix:
> basing the font lookup in the font cache on the font resource rather than
> on the font name. This didn't work out. So I opted instead for merging a
> font into the cached one whenever it was already found in the cache. Using
> the attributes mentioned above, the patch combines the widths arrays,
> preserving what was already there and adding any non-zero values, while
> aligning them on the first and last character values to create the
> superset. It also adjusts the first and last char. I added code to keep
> the maximum capheight and stemv values (this didn't appear to make any
> difference to my output).
> 
> This change corrected the text output in several places without
> introducing any additional errors. I figured I'd let you know that in some
> cases subsets are declared incorrectly but can, it seems, be combined
> nonetheless to give better results.
> If this proves useful and you'd like to see the code, let me know; I may
> get a chance to clean it up before sending it on.
Please create an issue in JIRA [1] and attach your patch and, if possible,
an example to it.
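
For anyone following the thread, the width merge described above might look roughly like the sketch below. It is not the actual patch and does not use any PDFBox classes: the FontWidthMerge and Metrics names are hypothetical, and the rule "keep the cached width, take the incoming one only where the cached slot is zero" is one reading of the description. The capheight/stemv handling is left out.

```java
import java.util.Arrays;

public class FontWidthMerge {

    /** Hypothetical holder for the per-font values mentioned in the mail. */
    static final class Metrics {
        final int firstChar;
        final int lastChar;
        final float[] widths; // widths[i] belongs to character code firstChar + i

        Metrics(int firstChar, int lastChar, float[] widths) {
            if (widths.length != lastChar - firstChar + 1) {
                throw new IllegalArgumentException("widths length must be lastChar - firstChar + 1");
            }
            this.firstChar = firstChar;
            this.lastChar = lastChar;
            this.widths = widths;
        }
    }

    /** Merges an incoming subset into the cached one, producing the superset. */
    static Metrics merge(Metrics cached, Metrics incoming) {
        // The superset spans the smallest first char to the largest last char.
        int first = Math.min(cached.firstChar, incoming.firstChar);
        int last = Math.max(cached.lastChar, incoming.lastChar);
        float[] merged = new float[last - first + 1];

        // Preserve what was already there, shifted to the new first char.
        System.arraycopy(cached.widths, 0, merged,
                cached.firstChar - first, cached.widths.length);

        // Add non-zero incoming widths only where no width is known yet.
        for (int i = 0; i < incoming.widths.length; i++) {
            int slot = incoming.firstChar - first + i;
            if (merged[slot] == 0f && incoming.widths[i] != 0f) {
                merged[slot] = incoming.widths[i];
            }
        }
        return new Metrics(first, last, merged);
    }

    public static void main(String[] args) {
        // Made-up widths for two overlapping subsets of the same font.
        Metrics a = new Metrics(65, 67, new float[] {722f, 0f, 667f});
        Metrics b = new Metrics(66, 69, new float[] {611f, 700f, 0f, 556f});
        Metrics m = merge(a, b);
        System.out.println(m.firstChar + ".." + m.lastChar + " " + Arrays.toString(m.widths));
        // prints: 65..69 [722.0, 611.0, 667.0, 0.0, 556.0]
    }
}
```

In the example, the cached width for code 67 (667) wins over the incoming 700, and code 68 stays at zero because neither subset defines it.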

If you have any further questions, don't hesitate to ask on the list.

Thanks in advance,
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX
