Re: U+FDF2 Ligature and Font Problems

Brian Carrier Mon, 22 Sep 2008 14:05:56 -0700

Hi,

On Sep 20, 2008, at 5:41 PM, Jukka Zitting wrote:

Hi,

On Wed, Sep 17, 2008 at 11:38 PM, Brian Carrier
<[EMAIL PROTECTED]> wrote:
My current solution is a two-phase approach:
1) PDFStreamEngine.showString(): Has a hard-coded list of fonts that are known to use the 3-character form and replaces U+FDF2 with lam-lam-heh. For
other fonts, U+FDF2 remains.
2) PDFTextStripper.flushText(): Looks for the sequence of alef plus the U+FDF2 ligature. This should happen only if the U+FDF2 is supposed to be
replaced by only lam-lam-heh. So, we remove the extra alef. This is a
fallback for fonts that should have been caught in step 1, but that we do
not know about.
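
A minimal sketch of the two phases, written as standalone string helpers rather than as the actual PDFBox patch. The class name, the method names, and the font name "ExampleArabicFont" are all hypothetical and only stand in for the logic that would sit in showString() and flushText():

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class AllahLigatureSketch {
    static final String LIGATURE = "\uFDF2";               // ARABIC LIGATURE ALLAH ISOLATED FORM
    static final String ALEF = "\u0627";
    static final String LAM_LAM_HEH = "\u0644\u0644\u0647";

    // Hypothetical list: fonts assumed to draw U+FDF2 as lam-lam-heh only.
    static final Set<String> THREE_CHAR_FONTS =
            new HashSet<>(Arrays.asList("ExampleArabicFont"));

    // Phase 1: for known fonts, replace the ligature with lam-lam-heh.
    // For all other fonts the ligature is left untouched.
    static String expandForFont(String fontName, String text) {
        if (THREE_CHAR_FONTS.contains(fontName)) {
            return text.replace(LIGATURE, LAM_LAM_HEH);
        }
        return text;
    }

    // Phase 2: fallback for unknown fonts. An alef directly followed by the
    // ligature means the ligature stood for lam-lam-heh only, so the extra
    // alef is dropped and the ligature is kept.
    static String dropAlefBeforeLigature(String text) {
        return text.replace(ALEF + LIGATURE, LIGATURE);
    }

    public static void main(String[] args) {
        String raw = ALEF + LIGATURE;
        System.out.println(expandForFont("ExampleArabicFont", raw).length());
        System.out.println(dropAlefBeforeLigature(raw).length());
    }
}
```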

This works for our test files, but it does not seem like the cleanest
solution. Ideally, we should be looking at each font and querying it to see if it was going to replace the U+FDF2 with lam-lam-heh or alef-lam-lam-heh.
Anyone know if this is possible in PDFBox?
The information should be available since you _are_ getting the
lam-lam-heh sequence out of the font. For example, when a font is
loaded (or first used) we could automatically try to encode the U+FDF2
character and record the result somewhere.
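
The "record the result somewhere" idea could be sketched as a small per-font cache. This is only an illustration: how to actually ask a font what it produces for U+FDF2 is the open question in this thread, so the probe below is a hypothetical function supplied by the caller, not an existing PDFBox call, and the class name is made up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class LigatureExpansionCache {
    // Font name -> the character sequence that font uses for U+FDF2.
    private final Map<String, String> byFont = new ConcurrentHashMap<>();

    // Hypothetical probe: given a font name, returns that font's expansion.
    private final Function<String, String> probe;

    LigatureExpansionCache(Function<String, String> probe) {
        this.probe = probe;
    }

    // Runs the probe the first time a font is seen and reuses the result after.
    String expansionFor(String fontName) {
        return byFont.computeIfAbsent(fontName, probe);
    }

    public static void main(String[] args) {
        // Stand-in probe that always reports the 3-character form.
        LigatureExpansionCache cache =
                new LigatureExpansionCache(name -> "\u0644\u0644\u0647");
        System.out.println(cache.expansionFor("ExampleArabicFont").length());
    }
}
```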

Currently, the breaking up of U+FDF2 into the individual characters is done by applying the NFKC Unicode normalization (see the patch in PDFBOX-377). That process follows the Unicode spec, though, which not all fonts follow. I don't know whether PDFBox has any knowledge of whether a font encodes the ligature as three or four characters.
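
For reference, the NFKC behavior can be reproduced with the JDK's java.text.Normalizer. This is a standalone illustration, not the PDFBOX-377 patch itself, and the class name is made up. The Unicode compatibility decomposition of U+FDF2 is the four-character form alef-lam-lam-heh, which is why a font that already emits a separate alef ends up with one alef too many.

```java
import java.text.Normalizer;

final class NfkcLigatureDemo {
    // Applies NFKC compatibility normalization to the given text.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Print each code point that U+FDF2 decomposes to, one per line.
        for (char c : nfkc("\uFDF2").toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```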

Instead of modifying PDFStreamEngine and PDFTextStripper, would it
make more sense to encapsulate this logic already in PDFont.encode()?
This way the logic would be where it should be, i.e. associated with
the font.

It could be. I was going to make the change there, but PDFont.encode() returns single characters now and I wasn't sure if this would affect other code if I started to return strings. Before I researched the more global impacts and the exact place for my fix, I wanted to see if there was a better fix that used the font data.


thanks,
brian
