U+FDF2 Ligature and Font Problems

Brian Carrier Wed, 17 Sep 2008 14:39:37 -0700

We have found some problems with PDFBox and its handling of the U+FDF2 ligature. We have a fix, but I'm e-mailing first to see if someone knows of a more elegant solution that analyzes the font data.


Background:

The Unicode U+FDF2 Arabic ligature is defined in the Unicode spec to be four characters (alef-lam-lam-heh).

http://www.fileformat.info/info/unicode/char/fdf2/index.htm
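
For reference, the four-character expansion can be checked against the Unicode compatibility decomposition with the JDK's java.text.Normalizer. This is only a small sketch to illustrate the mapping; it is not PDFBox code, and the class and method names are made up for the example.

```java
import java.text.Normalizer;

class Fdf2Decomposition {
    // U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM
    static final String LIGATURE = "\uFDF2";

    // Returns the compatibility decomposition (NFKD) of the input string.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        // Print each code point of the decomposed ligature, one per line.
        for (char c : decompose(LIGATURE).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Running it should list alef, lam, lam, heh (U+0627, U+0644, U+0644, U+0647), which is the four-character form described above.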

However, some fonts choose to use the ligature for only the lam-lam-heh and do not include the alef.

http://encyclopedie-en.snyke.com/articles/arabic_alphabet.html#Ligatures

When those fonts are used in a PDF file, an independent alef is added so that alef-U+FDF2 is stored.



Problem:

Currently, PDFBox will produce alef-alef-lam-lam-heh for fonts that consider U+FDF2 to be lam-lam-heh, because it follows the PDF spec and does not know which fonts break the ligature up into 3 versus 4 characters.



My current solution is a two-phase approach:

1) PDFStreamEngine.showString(): Has a hard-coded list of fonts that are known to use the 3-character form and replaces U+FDF2 with lam-lam-heh. For other fonts, U+FDF2 remains.

2) PDFTextStripper.flushText(): Looks for the sequence of alef plus the U+FDF2 ligature. This should happen only if the U+FDF2 is supposed to be replaced by only lam-lam-heh, so we remove the extra alef. This is a fallback for fonts that should have been caught in step 1 but that we do not know about.
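
To make the two phases concrete, here is a rough, standalone sketch of the string handling only. It does not use the PDFBox API: the class, the method names, and the font names in the list are placeholders invented for illustration, and the real changes live inside showString() and flushText().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class AllahLigatureFix {
    static final String ALEF = "\u0627";
    static final String LIGATURE = "\uFDF2";
    static final String LAM_LAM_HEH = "\u0644\u0644\u0647";

    // Placeholder font names; the actual list of affected fonts is not shown here.
    static final Set<String> THREE_CHAR_FONTS =
            new HashSet<>(Arrays.asList("ExampleFontA", "ExampleFontB"));

    // Phase 1: for fonts known to use the 3-character form, replace the
    // ligature with lam-lam-heh. Text from other fonts is left unchanged.
    static String expandForFont(String fontName, String text) {
        if (THREE_CHAR_FONTS.contains(fontName)) {
            return text.replace(LIGATURE, LAM_LAM_HEH);
        }
        return text;
    }

    // Phase 2 (fallback): an alef directly before the ligature means the font
    // used the 3-character form but was not in the list, so drop that alef.
    static String dropExtraAlef(String text) {
        return text.replace(ALEF + LIGATURE, LIGATURE);
    }
}
```

With this sketch, a listed font yields alef-lam-lam-heh after phase 1, while an unlisted font that stored alef-U+FDF2 ends up as a bare U+FDF2 after phase 2. Either way the text expands to a single alef.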

This works for our test files, but it does not seem like the cleanest solution. Ideally, we should be looking at each font and querying it to see whether it is going to replace the U+FDF2 with lam-lam-heh or alef-lam-lam-heh. Does anyone know if this is possible in PDFBox?

Adobe Reader is able to figure it out when we copy and paste the text from the PDF file.


thanks,
brian
