DO NOT REPLY [Bug 32789] [PATCH] Arabic Shaping not Supported by FOP

bugzilla Thu, 11 Feb 2010 17:11:12 -0800

https://issues.apache.org/bugzilla/show_bug.cgi?id=32789


--- Comment #12 from Jonathan Levinson <levin...@intersystems.com> 2010-02-12 01:10:43 UTC ---
Hi Vincent,

Before committing the work I did on Arabic to the trunk, the Apache FOP
organization seems to want five things:

1)    Modify the ICU4J change to check whether the ICU4J classes are available
and, if they are not, skip the calls to them

2)    Provide Apache organization with performance data to assess performance
cost of Arabic Shaping classes

3)    Provide Apache organization with better examples of use of Arabic

4)    Move Arabic form shaping and BIDI algorithm to layout manager

5)    Not use ICU4J to do UNICODE transformation but use the standard ligature
mechanism to provide contextual glyphs.  This is a request for a complete
rewrite of the patch to use a mechanism that isn't known to me currently, but
maybe could become known if I had the right pointers.

(4) is highly non-trivial.  I haven’t a clue as to how to do (5).
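
For (1), a minimal sketch of the kind of guard that could be used, assuming the
shaping entry point is com.ibm.icu.text.ArabicShaping (the ICU4J class name).
The ShapingGuard class and its method names are hypothetical and are not part
of the patch or of FOP.  The range test is included because it is the cheap
check that keeps non-Arabic text away from the shaping code entirely.

```java
final class ShapingGuard {
    // Fully qualified name of the ICU4J shaping class we would like to call.
    private static final String ICU_SHAPING_CLASS = "com.ibm.icu.text.ArabicShaping";

    private ShapingGuard() { }

    /** Returns true if the named class can be loaded without initializing it. */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, ShapingGuard.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns true if the ICU4J jar appears to be on the classpath. */
    static boolean isIcu4jAvailable() {
        return isClassAvailable(ICU_SHAPING_CLASS);
    }

    /** Returns true if any character lies in the Arabic block U+0600..U+06FF. */
    static boolean containsArabic(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x0600 && c <= 0x06FF) {
                return true;
            }
        }
        return false;
    }

    /** Shaping is attempted only for Arabic text and only when ICU4J is present. */
    static boolean shouldShape(CharSequence text) {
        return containsArabic(text) && isIcu4jAvailable();
    }
}
```

A painter would call shouldShape(text) first and fall back to the existing code
path when it returns false.  In a real change the result of isIcu4jAvailable()
would probably be cached rather than looked up for every string.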

For (4) could you point me at the source code files in the layout manager that
would have to be changed?  Can you give me some pointers as to where this sort
of information is processed by the layout manager?  I've read the layout
manager code and tried to locate where it processes the width of characters and
what would have to change to have right-to-left printing, but I've been unable
to see the forest for the trees.  I have read Knuth's algorithm for line
breaking and I think I have a good understanding of what a KnuthElement is -
glue, penalties and the basics of Knuth's algorithm, but I'm having trouble
converting this theoretical understanding into a practical understanding of
what has to change in the code to move the printing from right to left.
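
To make concrete what the BIDI step has to hand to the layout stage, here is a
small sketch using java.text.Bidi from the JDK (not ICU4J, and not FOP code).
It lists the directional runs of a string in logical order with their embedding
levels; odd levels are right-to-left.  The BidiRuns class and its method names
are hypothetical.

```java
import java.text.Bidi;

final class BidiRuns {
    private BidiRuns() { }

    /** Returns true if the text contains anything that needs bidi processing. */
    static boolean needsReordering(String text) {
        return Bidi.requiresBidi(text.toCharArray(), 0, text.length());
    }

    /** Describes each run as "start-limit level N", one run per line. */
    static String describeRuns(String text) {
        // The base direction is taken from the first strong character,
        // defaulting to left-to-right when there is none.
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            sb.append(bidi.getRunStart(i)).append('-').append(bidi.getRunLimit(i))
              .append(" level ").append(bidi.getRunLevel(i)).append('\n');
        }
        return sb.toString();
    }
}
```

Embedded numerals inside Arabic text show up as a separate run with its own
level, which is why they keep their left-to-right order.  Presumably the layout
manager would need run information of this kind before it measures widths and
chooses line breaks.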

I’m not sure what to do about (5).  Do you have any references or pointers to
an algorithm that goes beyond UNICODE transformation and selects contextual
glyphs based on the glyphs in a font?  How do I tell the
characteristics of an Arabic character in a font, whether it is in initial,
medial or final position?  I suppose this information would vary from
font to font.  Where in FOP is font information like this processed and how do
I “tell” a font I want the Arabic character at UNICODE position X but I want
from the font that the character be in final position?  Does the layout manager
actually process the font information about a character?  I suppose it must to
know character widths, which are necessary for Knuth's algorithm, but please
forgive me, I don't see where this code lives.  FOP has over 11,000 files!

I used ICU4J to avoid having to write a ton of code.  That is why my patch is
so small.

I'm not complaining.  I'm hoping I can get some more pointers to what changes
need to be made to support Arabic and where the changes have to go.  Even if
I'm not the one who eventually does the work, whoever eventually implements
right-to-left printing and Arabic support will certainly find our discussion
valuable.  I'm sure you'll agree that FOP needs to become truly international
at some point. That would really open a new community of users to the benefits
of FOP, which are considerable.

In fact, I agree that it is hard to see how there can be a robust solution to
the problem of printing Arabic text that simply involves the PDF renderer;
theoretically and probably practically the layout manager has to be involved.

I’ve looked at the FOP SVG rendering code which tries to do Arabic form shaping
and it seems to be just doing UNICODE transformations.  It doesn’t seem to be
responding to the ability of a font that you are discussing, to display a
single UNICODE code in many different forms.  It seems to be just doing a
simple table look up that transforms a UNICODE code.  So you already have code
in FOP, in your SVG renderer, that seems to do the same thing I tried to do
using ICU4J.  This doesn't mean the code I wrote using ICU4J is doing the right
thing, but it does mean that simply transforming one UNICODE code to another is
the simplest first step in solving this difficult problem.
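
As an illustration of what "simple table look up" means here, the following is
a deliberately naive sketch.  It is not the FOP SVG code and not ICU4J; the
NaiveArabicShaper class is hypothetical and covers only five letters.  It maps
each letter to its isolated, final, initial or medial code point in Arabic
Presentation Forms-B (U+FE70..U+FEFF) by looking at its neighbours.  It ignores
ligatures such as lam-alef, diacritics and everything a font's own substitution
tables could offer.

```java
import java.util.HashMap;
import java.util.Map;

final class NaiveArabicShaper {
    // letter -> { code point of isolated form, number of forms }
    // 4 forms = joins on both sides; 2 forms = joins only to the previous letter.
    // Forms are stored consecutively: isolated, final, initial, medial.
    private static final Map<Character, int[]> FORMS = new HashMap<>();
    static {
        FORMS.put('\u0627', new int[] {0xFE8D, 2}); // ALEF
        FORMS.put('\u0628', new int[] {0xFE8F, 4}); // BEH
        FORMS.put('\u0633', new int[] {0xFEB1, 4}); // SEEN
        FORMS.put('\u0644', new int[] {0xFEDD, 4}); // LAM
        FORMS.put('\u0645', new int[] {0xFEE1, 4}); // MEEM
    }

    private NaiveArabicShaper() { }

    private static boolean isDualJoining(char c) {
        int[] entry = FORMS.get(c);
        return entry != null && entry[1] == 4;
    }

    /** Replaces known letters with presentation forms; other characters pass through. */
    static String shape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int[] entry = FORMS.get(c);
            if (entry == null) {
                out.append(c);
                continue;
            }
            // A letter connects backwards only if the previous letter joins forwards.
            boolean joinsPrev = i > 0 && isDualJoining(text.charAt(i - 1));
            // A letter connects forwards only if it is dual-joining and a letter follows.
            boolean joinsNext = entry[1] == 4 && i + 1 < text.length()
                    && FORMS.containsKey(text.charAt(i + 1));
            int offset = joinsPrev ? (joinsNext ? 3 : 1) : (joinsNext ? 2 : 0);
            out.append((char) (entry[0] + offset));
        }
        return out.toString();
    }
}
```

The output still depends on the font actually containing glyphs for the
presentation-form code points, which is the limitation of any approach that
only rewrites UNICODE codes.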

Could we agree that we could live with an ICU4J approach if (1), (2), (3), and
(4) were met as conditions, and that (5), a complete rewrite using modern font
techniques, could be deferred?  Of course, I'm interested in learning how I
could achieve (5); I'm not dismissing (5), I'm just looking for a bottom line
that would allow FOP to practically meet the needs of rendering Arabic text,
even if the result isn't perfect yet.

Best Regards,
Jonathan

(In reply to comment #11)
> Hi Jonathan,
> 
> (In reply to comment #8)
> > Hi Vincent,
> > 
> > I will attach the .fo file I've been using for testing.  I will also attach
> > the generated pdf.  This is from an example our Dubai team gave me for my
> > own testing as I developed the code.
> 
> Well... It's a bit light for an example. Just a single word...
> 
> 
> > Our Dubai team has been testing with a large variety of Arabic script - but
> > they are using a report creation tool that invokes fop.bat with xsl input so
> > the .fo file isn't part of their output.
> > 
> > I could give them instructions for creating .fo files.
> > 
> > We have found in testing that what is most important is the BIDI algorithm
> > is applied so that text (including embedded numerals) is in the right order
> > and that form shaping is correct.  You need to know the Arabic alphabet and
> > its rules to assess the output of testing.  We have a team that knows Arabic
> > to do our testing.  They "eyeball" the reports to make sure they are in
> > proper Arabic with text and sub-text in the right order.  Embedded numerals
> > can be in a different order - left-to-right rather than right-to-left. It
> > isn't clear to me how this process can be automated.
> > 
> > You are right that widths change and this could change line breaking
> > decisions.  Do you know where in the FOP pipeline before we reach the
> > rendering pipeline the Arabic shaping could go so as to be able to affect
> > width selection?
> 
> Something needs to be done in the layout engine, possibly also on the FO tree.
> At least section 5.8 (“Unicode BIDI Processing”) of XSL-FO 1.1 deserves a look
> as it explains how the Unicode algorithm should be blended in XSL-FO
> processing. Inline-level stuff is likely to be affected. It needs to be seen
> how and when character re-ordering should be done WRT line breaking.
> 
> Also, something might need to be done at the font level. I don't know what
> ICU4J does, but I suspect it replaces characters from the Arabic range
> (U+0600–U+06FF) with ones from Arabic Presentation Forms-A (U+FB50–U+FDFF).
> AFAIU from the Unicode specification this is legacy that may not be supported
> by every font. I suppose modern fonts (especially OpenType ones) use the
> standard ligature mechanism to provide contextual glyphs.
> 
> 
> > I believe that what ensures the right glyphs are embedded in the PDF file is
> > the nature of the ICU4J algorithm which transforms the UNICODE
> > representation of the string.  The output for our Dubai team is PDFs with
> > embedded fonts and these are working so ICU4J must have solved the problem
> > in some way, and I believe the way they solve it is by using different
> > UNICODE codes.
> 
> Actually this is taken care of by the font library called by PDFPainter. I
> suspect the same is done at the layout stage, with the standalone glyphs.
> Which would be suboptimal, as both standalone and contextual glyphs would be
> embedded in the final PDF.
> 
> 
> > I don't have performance numbers to give you yet.  If ICU4J was clever about
> > the way they wrote their transform algorithm it should not be much of a
> > performance impact since they only need to transform text in the Arabic
> > UNICODE code range and testing whether text is in this range should be quick.
> > 
> > Thanks,
> > Jonathan
> > 
> > (In reply to comment #7)
> > > Hi,
> > > Thanks for your patch. Do you have an example FO file that could be used
> > > for testing purpose (even better, with an English translation)?
> > > IIUC, Arabic shaping is about replacing glyphs for standalone letters with
> > > suitable ligature glyphs for building words. Surely that affects character
> > > widths, so line breaking decisions? In the patch, shaping is performed at
> > > the rendering stage, so isn't there a danger of getting inconsistent results?
> > > Also, IIUC Arabic shaping affects glyph selection. How do you make sure
> > > that the right glyphs are being embedded in the PDF file?
> > > The same piece of code is duplicated in the PCL and PDF painters. The same
> > > would probably also need to be done for other painters. This is not
> > > desirable.
> > > Finally, what is the impact on performance? It looks like shaping will be
> > > applied to just any text, even non-Arabic text.
> > > Thanks,
> > > Vincent
> > > (In reply to comment #3)
> > > > Created an attachment (id=24934)
> > > > --> (https://issues.apache.org/bugzilla/attachment.cgi?id=24934) [details]
> > > > Support for Arabic PDF rendering using ICU4J
> > > > 
> > > > This patch uses ICU4J to do form-shaping and BIDI transformation of
> > > > rendered text.  It is a patch for the FOP trunk.  It does not change the
> > > > layout manager or the area tree handler or allow a writing-mode other
> > > > than “lr-tb”.  For this patch to be integrated with FOP, FOP would need
> > > > to distribute the ICU4J library - icu4j-4_2_1.jar.  It affects both PDF
> > > > and PCL rendering but has only been tested with PDF rendering.  So far
> > > > results of testing with PDF rendering have been positive.  The PCL
> > > > aspect of the patch looks correct given that the PDF aspect works.
> 
> 
> Vincent

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
