+1 to ICU. A normalizer is not very easy to build (I've built one). The
'Java Normalizer' is not available until Java 1.6.
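
For reference, here is a minimal sketch of what the JDK option [2] looks like
on Java 6 or later. It only uses java.text.Normalizer; the class name
NormalizeSketch and the helper toNfc are made up for illustration. The
"check first, normalize only if needed" pattern is meant to show the idea
behind a quick check, not ICU4J's own quick-check API.

```java
import java.text.Normalizer;

public class NormalizeSketch {
    // Hypothetical helper: returns the NFC form of the input, skipping the
    // normalization pass when the text is already in NFC.
    static String toNfc(String s) {
        if (Normalizer.isNormalized(s, Normalizer.Form.NFC)) {
            return s;
        }
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by COMBINING ACUTE ACCENT (two code points)
        String decomposed = "e\u0301";
        String composed = toNfc(decomposed);
        // NFC composes the pair into LATIN SMALL LETTER E WITH ACUTE
        System.out.println(composed.equals("\u00e9"));
    }
}
```

Since this class does not exist before Java 6, it would not help while Xerces
still has to compile with Java 1.3.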

On Sat, May 9, 2009 at 6:15 PM, Michael Glavassevich <[email protected]> wrote:

> Hi Richard,
>
> Richard Kelly <[email protected]> wrote on 05/08/2009 02:12:56 AM:
>
> > Hi everyone,
> >
> > Just thought I would give an update on how I've been preparing for my
> > GSoC work.  I managed to get my environment set up and I've been
> > building some basic XNI components to get a feel for the code.
>
>
> Sounds good.
>
>
> > I've also been researching the different
> > options of implementing the Unicode normalization functions.
> > Here are some pros/cons of the various approaches that I've thought of:
> >
> > ICU4J: [1]
> > (This is effectively the reference implementation of Unicode
> > normalization)
> > Pros:
> >   - Currently compiles with Java 1.3
> >   - Is fully tested, including all the exception cases
> >   - Implements 'quick check' optimizations which allow you to process
> > documents many times faster.
> >   - License seems to be compatible with Xerces license.
>
> Yes, I think it is. It's been reviewed before on the legal-discuss list [3]
> and I believe there are other Apache projects (e.g. Harmony [4]) that
> already bundle it.
>
>
> >   - Normalization code can be built as a modular component, so you
> > don't need the whole ICU4J library.
> > Cons:
> >   - Future versions of ICU4J are not guaranteed to compile with Java 1.3
> >   - Requires an additional license file to be added to the distribution
> >   - Adds a ~500 KB jar file to the build
> >
> >
> > Java Normalizer [2]
> > Pros:
> >   - No additional libraries needed.
> >   - Functionality built into Java so smaller file size.
> >   - No license required.
> > Cons:
> >   - Not available until Java 1.6
> >   - Doesn't implement 'quick check' optimizations so it's much slower.
> >
> >
> > Build from scratch:
> > Pros:
> >   - Complete control of source code
> >   - Can ensure that code compiles with Java 1.3
> > Cons:
> >   - Although the main functionality is fairly straightforward,
> >     some legacy Unicode requirements and edge cases make implementing
> >     the code pretty complicated.
> >   - Additional code maintenance if the Unicode standard changes
> >
> >
> >
> > I am leaning towards the first option (ICU4J) but welcome any other
> > input / comments before I decide.
>
> +1. I think that's the best choice of the three you've presented. No sense
> reinventing the wheel (building it from scratch) if we don't need to, and we
> can't depend on [2] because it's only available in Java 6.
>
> > In this case, since the ICU4J license needs to be attached, would it be
> > okay to create a text file called "LICENSE.normalizer" to handle this
> > requirement?
>
> Yes, that's exactly how we handle the licenses for other dependencies. It
> should get included in the packages produced by the build.
>
> > Thanks,
> > - Richard
> >
> > [1] http://site.icu-project.org/
> > [2] http://java.sun.com/javase/6/docs/api/java/text/Normalizer.html
>
> Thanks.
>
> [3] http://markmail.org/thread/rkdg4u5ziusxnqat
> [4] http://harmony.markmail.org/search/?q=ICU
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: [email protected]
>
