Re: Unicode "Alphabetic" characters

2006-02-28 Thread Manuel Mall
On Wednesday 01 March 2006 06:37, Jeremias Maerki wrote:

> Does anyone of you plan to work on the UAX#14 stuff?

I would love to. See
http://marc.theaimsgroup.com/?l=fop-dev&m=113074361626846&w=2 where I
documented the work I did with Joerg's code. Like Joerg, I would
prefer a pull solution (not the BreakIterator solution) as IMO it would
be easier to integrate into FOP, especially as some custom
modifications are likely to be required. For example:
* UAX#14 breaks at the end of a sequence of white space while Knuth
breaks at the beginning of a sequence of white space.
* We need to take the white space handling related properties into
account when feeding character pairs into the UAX#14 algorithm.
* We need to take borders/padding (nested inlines) into account with
respect to line breaking.
* Using the pull approach would also allow us to integrate the
determination of hyphenation break points into the same loop, avoiding
the need to iterate over sequences of text multiple times.
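
To illustrate what a pull-style interface could look like, here is a
minimal sketch. It is not Joerg's code and not a UAX#14 implementation:
the class name, the constants and the single "break after a run of
spaces" rule are all hypothetical. The point is only that the caller
owns the loop and feeds one character at a time, so white space
handling, border/padding and hyphenation checks could sit in that same
loop.

```java
/** Hypothetical pull-style break finder: the caller feeds characters one by one. */
final class PullBreakSketch {
    static final int NO_BREAK = 0;
    static final int BREAK_ALLOWED = 1;

    private boolean started;      // false until the first character has been fed
    private boolean prevWasSpace; // true if the previously fed character was a space

    /**
     * Feeds the next character and reports whether a break opportunity
     * lies directly before it. Only one simplified rule is modelled:
     * a break is allowed at the end of a run of spaces.
     */
    int nextChar(char ch) {
        boolean isSpace = (ch == ' ');
        int result = (started && prevWasSpace && !isSpace) ? BREAK_ALLOWED : NO_BREAK;
        started = true;
        prevWasSpace = isSpace;
        return result;
    }

    /** Convenience driver: the kind of loop a caller would own. */
    static int[] scan(String text) {
        PullBreakSketch finder = new PullBreakSketch();
        int[] actions = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            // A real caller could also apply white space properties or look up
            // hyphenation points here, in the same pass over the text.
            actions[i] = finder.nextChar(text.charAt(i));
        }
        return actions;
    }
}
```

Moving the opportunity to the beginning of the space run (the Knuth
convention mentioned above) would then be a local change inside
nextChar() rather than a post-processing step over iterator results.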

>
> Jeremias Maerki

Manuel


Re: Unicode "Alphabetic" characters

2006-02-28 Thread Jeremias Maerki
:-) Thanks.

On 01.03.2006 00:02:59 J.Pietschmann wrote:
> Jeremias Maerki wrote:
> > On 28.02.2006 21:48:27 J.Pietschmann wrote:
> >> If all else fails, do the same as for the line breaking properties.
> > 
> > Sorry, but I don't understand what you mean.
> 
> Generate the necessary data tables directly from the Unicode
> source, automatically or manually. Yes, NIH appears to raise
> its ugly head...
> 
> J.Pietschmann



Jeremias Maerki



Re: Unicode "Alphabetic" characters

2006-02-28 Thread J.Pietschmann

Jeremias Maerki wrote:
> Did you see that there's a BreakIterator in ICU4J?

Ooops, missed that. Thank you for the correction.

> If you guys tell me that it would be
> worthwhile to take this library (or parts of it) aboard, I'm fine with
> it.


Well, for line breaking I don't see an advantage over using the Java
BreakIterator (other than the latter not being available in Java 1.3).
I'd prefer to have a more "pull style" interface to the line break
finder though.
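
For comparison, this is roughly what the iterator style looks like
with the standard java.text.BreakIterator. A small sketch only; the
class and method names are made up for illustration. The iterator
decides where the boundaries are and the caller just collects offsets,
which is why per-character customisation is awkward.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakIteratorSketch {
    /** Collects every boundary offset the line-break iterator reports for the text. */
    static List<Integer> lineBreakOffsets(String text) {
        BreakIterator it = BreakIterator.getLineInstance(Locale.ROOT);
        it.setText(text);
        List<Integer> offsets = new ArrayList<Integer>();
        // first() returns the start of the text; next() walks forward until DONE.
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            offsets.add(pos);
        }
        return offsets;
    }
}
```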

J.Pietschmann


Re: Unicode "Alphabetic" characters

2006-02-28 Thread J.Pietschmann

Jeremias Maerki wrote:
> On 28.02.2006 21:48:27 J.Pietschmann wrote:
>> If all else fails, do the same as for the line breaking properties.
>
> Sorry, but I don't understand what you mean.


Generate the necessary data tables directly from the Unicode
source, automatically or manually. Yes, NIH appears to raise
its ugly head...
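
A rough sketch of the "automatic" variant, assuming the input uses the
usual UCD line layout ("code or range ; property # comment", as in
PropList.txt). The class and method names are hypothetical, and a real
generator would go on to write the ranges out as a Java source table.

```java
import java.util.ArrayList;
import java.util.List;

final class PropListParserSketch {
    /**
     * Extracts the inclusive code point ranges listed for one property.
     * Each returned element is {start, end}; single code points give start == end.
     */
    static List<int[]> parseRanges(Iterable<String> lines, String property) {
        List<int[]> ranges = new ArrayList<int[]>();
        for (String line : lines) {
            // Strip the trailing comment, then skip blank or comment-only lines.
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.length() == 0) {
                continue;
            }
            String[] fields = data.split(";");
            if (fields.length < 2 || !fields[1].trim().equals(property)) {
                continue;
            }
            // The first field is either "XXXX" or "XXXX..YYYY" in hexadecimal.
            String[] bounds = fields[0].trim().split("\\.\\.");
            int start = Integer.parseInt(bounds[0], 16);
            int end = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : start;
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }
}
```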

J.Pietschmann


Re: Unicode "Alphabetic" characters

2006-02-28 Thread Jeremias Maerki

On 28.02.2006 21:48:27 J.Pietschmann wrote:
> Jeremias Maerki wrote:
> > Yep, that's exactly what I need. But hey, adding a 3MB library just for
> > this method, that is a little much.
> 
> If all else fails, do the same as for the line breaking properties.

Sorry, but I don't understand what you mean.


Jeremias Maerki



Re: Unicode "Alphabetic" characters

2006-02-28 Thread Jeremias Maerki

On 28.02.2006 22:16:05 J.Pietschmann wrote:
> Simon Pepping wrote:
> > It aims to be _the_ Java access library to Unicode. As FOP becomes
> > more Unicode aware, can we do without it? Perhaps it also has something
> > on UAX#14, line breaking?
> 
> It has the tables, but not the algorithm.

Did you see that there's a BreakIterator in ICU4J?
http://icu.sourceforge.net/userguide/boundaryAnalysis.html
This page claims that ICU supports TR14 (=UAX#14?). Most of what is
available in the C version also seems to have been ported to Java.

> Java already has the BreakIterator as an algorithm implementation, but
> no direct access to the line breaking properties themselves, which thwarts
> attempts to have an alternative implementation based on already
> available data :-(
> Same for BIDI :-(, although Java's interface to the BIDI related
> algorithms is better than the BreakIterator attempt, fortunately.
> 
> > Is there a way to make it an optional extension to Java's Unicode
> > support, to be installed by those users who want to use Unicode
> > features in FOP that go beyond the ordinary? Most such users may
> > already have it installed.
> 
> I'd go for pluggable "algorithm providers", as already proposed
> several times for various purposes. ICU-based algorithms could be
> preferred when ICU is present, falling back to a cruder implementation
> if it is not.
> 
> BTW ICU has lots of other interesting features relevant for I18N,
> look for example at the calendar section or the number formatting.

It looks like ICU4J is a nice toolbox with many little wonders. Even a
guide to scaling down the library [1]. :-) Well, we probably don't need
the calendar section, do we? If you guys tell me that it would be
worthwhile to take this library (or parts of it) aboard, I'm fine with
it. I just want to be sure that it's for more than just determining the
class of certain characters. Does anyone of you plan to work on the
UAX#14 stuff? The license of ICU4J should be ok. It's almost the same as
the already approved X.Net license. Just needs a sanity check with the
VP Legal Affairs.

[1] 
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icu4j/readme.html#HowToModularize


Jeremias Maerki



Re: Unicode "Alphabetic" characters

2006-02-28 Thread J.Pietschmann

Simon Pepping wrote:
> It aims to be _the_ Java access library to Unicode. As FOP becomes
> more Unicode aware, can we do without it? Perhaps it also has something
> on UAX#14, line breaking?

It has the tables, but not the algorithm.

Java already has the BreakIterator as an algorithm implementation, but
no direct access to the line breaking properties themselves, which thwarts
attempts to have an alternative implementation based on already
available data :-(
Same for BIDI :-(, although Java's interface to the BIDI related
algorithms is better than the BreakIterator attempt, fortunately.

> Is there a way to make it an optional extension to Java's Unicode
> support, to be installed by those users who want to use Unicode
> features in FOP that go beyond the ordinary? Most such users may
> already have it installed.

I'd go for pluggable "algorithm providers", as already proposed
several times for various purposes. ICU-based algorithms could be
preferred when ICU is present, falling back to a cruder implementation
if it is not.
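
A minimal sketch of such a provider, under a few assumptions: the
interface and factory names are hypothetical, the ICU4J lookup goes
through reflection (UCharacter.hasBinaryProperty with
UProperty.ALPHABETIC) so there is no compile-time dependency, and the
fallback is simply Character.isLetter, which is cruder than the
Unicode "Alphabetic" property. Modern Java syntax is used for brevity.

```java
import java.lang.reflect.Method;

/** Hypothetical provider interface; not an existing FOP API. */
interface AlphabeticProvider {
    boolean isAlphabetic(int codePoint);
}

final class AlphabeticProviders {
    /** Returns an ICU4J-backed provider if ICU4J is on the classpath, else a crude fallback. */
    static AlphabeticProvider create() {
        try {
            Class<?> ucharacter = Class.forName("com.ibm.icu.lang.UCharacter");
            Class<?> uproperty = Class.forName("com.ibm.icu.lang.UProperty");
            final int alphabetic = uproperty.getField("ALPHABETIC").getInt(null);
            final Method hasBinaryProperty =
                    ucharacter.getMethod("hasBinaryProperty", int.class, int.class);
            return codePoint -> {
                try {
                    return (Boolean) hasBinaryProperty.invoke(null, codePoint, alphabetic);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("ICU4J call failed", e);
                }
            };
        } catch (ReflectiveOperationException | LinkageError e) {
            // ICU4J not available: fall back to the JDK's letter test.
            return codePoint -> Character.isLetter(codePoint);
        }
    }
}
```

Callers would ask the factory once and keep the provider, so the
presence check is not repeated per character.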

BTW ICU has lots of other interesting features relevant for I18N,
look for example at the calendar section or the number formatting.

J.Pietschmann



Re: Unicode "Alphabetic" characters

2006-02-28 Thread Simon Pepping
On Tue, Feb 28, 2006 at 08:56:22PM +0100, Jeremias Maerki wrote:
> Yep, that's exactly what I need. But hey, adding a 3MB library just for
> this method, that is a little much. Hmm. fop.jar is only 1.7MB. :-) But
> ICU4J looks interesting.

It aims to be _the_ Java access library to Unicode. As FOP becomes
more Unicode aware, can we do without it? Perhaps it also has something
on UAX#14, line breaking?

Is there a way to make it an optional extension to Java's Unicode
support, to be installed by those users who want to use Unicode
features in FOP that go beyond the ordinary? Most such users may
already have it installed.

The license seems to allow us to repackage it, and we could package a
cutdown version that would satisfy FOP's needs. There are a few huge
resource files in impl/data.

Regards, Simon

-- 
Simon Pepping
home page: http://www.leverkruid.nl



Re: Unicode "Alphabetic" characters

2006-02-28 Thread J.Pietschmann

Jeremias Maerki wrote:

Yep, that's exactly what I need. But hey, adding a 3MB library just for
this method, that is a little much.


If all else fails, do the same as for the line breaking properties.

J.Pietschmann


Re: Unicode "Alphabetic" characters

2006-02-28 Thread Jeremias Maerki
Yep, that's exactly what I need. But hey, adding a 3MB library just for
this method, that is a little much. Hmm. fop.jar is only 1.7MB. :-) But
ICU4J looks interesting.

On 28.02.2006 20:39:16 Simon Pepping wrote:
> On Tue, Feb 28, 2006 at 05:50:29PM +0100, Jeremias Maerki wrote:
> > As you know, I'm currently looking into adding support for fixed spaces.
> > This had some effects on the handling of letter- and word-spacing. I've
> > got word-spacing together with fixed spaces working in the meantime but
> > I'm tracking down a problem with letter spacing. 7.16.2 says that only
> > the characters that are marked by Unicode as "Alphabetic" are eligible
> > for space-start and space-end traits from the letter-spacing property.
> > I'm trying to come up with an isAlphabetic(char ch) method in
> > CharUtilities but I'm having trouble there. "Alphabetic" [1] consists of
> > characters from several "general categories" and from "Other_Alphabetic".
> > The general categories are easy (see java.lang.Character's constants and
> > its getType(char) method). Does anyone have suggestions how best to
> > identify characters from "Other_Alphabetic"? I found a list in the
> > Unicode database which lists all the character ranges that make up
> > Other_Alphabetic but maybe there's already something in Java that I can
> > use. Thanks.
> > 
> > [1] http://www.unicode.org/Public/UNIDATA/UCD.html#Alphabetic
> 
> Did you have a look at the ICU software: http://icu.sourceforge.net/?
> Esp. this static constant:
> http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/lang/UProperty.html#ALPHABETIC,
> looks promising.



Jeremias Maerki



Re: Unicode "Alphabetic" characters

2006-02-28 Thread Simon Pepping
On Tue, Feb 28, 2006 at 05:50:29PM +0100, Jeremias Maerki wrote:
> As you know, I'm currently looking into adding support for fixed spaces.
> This had some effects on the handling of letter- and word-spacing. I've
> got word-spacing together with fixed spaces working in the meantime but
> I'm tracking down a problem with letter spacing. 7.16.2 says that only
> the characters that are marked by Unicode as "Alphabetic" are eligible
> for space-start and space-end traits from the letter-spacing property.
> I'm trying to come up with an isAlphabetic(char ch) method in
> CharUtilities but I'm having trouble there. "Alphabetic" [1] consists of
> characters from several "general categories" and from "Other_Alphabetic".
> The general categories are easy (see java.lang.Character's constants and
> its getType(char) method). Does anyone have suggestions how best to
> identify characters from "Other_Alphabetic"? I found a list in the
> Unicode database which lists all the character ranges that make up
> Other_Alphabetic but maybe there's already something in Java that I can
> use. Thanks.
> 
> [1] http://www.unicode.org/Public/UNIDATA/UCD.html#Alphabetic

Did you have a look at the ICU software: http://icu.sourceforge.net/?
Esp. this static constant:
http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/lang/UProperty.html#ALPHABETIC,
looks promising.

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.nl



Unicode "Alphabetic" characters

2006-02-28 Thread Jeremias Maerki
As you know, I'm currently looking into adding support for fixed spaces.
This had some effects on the handling of letter- and word-spacing. I've
got word-spacing together with fixed spaces working in the meantime but
I'm tracking down a problem with letter spacing. 7.16.2 says that only
the characters that are marked by Unicode as "Alphabetic" are eligible
for space-start and space-end traits from the letter-spacing property.
I'm trying to come up with an isAlphabetic(char ch) method in
CharUtilities but I'm having trouble there. "Alphabetic" [1] consists of
characters from several "general categories" and from "Other_Alphabetic".
The general categories are easy (see java.lang.Character's constants and
its getType(char) method). Does anyone have suggestions how best to
identify characters from "Other_Alphabetic"? I found a list in the
Unicode database which lists all the character ranges that make up
Other_Alphabetic but maybe there's already something in Java that I can
use. Thanks.
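
To make the question concrete, here is a sketch of the shape I have in
mind. The general categories are checked through Character.getType()
and anything else goes through a range table. The table below is only
a tiny hand-picked excerpt of Other_Alphabetic for illustration, not
the full list, and the class name is a placeholder.

```java
final class AlphabeticSketch {

    // Illustrative excerpt only: inclusive {start, end} ranges from Other_Alphabetic.
    // The complete list would have to come from the Unicode database.
    private static final int[][] OTHER_ALPHABETIC_SAMPLE = {
        {0x0345, 0x0345}, // COMBINING GREEK YPOGEGRAMMENI
        {0x24B6, 0x24E9}, // CIRCLED LATIN CAPITAL LETTER A..CIRCLED LATIN SMALL LETTER Z
    };

    /** Alphabetic = Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic. */
    static boolean isAlphabetic(char ch) {
        switch (Character.getType(ch)) {
            case Character.UPPERCASE_LETTER: // Lu
            case Character.LOWERCASE_LETTER: // Ll
            case Character.TITLECASE_LETTER: // Lt
            case Character.MODIFIER_LETTER:  // Lm
            case Character.OTHER_LETTER:     // Lo
            case Character.LETTER_NUMBER:    // Nl
                return true;
            default:
                return isOtherAlphabetic(ch);
        }
    }

    /** Linear scan over the sample table; a full table would want a binary search. */
    private static boolean isOtherAlphabetic(char ch) {
        for (int[] range : OTHER_ALPHABETIC_SAMPLE) {
            if (ch >= range[0] && ch <= range[1]) {
                return true;
            }
        }
        return false;
    }
}
```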

[1] http://www.unicode.org/Public/UNIDATA/UCD.html#Alphabetic

Jeremias Maerki