On Wed, 01 Dec 2010 17:41:17 +0100
stephan <[email protected]> wrote:

> 
> >> There's one other issue that should be considered at some stage: 
> >> normalization and the fact that a single "character" can be constructed 
> >> from several code points. (acutes and such)
> >
> > This is my next little project. May build on Steve's job. (But it's not 
> > necessary, dchar is enough as a base, I guess.)
> >
> 
> Hi Denis, you might want to consider helping us out.
> 
> We have got a feature-complete Unicode normalization, case-folding, and 
> concatenation implementation passing all test cases in 
> http://unicode.org/Public/6.0.0/ucd/NormalizationTest.txt (and then 
> some) for all recent Unicode versions. This code was part of a bigger 
> project that we have stopped working on.
> 
> We feel that the Unicode normalization part might be useful to others. 
> Therefore we consider releasing them under an open source license. 
> Before we can do so, we have to clean up things a bit. Some open issues are
> 
> a)    The code still contains some TODOs and FIXMEs (bugs, 
> inefficiencies, some bigger issues like more efficient storing of data 
> etc.).
> 
> b)    No profiling and no benchmarking against the ICU implementation 
> (http://site.icu-project.org/) has been done yet (we expect surprises).
> 
> c)    Implementation of additional Unicode algorithms (e.g. full case 
> mapping, matching, collation).
> 
> Since we have stopped working on the bigger project, we haven’t made 
> much progress. Any help would be welcome. Let me know whether this would 
> be of interest to you.

Yes, of course it would be useful, in any case. Either you wish to carry on with your 
project, and I may be of some help; or it would at least be a useful base or 
example of how to implement Unicode algorithms. Maybe it's time to give some 
more information on what I intend to write. I have done it already (partially 
in Python, nearly completely in Lua).

What I have in mind is a "UText" type that provides the right abstraction for 
text processing / string manipulation, the same one has when dealing with ASCII (or 
in fact any legacy character set). All that is needed is a true one-to-one 
mapping between characters (in the common sense) and elements of strings (what 
I call "code stacks"); one given stack unambiguously denotes one character. To 
reach this point, in addition to decoding (e.g. from UTF-8 to code points), we 
must:
* group codes into stacks 
* normalize (into 'NFD')
* sort code points within each stack
That's the base.
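The pipeline above can be sketched in a few lines with Python's stdlib 
unicodedata (the name "to_stacks" and the list-of-strings representation 
are my own illustration, not any standard API; note that NFD already 
performs the canonical sorting of combining marks, so the last step is 
folded into the second):

```python
import unicodedata

def to_stacks(s):
    """Split a string into "code stacks": each stack is one base code
    point followed by its combining marks, in canonical order."""
    # normalize into NFD: precomposed characters are decomposed, and
    # combining marks are put into canonical (sorted) order
    nfd = unicodedata.normalize('NFD', s)
    # group codes into stacks: a new stack starts at every code point
    # whose combining class is 0 (a base character)
    stacks = []
    for ch in nfd:
        if unicodedata.combining(ch) == 0 or not stacks:
            stacks.append([ch])
        else:
            stacks[-1].append(ch)
    return [''.join(st) for st in stacks]

# "é" written precomposed (U+00E9) or decomposed (e + U+0301)
# yields the same single stack either way:
print(to_stacks('\u00e9') == to_stacks('e\u0301'))
```

With this in place, equality of two stacks is plain string comparison, 
whatever spelling the input used.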

Then we can, for instance, index or slice in O(1) as usual, and get a consistent 
substring of _characters_ (not "abstract characters"). We can search for 
substrings by simple, direct comparisons. When dealing with UTF-32 strings (or 
worse, UTF-8), simple indexing or counting is O(n), or rather O(k.n), where k 
represents the (high) cost of stacking, normalizing and sorting on the 
fly -- it is not only traversing the whole string instead of random access, it is 
heavy computation all along the way.
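To make the point concrete: once a text is held as a list of such stacks, 
indexing, slicing and substring search reduce to plain list operations on 
whole characters. A minimal self-contained sketch (the helper and its name 
are my own illustration, built on Python's stdlib unicodedata):

```python
import unicodedata

def to_stacks(s):
    """NFD-normalize and group code points into per-character stacks."""
    nfd = unicodedata.normalize('NFD', s)
    stacks = []
    for ch in nfd:
        if unicodedata.combining(ch) == 0 or not stacks:
            stacks.append([ch])
        else:
            stacks[-1].append(ch)
    return [''.join(st) for st in stacks]

text = to_stacks('de\u0301ja\u0300 vu')          # "déjà vu", decomposed input
print(len(text))                                  # counts characters, not codes
print(text[1] == 'e\u0301')                       # O(1) indexing on characters
print(text[:4] == to_stacks('d\u00e9j\u00e0'))    # slices compare consistently
# substring search is direct comparison, whatever the input spelling:
needle = to_stacks('\u00e9j')
print(any(text[i:i+2] == needle for i in range(len(text))))
```

The same operations on the raw UTF-8 or UTF-32 string would need to stack, 
normalize and sort on the fly at every step.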
From this base, all kinds of usual routines can be built without any extra 
complexity. That's all I want to implement. I wish to write all the 
general-purpose ones (which means, for instance, nothing like casing).

Precisely, I do not want to deal with anything related to script-, language-, 
or locale-specific issues. That is a completely separate and independent topic. 
This indeed includes the "compatibility" normalization forms of Unicode (which 
precisely do not provide a normal form...). It seems part of your project was 
to cope with such issues.

I would be happy to cooperate if you feel like going on (then, let us 
communicate off-list). I still have the Lua code (which used to run); even if 
it is useless as direct help for implementation (the languages are too different), 
it could give a more concrete picture of what I have in mind. It also includes 
several test datasets, reprocessed for usability, from Unicode's online files.


Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
