On Sun, 21 Nov 2010 21:26:53 -0500
Michel Fortin <michel.for...@michelf.com> wrote:

> On 2010-11-21 20:21:27 -0500, Andrei Alexandrescu 
> <seewebsiteforem...@erdani.org> said:
> 
> > That design, with which I experimented for a while, had two drawbacks:
> > 
> > 1. It had the default reversed, i.e. most often you want to regard a 
> > char[] or a wchar[] as a range of code points, not as an array of code 
> > units.
> > 
> > 2. It had the unpleasant effect that most algorithms in std.algorithm 
> > and beyond did the wrong thing by default, and the right thing only if 
> > you wrapped everything with byDchar().

Hello Michel,

> Well, basically these two arguments are the same: iterating by code 
> unit isn't a good default. And I agree. But I'm unconvinced that 
> iterating by dchar is the right default either. For one thing it has 
> more overhead, and for another it still doesn't represent a character.

This is an issue raised in a previous thread some weeks ago. More on it below.

> Now, add graphemes to the equation and you have a representation that 
> matches the user-perceived character concept, but for that you add 
> another layer of decoding overhead and a variable-size data type to 
> manipulate (a grapheme is a sequence of code points). And you have to 
> use Unicode normalization when comparing graphemes. So is that a good 
> default? Probably not. It might be "correct" in some sense, but it's 
> totally overkill for most cases.

It is not possible, as the writer of a text-processing lib or Text type, to define a 
single right level of abstraction (code unit, code point, or grapheme) that would both 
be usually efficient and avoid unexpected failures under "naive" use of the tool. 
The only safe level in 99% of cases is the highest-level one, namely the grapheme. 
Only then can one be sure that, for instance, text.count("ä") will actually 
count the "ä"'s in the source text. But in most cases, this is overkill. It depends on 
what the text actually contains, *and* on what the programmer knows about it (I mean 
that texts may be plain ASCII, so that even unsigned byte strings would do the 
job, but only if the programmer can rely on that...).
The tool writer cannot assume anything.
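
Here is a minimal sketch of that failure mode at the code point level, using 
std.algorithm's count; the sample words are only illustrative:

import std.algorithm : count;
import std.stdio : writeln;

void main()
{
    // the same visible word, spelled in two ways
    string precomposed = "gr\u00E4ben";   // "ä" as one code point, U+00E4
    string decomposed  = "gra\u0308ben";  // "ä" as 'a' + combining U+0308

    // substring search at the code point level only matches the form
    // that happens to agree with the needle's own spelling
    writeln(count(precomposed, "\u00E4"));  // 1
    writeln(count(decomposed,  "\u00E4"));  // 0 -- surprising for the naive user
}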

> My thinking is that there is no good default. If you write an XML 
> parser, you'll probably want to work at the code point level; if you 
> write a JSON parser, you can easily skip the overhead and work at the 
> UTF-8 code unit level until you start parsing a string; if you write 
> something to count the number of user-perceived characters or want to 
> map characters to a font then you'll want graphemes...

At least 3 factors must be taken into account:

1. The actual content of the source texts. For instance, 99.999% of all texts won't 
ever hold code points above U+FFFF. This tells which size should be used for code 
units. The safe general choice indeed being 32 bits.

2. The normalisation form of graphemes: whether they are decomposed (the right 
choice), in an unknown or possibly mixed form, or as precomposed as possible. If, in 
the latter case (by far the most common one for western-language texts), one can 
assert that every grapheme in every source text to be dealt with has a fully 
precomposed form (= 1 single code *point*), then the level of code points is safe 
enough.

3. Whether text is just transferred through an app or is also processed. Many 
apps just use some pieces of input text (files, user input, literals) as is, 
without any processing, and often output some of them, possibly concatenated. 
This is safe whatever the abstraction level of the text representation used; one 
can concatenate plain UTF-8 representing composite graphemes in decomposed form.

But as soon as any text-processing routine is used (indexing, slicing, find, 
count, replace...), questions arise about the correctness of the app.
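
To illustrate factor 2: once a text is known to be, or is forced into, fully 
precomposed (NFC) form, code point level routines give the expected answers. A small 
sketch, using std.uni's normalize as found in today's Phobos:

import std.algorithm : count;
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : normalize, NFC;

void main()
{
    string decomposed = "a\u0308";                  // 'a' + combining diaeresis
    string composed   = normalize!NFC(decomposed);  // one code point, U+00E4

    writeln(decomposed.walkLength);      // 2 code points
    writeln(composed.walkLength);        // 1 code point

    // once everything is precomposed, code point level search behaves
    writeln(count(composed, "\u00E4"));  // 1
}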

And, as said already, to be able to safely choose any lower level of 
representation, the programmer must know about the content, its properties, and its 
UCS coding. For instance, imagine you need to write an app dealing with texts 
containing phonetic symbols (IPA). How do you know which is the lowest safe 
level?
* What is the usual coding of IPA graphemes in UCS?
* Can they be coded in various ways? (yes!, too bad...)
* What is the highest code point that could ever be needed? (==> is a single UTF-8 
or UTF-16 code unit enough per code point?)
* Do all graphemes have a fully precomposed form?
* Can I be sure that all texts will actually be coded in precomposed form (this 
depends on the text-producing tools), "for ever"?
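
Answering such questions by hand is tedious; as a hypothetical helper (the probe 
name and samples are only illustrative, the functions are today's Phobos), one could 
at least inspect a sample of the texts:

import std.algorithm : maxElement;
import std.stdio : writefln;
import std.uni : normalize, NFC;

// probe two of the questions above for a given text: the highest code point
// it contains, and whether it is already fully precomposed (NFC)
void probe(string text)
{
    uint highest = cast(uint) text.maxElement;   // a string iterates by code point
    bool precomposed = (text == normalize!NFC(text));
    writefln("highest code point: U+%04X, already NFC: %s", highest, precomposed);
}

void main()
{
    probe("pʰɔnɛtɪk");   // IPA-ish sample, all in the BMP
    probe("a\u0308");    // decomposed "ä": not precomposed
}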

> Perhaps there should be simply no default; perhaps you should be forced 
> to choose explicitly at which layer you want to operate each time you 
> apply an algorithm on a string... and to make this less painful we 
> could have functions in std.string acting as a thin layer over similar 
> ones in std.algorithm that would automatically choose the right 
> representation for the algorithm depending on the operation.

My next project should be to write a Text type dealing at the highest level 
-- if only to showcase the issues involved by the "missing level of 
abstraction" in common tools supposed to deal with universal text.
This is much easier in D thanks to the proper string types, and the availability of 
tools to cope with lower-level issues, mainly decoding/encoding and validity 
checking (I do not know yet how practical said tools are).
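
For the record, a rough sketch of what such a grapheme-level Text could look like, 
written against std.uni's byGrapheme and normalize (present in today's Phobos); the 
NFD choice and the countOf name are only illustrative, not a design commitment:

import std.algorithm : count, equal;
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFD;

// sketch of a grapheme-level Text: normalize once on construction, then
// expose length and counting in user-perceived characters
struct Text
{
    private string data;   // kept in NFD, so equal graphemes have one spelling

    this(string s) { data = normalize!NFD(s); }

    // number of user-perceived characters (graphemes)
    size_t length() { return data.byGrapheme.walkLength; }

    // count occurrences of one grapheme, whatever form the caller passed
    size_t countOf(string g)
    {
        auto needle = normalize!NFD(g);
        return data.byGrapheme.count!(gr => gr[].equal(needle));
    }
}

unittest
{
    auto t = Text("a\u0308bc\u00E4");  // "ä" decomposed, then precomposed
    assert(t.length == 4);
    assert(t.countOf("\u00E4") == 2);  // both spellings count as the same "ä"
}

Slicing, find, replace and friends would follow the same pattern: normalize the 
argument, then work grapheme by grapheme.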


denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
