Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03:

> On Thu, 3 July 2003 19:02, srintuar26 wrote:
>
> > Well, for C++, whitespace is ' ', '\r', '\n', '\t'; it's totally trivial.
>
> Replace "whitespace" with "an arbitrary character predicate", e.g. for finding
> the end of an identifier.
>
> > If you want to iterate over a string, don't use single-codepoint indexing.
>
> What to use instead? What should be the interface to a function which splits
> a string into a list of strings by characters satisfying a predicate?
>
It would be a function that splits a UTF-8 string into a list of
strings at substrings satisfying a predicate.  Note that it can also
be more powerful and cover more real-world situations (e.g. 'ch'
character pairs in Spanish).  The question is what the interface to
the predicate should be.  It can receive a single-codepoint string
(and then it's precisely equivalent to the per-character version) or
it can receive an index/pointer into the string and return a boolean
(or whatever) result and a new index.  This would be general enough
to support splitting on occurrences of arbitrary regexps, for example.
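
A rough sketch of the index-based variant, to make the interface
concrete.  All names here (SeparatorAt, split_utf8, ascii_space_at,
ch_at) are made up for illustration, and it assumes the input is valid
UTF-8.  Instead of a boolean plus a new index, the predicate returns
the length in bytes of the separator starting at the given offset, or
0 if there is none; that folds both results into one value.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical predicate interface: given the string and a byte offset,
// return the length in bytes of the separator starting there, or 0 if none.
using SeparatorAt =
    std::function<std::size_t(const std::string&, std::size_t)>;

// Length of the UTF-8 sequence introduced by a lead byte.
// Assumes valid UTF-8; no validation is attempted.
inline std::size_t utf8_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

// Split s wherever separator_at reports a match.  The predicate is only
// ever called at the start of a UTF-8 sequence, never in the middle.
std::vector<std::string> split_utf8(const std::string& s,
                                    const SeparatorAt& separator_at) {
    std::vector<std::string> parts;
    std::size_t start = 0;  // start of the current piece
    std::size_t i = 0;      // current byte offset
    while (i < s.size()) {
        std::size_t n = separator_at(s, i);
        if (n > 0) {
            assert(i + n <= s.size());  // predicate must stay in bounds
            parts.push_back(s.substr(start, i - start));
            i += n;
            start = i;
        } else {
            // Not a separator: skip one whole codepoint.
            i += utf8_length(static_cast<unsigned char>(s[i]));
        }
    }
    parts.push_back(s.substr(start));
    return parts;
}

// Example separator: the ASCII whitespace characters listed in the thread.
std::size_t ascii_space_at(const std::string& s, std::size_t i) {
    char c = s[i];
    return (c == ' ' || c == '\r' || c == '\n' || c == '\t') ? 1 : 0;
}

// Example separator: the two-letter sequence "ch".
std::size_t ch_at(const std::string& s, std::size_t i) {
    return s.compare(i, 2, "ch") == 0 ? 2 : 0;
}
```

With these, split_utf8(text, ascii_space_at) splits on whitespace and
split_utf8(text, ch_at) splits on a two-character unit; a regexp
matcher anchored at offset i would fit the same signature.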

> > (For example, Spanish in NFD wouldn't work even in UTF-32, because some
> > letters would take two codepoints)
>
> If someone chooses NFD, he will get what he deserves. It should not penalize
> all programmers.
>
In many languages NFC wouldn't save you either, since not every
base-plus-combining-mark sequence has a precomposed form.

> > I think many people have yet to realize that the structure of Unicode is
> > inherently biased towards multi-byte encodings. Unicode is multi-codepoint
> > by design, and the conception that "encoding-unit == codepoint ==
> > character" is fundamentally broken, and an illusion that may trap the
> > unwary.
>
> It's easier to map codepoints to characters than to map bytes to characters
> (which must go through codepoints anyway). What kind of string processing
> UTF-8 makes simpler than UTF-32?
>
Any processing that works now on ASCII and doesn't break UTF-8
sequences in the middle.  UTF-8 is better for interfacing to
Unicode-unaware libraries that take things like file paths and will
split them on ``/`` and will certainly not expect NUL bytes in the
middle.  Every byte of a multi-byte UTF-8 sequence has its high bit
set, so a ``/`` or NUL byte can never show up inside one.  UTF-8 will
"just work" for most ``char *`` C APIs, that's the whole point of it.
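
To illustrate, here is a minimal sketch of such a Unicode-unaware
splitter.  The name split_path is made up for this example; it stands
in for any library code that only knows about ``char *`` and the ``/``
byte.  It never decodes anything, yet it cannot cut a UTF-8 sequence in
half.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical byte-oriented splitter: it treats the path as a plain
// NUL-terminated char array and looks only for the '/' byte.
std::vector<std::string> split_path(const char* path) {
    assert(path != nullptr);
    std::vector<std::string> components;
    const char* start = path;
    for (;;) {
        // strchr compares single bytes; bytes >= 0x80 never equal '/'.
        const char* slash = std::strchr(start, '/');
        if (slash == nullptr) {
            components.emplace_back(start);  // last component
            break;
        }
        components.emplace_back(start, slash);  // bytes in [start, slash)
        start = slash + 1;
    }
    return components;
}
```

Fed a UTF-8 path with non-ASCII directory names, it returns each
component as an intact UTF-8 string (plus an empty first component for
a leading slash).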

-- 
Beni Cherniavsky <[EMAIL PROTECTED]>

"Reading the documentation I felt like a kid in a toy shop."
 -- Phil Thompson on Python's standard library
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
