Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03:
> On Thu, 3 July 2003 19:02, srintuar26 wrote:
>
> > Well, for C++, white space are ' ', '\r', '\n', '\t'; it's totally trivial.
>
> Replace "whitespace" with "an arbitrary character predicate", e.g. for finding
> the end of an identifier.
>
> > If you want to iterate over a string, don't use single codepoint indexing.
>
> What to use instead? What should be the interface to a function which splits
> a string into a list of strings by characters satisfying a predicate?
>
It would be a function that splits a UTF-8 string into a list of strings
by substrings satisfying a predicate. Note that this is also more
powerful and covers more real-world situations (e.g. 'ch' character
pairs in Spanish). The question is what the interface to the predicate
should be. It can receive a single-codepoint string (in which case it is
precisely equivalent to a per-character predicate), or it can receive an
index/pointer into the string and return a boolean (or whatever) result
together with a new index. The latter is general enough to support
splitting on occurrences of arbitrary regexps, for example.

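
A minimal sketch of the index-based interface, in Python over raw UTF-8
bytes. All names here (``Matcher``, ``next_char``, ``split_by``,
``literal``, ``regex``) are made up for illustration, and the input is
assumed to be valid UTF-8. The matcher gets the bytes and an offset and
returns the offset just past a separator, or None if no separator starts
there:

```python
import re
from typing import Callable, List, Optional

# Hypothetical interface: (utf8_bytes, offset) -> offset past the separator,
# or None when no separator starts at that offset.
Matcher = Callable[[bytes, int], Optional[int]]


def next_char(data: bytes, i: int) -> int:
    """Return the offset of the code point following the one at offset i."""
    i += 1
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i


def split_by(data: bytes, match: Matcher) -> List[bytes]:
    """Split valid UTF-8 data on every separator recognised by match."""
    parts = []
    start = i = 0
    while i < len(data):
        end = match(data, i)
        if end is not None and end > i:
            # Separator found: emit the piece before it and skip past it.
            parts.append(data[start:i])
            start = i = end
        else:
            # No (or empty) match: advance by one whole code point.
            i = next_char(data, i)
    parts.append(data[start:])
    return parts


def literal(sep: str) -> Matcher:
    """Matcher for a fixed separator, which may span several code points."""
    raw = sep.encode("utf-8")
    return lambda data, i: i + len(raw) if data.startswith(raw, i) else None


def regex(pattern: bytes) -> Matcher:
    """Matcher for a bytes regular expression anchored at the offset."""
    compiled = re.compile(pattern)

    def match(data: bytes, i: int) -> Optional[int]:
        m = compiled.match(data, i)
        return m.end() if m else None

    return match


print(split_by("año/niño".encode("utf-8"), literal("/")))
print(split_by("mucho".encode("utf-8"), literal("ch")))
print(split_by(b"a1b22c", regex(rb"[0-9]+")))
```

The matcher is only ever called at code point boundaries, so a
single-codepoint predicate is just the special case where the returned
offset is ``next_char(data, i)``, while ``literal("ch")`` and ``regex``
show separators longer than one code point.
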
> > (For example, Spanish in NFD wouldn't work even in UTF-32, because some
> > letters would take two codepoints)
>
> If someone chooses NFD, he will get what he deserves. It should not penalize
> all programmers.
>
In many languages NFC wouldn't save you either.

> > I think many people have yet to realize that the structure of Unicode is
> > inherently biased towards multi-byte encodings. Unicode is multi-codepoint
> > by design, and the conception that "encoding-unit == codepoint ==
> > character" is fundamentally broken, and an illusion that may trap the
> > unwary.
>
> It's easier to map codepoints to characters than to map bytes to characters
> (which must go through codepoints anyway). What kind of string processing
> UTF-8 makes simpler than UTF-32?
>
Any processing that works now on ASCII and doesn't break UTF-8 sequences
in the middle. UTF-8 is better for interfacing to Unicode-unaware
libraries that take things like file paths, split them on ``/``, and
certainly do not expect null bytes in the middle. UTF-8 will "just work"
for most ``char *`` C APIs; that's the whole point of it.

-- 
Beni Cherniavsky <[EMAIL PROTECTED]>

"Reading the documentation I felt like a kid in a toy shop."
  -- Phil Thompson on Python's standard library

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
