On Thu, 15 Jan 2009, Bryan Jurish wrote:
> Unicode might be more immediately intuitive to most users, but when it
> comes down to it, byte-strings are IMHO the more basic representation (a
> char* is still a char*, even in this post-unicode world).

What happened is that people switched to UTF-8 instead of some fixed-size
encoding because many apps that assume that a character is a byte will
work anyway. Just don't ask those apps how many characters there are in
a string. You have to pretend that all the "special" characters are pairs
of characters instead (when they are not triplets or quadruplets).

> A good string handling mechanism should have a good general default
> representation (e.g. as UTF-${MachineWordBits}), but should likewise
> allow access to "raw" byte strings, and be able to accommodate various
> encodings. Not that I'm really hankering to write any of that, mind you
> ;-) Perhaps a better name for the external as I think of it would be
> [any2bytes]. I'm perfectly willing to cede the "string" name to
> something better (Martin's string patch comes to mind),

I gather that it'll take a long time before Pd gets unicode support...

> ... except if you're building rsp. reading a persistent index for a
> large file, in which case tell() & seek() are likely to be a wee bit
> faster than parsing and counting variable-length-encoded characters ...

right.
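
A rough Python sketch of that kind of index, assuming a file opened in binary mode so that tell() and seek() work in plain byte offsets (the helper names are hypothetical, and an in-memory io.BytesIO stands in for the large file):

```python
import io


def build_line_index(f):
    """Return the byte offset at which each line of a binary file starts."""
    offsets = []
    while True:
        pos = f.tell()          # byte position before reading the line
        line = f.readline()
        if not line:            # empty bytes object means end of file
            break
        offsets.append(pos)
    return offsets


def read_line_at(f, offsets, n):
    """Jump straight to line n by byte offset and decode only that line."""
    f.seek(offsets[n])
    return f.readline().decode("utf-8")


if __name__ == "__main__":
    data = "un\ndeux é\ntrois\n".encode("utf-8")
    f = io.BytesIO(data)
    index = build_line_index(f)
    print(index, read_line_at(f, index, 2))
```

The index stores byte offsets, so looking up a line costs one seek() no matter how many multi-byte characters come before it. Only the line actually read gets decoded.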
_ _ __ ___ _____ ________ _____________ _____________________ ...
| Mathieu Bouchard - tél:+1.514.383.3801, Montréal, Québec