It's just my gut-level feeling that the traditional world of C, Unix, locales, etc. simply does not provide appropriate abstractions for internationalization. Yes, you can get there if you throw enough libraries, random functions, macros, pipes, and filters at it, but the basic abstractions leak like a sieve. It's time to clean it all up.
Beyond encoding, locales aren't all that bad. I think toupper is broken for language-specific case folding. gettext isn't a beauty queen, but it works well enough considering the job it has to do. I'd be interested to know which parts you find to be of poor design?
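To illustrate the toupper complaint: C's toupper() maps one character to one character, so it structurally cannot express language-specific case rules. A small Python sketch (using Python only because its str.upper() does full Unicode case mapping, which makes the gap easy to see):

```python
# German sharp s uppercases to the two-character sequence "SS" --
# something a one-to-one mapping like toupper() cannot produce.
print("straße".upper())    # STRASSE

# A naive per-character mapping, the best toupper() could do,
# has no choice but to leave the character alone:
naive = "".join(c.upper() if len(c.upper()) == 1 else c for c in "straße")
print(naive)               # STRAßE -- wrong
```

(Turkish dotted/dotless i is the other classic case: 'i' should uppercase to 'İ' in a Turkish locale, which no locale-blind table can get right.)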
Anyway, if anyone wants to give me specific feedback on the current design of Perl 6, that'd be cool.
I did take a peek, and I would like to ask a few questions. What is a "Unicode abstraction level"? It seems to be mentioned only in the context of Perl 6. I assume it means the default way to operate on strings (by bytes, code points, characters, etc.).

Also, if "A Str is a Unicode string object.", which encoding is it internally? That is, if I have a Str $A and I say "$A.bytes", should I have any expectation of what I'll get for a given string literal? It seems Perl will effectively be enforcing this encoding. (If that is UTF-8, I have no objections though.)

Lastly, just as an overall comment: the Str concept does seem to be a bit heavy, but at the same time, if most of the operations are "lazy", then it would seem to be a simple thing for me to set my default such that all operations happen in byte mode. As long as the library routines do not make assumptions about what my global default is, that should work fine. (But if setting the UAL to "bytes" breaks all Unicode-aware library functions, that would make it useless.)

For example: say I read a list of filenames into "Str" scalars. Some of the filenames are valid UTF-8; some have garbagy parts. I'd like to be able to perform an operation, such as appending some other Str to them, then creating a new file with the new name, and perhaps printing out said name. If that happens without my program having to jump through any hoops, or getting scolded with warnings to stderr, then I'll be satisfied. I won't want to know or care how many morphemes are in the strings, or whether or not they contain invalid sequences, and if the Str class is truly lazy, then it won't care either. No code will ever bother to investigate the number of glyphs until someone asks.
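For what it's worth, Python's surrogateescape error handler is one existing design that gives roughly this "don't care until asked" behavior for filenames with garbagy parts: invalid bytes are smuggled through as lone surrogates instead of raising an error, and concatenation and round-tripping just work. A sketch (the filename here is made up):

```python
# \xe9 is not valid UTF-8 in this position, but decoding with
# surrogateescape tolerates it instead of raising an error.
raw = b"report-caf\xe9.txt"
name = raw.decode("utf-8", "surrogateescape")

# Ordinary string operations -- appending, slicing -- work normally:
backup = name + ".bak"

# Encoding back recovers the original bytes exactly, so the result
# is still usable as a filename on the original filesystem:
assert backup.encode("utf-8", "surrogateescape") == raw + b".bak"
```

No code inspects whether the string is valid Unicode until something actually forces a strict encode.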
That, I could agree with :) I know I definitely want substr and its ilk to work on the bytes, but one tricky part might be regexes: what I would want is for the regex engine to work in a UTF-8-aware mode, but to gracefully handle the presence of bad sequences rather than just croak when it sees them (similar to how "Encode" allows me to specify whether to croak or ride it out).
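One way to get the "ride it out" behavior today is to run the regex engine over raw bytes, sidestepping decoding entirely. A Python sketch of the contrast (filename invented for illustration):

```python
import re

# The pattern matches even though the name contains a byte
# sequence (\xe9 followed by an ASCII byte) that is invalid UTF-8.
raw = b"caf\xe9_2003.txt"
m = re.search(rb"(\d+)\.txt$", raw)
assert m and m.group(1) == b"2003"

# By contrast, a strict decode croaks on the bad sequence --
# exactly the failure mode to avoid:
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass  # strict mode refuses the string outright
```

The byte-mode engine can't match character classes like \w against multi-byte characters, of course, which is why a UTF-8-aware-but-tolerant mode would be the best of both.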
Though [email protected] would probably be a better forum for that.
If you think the above comments and questions would be better placed there, I'd be happy to repost in that list. -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
