It's just my gut-level feeling that the traditional world of C, Unix, locales, etc. simply does not provide appropriate abstractions for internationalization. Yes, you can get there if you throw enough libraries, random functions, macros, pipes, and filters at it, but the basic abstractions leak like a sieve. It's time to clean it all up.
Beyond encoding, locales aren't all that bad. I think toupper is broken for language-specific case folding. gettext isn't a beauty queen, but it works well enough considering the job it has to do. I'd be interested to know which parts you find to be of poor design?
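To illustrate the toupper complaint: C's toupper() maps one character to one character, so it structurally cannot express language-specific case rules. A small Python sketch (using Python only because its str.upper() does full Unicode case mapping, which makes the gap easy to see):

```python
# German sharp s uppercases to the two-character sequence "SS" --
# something a one-to-one mapping like toupper() cannot produce.
print("straße".upper())    # STRASSE

# A naive per-character mapping, the best toupper() could do,
# has no choice but to leave the character alone:
naive = "".join(c.upper() if len(c.upper()) == 1 else c for c in "straße")
print(naive)               # STRAßE -- wrong
```

(Turkish dotted/dotless i is the other classic case: 'i' should uppercase to 'İ' in a Turkish locale, which no locale-blind table can get right.)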
Anyway, if anyone wants to give me specific feedback on the current design of Perl 6, that'd be cool.
I did take a peek, and I would like to ask a few questions. What is a "Unicode abstraction level"? It seems to be mentioned only in the context of Perl 6. I assume it means the default way to operate on strings (by bytes, code points, characters, etc.).

Also, if "A Str is a Unicode string object.", which encoding is it internally? That is, if I have a Str $A and I say "$A.bytes", should I have any expectation of what I'll get for a given string literal? It seems Perl will effectively be enforcing this encoding. (If that is UTF-8, I have no objections though.)

Lastly, just as an overall comment: the Str concept does seem to be a bit heavy, but at the same time, if most of the operations are "lazy", then it would seem to be a simple thing for me to set my default such that all operations happen in byte mode. As long as the library routines do not make assumptions about what my global default is, that should work fine. (But if setting the UAL to "bytes" breaks all Unicode-aware library functions, that would make it useless.)

For example: say I read a list of filenames into "Str" scalars. Some of the filenames are valid UTF-8; some have garbagy parts. I'd like to be able to perform an operation, such as appending some other Str to them, then creating a new file with the new name, and perhaps printing out said name. If that happens without my program having to jump through any hoops, or getting scolded with warnings to stderr, then I'll be satisfied. I won't want to know or care how many morphemes are in the strings, or whether or not they contain invalid sequences, and if the Str class is truly lazy, then it won't care either. No code will ever bother to investigate the number of glyphs until someone asks.
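For what it's worth, Python's surrogateescape error handler is one existing design that gives roughly this "don't care until asked" behavior for filenames with garbagy parts: invalid bytes are smuggled through as lone surrogates instead of raising an error, and concatenation and round-tripping just work. A sketch (the filename here is made up):

```python
# \xe9 is not valid UTF-8 in this position, but decoding with
# surrogateescape tolerates it instead of raising an error.
raw = b"report-caf\xe9.txt"
name = raw.decode("utf-8", "surrogateescape")

# Ordinary string operations -- appending, slicing -- work normally:
backup = name + ".bak"

# Encoding back recovers the original bytes exactly, so the result
# is still usable as a filename on the original filesystem:
assert backup.encode("utf-8", "surrogateescape") == raw + b".bak"
```

No code inspects whether the string is valid Unicode until something actually forces a strict encode.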
That, I could agree with :) I know I definitely want substr and its ilk to work on the bytes, but one tricky part might be regexes: what I would want is for the regex engine to work in a UTF-8-aware mode, but to gracefully handle the presence of bad sequences rather than just croak when it sees them (similar to how "Encode" allows me to specify whether to croak or ride it out).
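One way to get the "ride it out" behavior today is to run the regex engine over raw bytes, sidestepping decoding entirely. A Python sketch of the contrast (filename invented for illustration):

```python
import re

# The pattern matches even though the name contains a byte
# sequence (\xe9 followed by an ASCII byte) that is invalid UTF-8.
raw = b"caf\xe9_2003.txt"
m = re.search(rb"(\d+)\.txt$", raw)
assert m and m.group(1) == b"2003"

# By contrast, a strict decode croaks on the bad sequence --
# exactly the failure mode to avoid:
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass  # strict mode refuses the string outright
```

The byte-mode engine can't match character classes like \w against multi-byte characters, of course, which is why a UTF-8-aware-but-tolerant mode would be the best of both.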
Though [email protected] would probably be a better forum for that.
If you think the above comments and questions would be better placed there, I'd be happy to repost in that list. -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
