Aaron Sherman wrote:
> On Tue, 2004-06-29 at 11:34, Austin Hastings wrote:
> > (2) Perl6 should equitably support all its target
> > locales; (3) we should set out to make sure the performance is damn
> > fast no matter what locale we're using.

> Well, that's a nice theory, but you can prove that low-level encodings (e.g. ASCII, EBCDIC) will be more efficient than high-level encodings (e.g. UTF-8), so the only way to accomplish what you suggest in (2) is to break (3) by slowing down the faster handling (not what you wanted, I'm sure).

At the Parrot level, codepoint operations will generally be the most efficient, even on strings with exotic charsets. Parrot uses an internal encoding that allows O(1) access to codepoints; essentially, it uses an array of 8-, 16-, or 32-bit integers, depending on the highest codepoint value. This is the default even for character sets with shift characters, like Shift-JIS.
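
To make that concrete, here's a rough sketch of the idea in C (hypothetical names, not the actual Parrot string code): with a fixed-width buffer, fetching the n-th codepoint is a single array index, whatever the width.

    /* Hypothetical fixed-width string buffer in the spirit of Parrot's
     * internal encodings; none of these names come from the Parrot source. */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint8_t width;   /* bytes per codepoint: 1, 2, or 4 */
        size_t  len;     /* length in codepoints            */
        void   *buf;     /* uint8_t/uint16_t/uint32_t array */
    } fixed_str;

    /* O(1): the n-th codepoint is just element n at the right width. */
    static uint32_t codepoint_at(const fixed_str *s, size_t n)
    {
        switch (s->width) {
            case 1:  return ((const uint8_t  *)s->buf)[n];
            case 2:  return ((const uint16_t *)s->buf)[n];
            default: return ((const uint32_t *)s->buf)[n];
        }
    }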


On strings where all codepoints have values under 256, bytewise and codepointwise lookup are equivalent; otherwise, though, bytewise lookup will actually be *slower* than codepointwise, because Parrot has to synthesize the byte view from those wider integers in order to maintain the illusion that each codepoint is stored in an integer that's the perfect size for it.
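
To see why the byte view costs extra, here's a sketch (hypothetical, building on the fixed_str above) of what a bytewise read has to do once the buffer holds 16-bit integers:

    /* Still O(1), but with extra work: the byte view over a 16-bit buffer
     * has to pick the right code unit and then the right half of it.
     * Here the low-order byte is treated as coming first; a real
     * implementation would have to pin down which byte order it exposes. */
    static uint8_t byte_at_16(const fixed_str *s, size_t n)
    {
        uint16_t unit = ((const uint16_t *)s->buf)[n / 2];
        return (n % 2) ? (uint8_t)(unit >> 8) : (uint8_t)(unit & 0xFF);
    }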

If you force Parrot to use the UTF-8 encoding internally, then bytewise lookup becomes fastest and codepointwise lookup slows down a lot. But you really shouldn't do that--UTF-8 is ill-suited for actually *manipulating* text, unlike Parrot's internal encodings. (UTF-16 and UTF-32 will presumably be available too, although I've seen no specific mention of them.)
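
For contrast, here's roughly what codepoint indexing costs once the bytes are UTF-8 (sketch only, no handling of malformed input): finding the n-th codepoint means walking the string from the start.

    /* UTF-8: O(n) to locate the n-th codepoint. Continuation bytes are
     * 10xxxxxx (0x80..0xBF); everything else starts a new codepoint. */
    static size_t utf8_offset_of(const uint8_t *buf, size_t buflen, size_t n)
    {
        size_t i = 0;
        while (i < buflen && n > 0) {
            i++;                                      /* past the lead byte */
            while (i < buflen && (buf[i] & 0xC0) == 0x80)
                i++;                                  /* skip continuations */
            n--;
        }
        return i;  /* byte offset of codepoint n, or buflen if past the end */
    }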

You can also force it to use a "raw" or "bytes" encoding, where bytes and codepoints are identical. But you can't store Unicode characters in such a string and have them behave in a reasonable way.

(Note: this is all based on my own, possibly faulty, memory.)

--
Brent "Dax" Royal-Gordon <[EMAIL PROTECTED]>
Perl and Parrot hacker

Oceania has always been at war with Eastasia.
