Jeff Clites <[EMAIL PROTECTED]> wrote:

On Apr 9, 2004, at 7:19 AM, Leopold Toetsch wrote:
>> So internally, strings don't have an associated encoding (or
>> chartype or anything)
>
> How do you handle EBCDIC?
I'll use pseudo-C to illustrate:
string = string_make(buffer, "EBCDIC"); /* buffer came from an EBCDIC file, for instance containing the word "hello" */
outputBuffer1 = encode(string, "EBCDIC"); outputBuffer2 = encode(string, "ASCII"); outputBuffer3 = encode(string, "UTF-8"); outputBuffer4 = encode(string, "UTF-32");
//maybe write buffers out to files now....
So far, that would all just work, and outputBuffers 1-4 all have the word "hello" serialized using various encodings.
Now, let's say instead that we had done:
string = string_make(buffer, "ASCII"); /* buffer came from an ASCII file, for instance containing the word "hello" */
The other four encode() lines would be unchanged, outputBuffers 1-4 would be identical to what they were above, and in fact "string" itself would be identical in the two cases as well.
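For what it's worth, ICU's converter API already works this way; here's a tiny self-contained illustration of the EBCDIC example above. (Illustration only, not proposed Parrot code; the converter name "ibm-37" for US EBCDIC is my assumption, and the available names vary by ICU build. Compile with -licuuc.)

    #include <stdio.h>
    #include <unicode/ucnv.h>

    int main(void) {
        /* "hello" in EBCDIC (code page 037) */
        const char ebcdic[] = { (char)0x88, (char)0x85, (char)0x93,
                                (char)0x93, (char)0x96 };
        char out[64];
        UErrorCode err = U_ZERO_ERROR;

        /* One call decodes from EBCDIC and re-encodes as UTF-8; the
         * string in between is just a sequence of abstract characters. */
        int32_t len = ucnv_convert("UTF-8", "ibm-37",
                                   out, sizeof out,
                                   ebcdic, sizeof ebcdic, &err);
        if (U_FAILURE(err))
            return 1;
        printf("%.*s\n", (int)len, out);   /* prints: hello */
        return 0;
    }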
> UTF8 for Ponie?
Ponie should only need to pretend that a string is represented internally in UTF-8 in a limited number of situations--not for string comparisons or (most?) regular expressions, etc. For the cases where it does need that, it can create a buffer of bytes on the fly (or cache one) to work from. But I think of that as a backward-compatibility measure, not the case we're optimizing for.
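To make that concrete, the shape I have in mind is something like the sketch below: the canonical contents are logical characters, and a UTF-8 byte view is derived data, built on first request and cached. (Every name here--PString, pstring_utf8_view, encode_utf8--is invented for illustration; this is not Parrot's actual string struct.)

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct PString {
        int32_t *chars;       /* logical characters (codepoints)       */
        int      n_chars;
        char    *utf8_cache;  /* NULL until a UTF-8 view is requested  */
        int      utf8_len;
    } PString;

    /* Minimal BMP-only UTF-8 encoder, just to make the sketch complete. */
    static char *encode_utf8(const int32_t *cp, int n, int *len) {
        char *buf = malloc(3 * n + 1), *p = buf;
        for (int i = 0; i < n; i++) {
            int32_t c = cp[i];
            if (c < 0x80)       *p++ = (char)c;
            else if (c < 0x800) { *p++ = (char)(0xC0 | (c >> 6));
                                  *p++ = (char)(0x80 | (c & 0x3F)); }
            else                { *p++ = (char)(0xE0 | (c >> 12));
                                  *p++ = (char)(0x80 | ((c >> 6) & 0x3F));
                                  *p++ = (char)(0x80 | (c & 0x3F)); }
        }
        *p = '\0';
        *len = (int)(p - buf);
        return buf;
    }

    const char *pstring_utf8_view(PString *s) {
        if (!s->utf8_cache)   /* build once, reuse afterwards */
            s->utf8_cache = encode_utf8(s->chars, s->n_chars, &s->utf8_len);
        return s->utf8_cache;
    }

    int main(void) {
        int32_t cs[] = { 'h', 'i', 0x00E9 };     /* "hié" as codepoints */
        PString s = { cs, 3, NULL, 0 };
        printf("%s\n", pstring_utf8_view(&s));   /* builds the cache */
        printf("%s\n", pstring_utf8_view(&s));   /* reuses it */
        return 0;
    }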
> - Where is string->language?
I removed it from the string struct because I think that's the wrong place for it (and it wasn't actually being used anywhere yet, fortunately).
> Not used *yet* - what about:
>
>     use German;  print uc("i");
>     use Turkish; print uc("i");
Perfect example. The string "i" is the same in each case. What you've done is implicitly supplied a locale argument to the uc() operation--it's just a hidden form of:
    uc(string, locale);
The important thing is that the locale is a parameter to the operation, not an attribute of the string.
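That's exactly the shape of ICU's uppercasing API, for instance--the locale is an argument to the operation. A small self-contained illustration (link with -licuuc):

    #include <stdio.h>
    #include <unicode/ustring.h>

    int main(void) {
        UChar src[] = { 'i', 0 };   /* the same string "i" in both calls */
        UChar dst[8];
        UErrorCode err = U_ZERO_ERROR;

        /* English locale: "i" uppercases to "I" (U+0049) */
        u_strToUpper(dst, 8, src, -1, "en", &err);
        printf("en: U+%04X\n", dst[0]);

        /* Turkish locale: "i" uppercases to U+0130, dotted capital I */
        err = U_ZERO_ERROR;
        u_strToUpper(dst, 8, src, -1, "tr", &err);
        printf("tr: U+%04X\n", dst[0]);
        return 0;
    }

The string going in is identical both times; only the locale parameter to the operation differs.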
>> [When an operation is] language-dependent (sorting, for example),
>> the operation doesn't depend on the language of the strings involved,
>> but rather on the locale of the reader.
>
> And if one is working with two different languages at a time?
Hmm? The point is that if you have a list of strings, for instance some in English, some in Greek, and some in Japanese, and you want to sort them, then you have to pick a sort ordering. If you associate a language with each string, that doesn't help you--how do you compare the English and Japanese strings with one another? So again, the sort ordering is a parameter to the sort operation:
    sort_strings(array, locale);
And you could certainly have:
    sortedInOneWay     = sort_strings(someArrayOfStrings, locale1);
    sortedInAnotherWay = sort_strings(someArrayOfStrings, locale2);

That's only awkward if we're assuming a locale is hanging around, implicitly specified--if it's an explicit parameter, it's very clear. And in practice, the collation algorithm specified by a locale has to be prepared to handle sorting any strings at all--so that you can sort Japanese strings in the English sort order, for instance. That sounds strange, but generally this tends to be modeled as a base sorting algorithm (covering all characters/strings) plus small per-locale variations. That means that the sort order for Kanji strings would probably end up being the same for the English and Dutch locales, though strings containing Latin characters might sort differently.
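ICU's collation API is built this way: you open a collator for a locale and pass it to each comparison, so sorting the same strings under two locales is just two collators. A sketch (link with -licui18n and -licuuc):

    #include <stdio.h>
    #include <unicode/ucol.h>

    int main(void) {
        UErrorCode err = U_ZERO_ERROR;
        UCollator *en = ucol_open("en", &err);   /* English ordering */
        UCollator *sv = ucol_open("sv", &err);   /* Swedish ordering */
        UChar z[]  = { 'z', 0 };
        UChar ae[] = { 0x00E4, 0 };              /* a-diaeresis */

        /* English treats a-diaeresis as a variant of 'a', so it sorts
         * before 'z'; Swedish puts it after 'z', at the alphabet's end. */
        printf("en: %d\n", (int)ucol_strcoll(en, z, -1, ae, -1));  /*  1 */
        printf("sv: %d\n", (int)ucol_strcoll(sv, z, -1, ae, -1));  /* -1 */

        ucol_close(en);
        ucol_close(sv);
        return 0;
    }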
> With this string type, how do we deal with anything beyond codepoints?
Hmm, what do you mean?
"\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}" eq "\N{LATIN CAPITAL LETTER A WITH ACUTE}"
when comparing graphemes or letters. The latter might depend on the language too.
Right, but this isn't comparing graphemes, it's this:
one = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}" two = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";
one eq two //false--they're different strings normalizeFormD(one) eq normalizeFormD(two) //true
This is quite analogous to:
three = "abc" four = "ABC"
three eq four //false uc(three) eq uc(four) //true
So it's not comparing graphemes, it's comparing strings under normalization. That's a much clearer, and much better-worked-out, concept. If you want to think of this as comparing graphemes, you need to invent a data type to model "a grapheme", and that's a mess nobody has ever worked out. It's the wrong way to think about the problem.
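Here's that comparison spelled out against a real API--ICU's unorm2 normalization interface (illustration only; link with -licuuc):

    #include <stdio.h>
    #include <unicode/unorm2.h>
    #include <unicode/ustring.h>

    int main(void) {
        UErrorCode err = U_ZERO_ERROR;
        const UNormalizer2 *nfd = unorm2_getNFDInstance(&err);

        UChar one[] = { 0x0041, 0x0301, 0 };  /* A + combining acute */
        UChar two[] = { 0x00C1, 0 };          /* precomposed A-acute */
        UChar n1[8], n2[8];

        /* As raw strings they differ... */
        printf("raw: %s\n", u_strcmp(one, two) ? "different" : "equal");

        /* ...but they are equal once both are normalized to form D. */
        unorm2_normalize(nfd, one, -1, n1, 8, &err);
        unorm2_normalize(nfd, two, -1, n2, 8, &err);
        printf("NFD: %s\n", u_strcmp(n1, n2) ? "different" : "equal");
        return 0;
    }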
> We'll basically need 4 levels of string support:
>
> ,--[ Larry Wall ]-------------------------------------------------------
> | level 0  byte      == character, "use bytes" basically
> | level 1  codepoint == character, what we seem to be aiming for, vaguely
> | level 2  grapheme  == character, what the user usually wants
> | level 3  letter    == character, what the "current language" wants
> `------------------------------------------------------------------------
Yes, and I'm boldly arguing that this is the wrong way to go, and I guarantee you that you can't find any other string or encoding library out there which takes an approach like that, or anyone asking for one. I'm eager for Larry to comment.
>> ... or ask arbitrary strings what their N-th character [is]
>
> The N-th character depends on the level. In the above example,
> C<.length> gives either 2 or 1, depending on whether the user queries
> at level 1 or level 2. The same problem arises with positions. The
> current level also depends on the scope the string came from. (See the
> example regarding the Turkish letter "i".)
I'm arguing that whether you have dotted-i or dotless-i is decided when you create the string, and how those are case-mapped depends on your choice of case-folding algorithm. The ICU case-folding API has explicit variants for deciding how to case-fold the i's, for instance.
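Concretely, in ICU that's an option bit on the fold operation, not anything stored in the string--a minimal sketch (link with -licuuc):

    #include <stdio.h>
    #include <unicode/ustring.h>

    int main(void) {
        UErrorCode err = U_ZERO_ERROR;
        UChar src[] = { 0x0130, 0 };  /* LATIN CAPITAL LETTER I WITH DOT ABOVE */
        UChar dst[8];

        /* Default folding: U+0130 -> "i" + combining dot above (2 units) */
        int32_t n = u_strFoldCase(dst, 8, src, -1, U_FOLD_CASE_DEFAULT, &err);
        printf("default: %d unit(s), first U+%04X\n", (int)n, dst[0]);

        /* Turkic variant: U+0130 -> plain "i" (1 unit) */
        err = U_ZERO_ERROR;
        n = u_strFoldCase(dst, 8, src, -1, U_FOLD_CASE_EXCLUDE_SPECIAL_I, &err);
        printf("turkic:  %d unit(s), first U+%04X\n", (int)n, dst[0]);
        return 0;
    }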
>>> - What happened to external constant strings?
>>
>> They should still work (or could).
>
> ,--[ string.c:714 ]--------------------------------------------------
> | else /* even if buffer is "external", we won't use it directly */
> `--------------------------------------------------------------------
Yes, that's the else clause for the case where we've determined that we can't use the passed-in buffer directly. A few lines above that is the mem_sys_memcopy() we could avoid (though the code doesn't yet, I see), using the passed-in buffer directly in the external-constant case.
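The shape of that optimization would be roughly the following--a sketch with invented names (SketchString, sketch_make), not the actual string.c code:

    #include <stdlib.h>
    #include <string.h>

    typedef struct SketchString {
        const char *bytes;   /* owned copy, or borrowed external buffer */
        int         len;
        int         owns_buffer;
    } SketchString;

    SketchString sketch_make(const char *buf, int len, int is_external_const) {
        SketchString s;
        s.len = len;
        if (is_external_const) {
            s.bytes = buf;           /* borrow: no copy at all */
            s.owns_buffer = 0;
        } else {
            char *copy = malloc(len);
            memcpy(copy, buf, len);  /* the copy we'd like to skip */
            s.bytes = copy;
            s.owns_buffer = 1;
        }
        return s;
    }

    int main(void) {
        static const char greeting[] = "hello";  /* lives for program lifetime */
        SketchString s = sketch_make(greeting, 5, 1);
        return s.bytes == greeting ? 0 : 1;      /* same pointer: no copy */
    }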
>>> - What's the plan towards all the transcode opcodes? (And leaving
>>> these as a noop would have been simpler)
>>
>> Basically there's no need for a transcode op on a string--it no longer
>> makes sense; there's nothing to transcode.
>
> I can't imagine that. I have an ASCII string and want to convert it to
> UTF-8 and UTF-16 and write it into a file. How do I do that?
That's the mindset shift. You don't have an ASCII string. You have a string, which may have come from a file or buffer that represented it using the ASCII encoding. It's the example from above, again:
    inputBuffer  = read(inputHandle);
    string       = string_make(inputBuffer, "ASCII");
    outputBuffer = encode(string, "UTF-16");
    write(outputHandle, outputBuffer);
or, if we want to associate an encoding with a handle (sometimes more convenient, sometimes less), it could go like this:
inputHandle = open("file1", "ASCII"); string = stringRead(inputHandle); outputHandle = open("file2", "UTF-16"); stringWrite(outputHandle, string);
Basically, we need to think of strings from the top down (what concept are they trying to capture, what API makes sense for that), rather than from the bottom up (how are bytes stored in files). The key is to think of strings as, by definition, representing a sequence of (logical) characters, and to think of a character as what's trying to be captured by things such as "LATIN CAPITAL LETTER A WITH ACUTE" or "SYRIAC QUSHSHAYA". That's the whole cookie.
(And again, I wish I could take credit for all of this, but it's really the general state of the art with regard to strings. It's how people are thinking about these things these days.)
Jeff