On 10/14/10 10:09 PM, Ben Kloosterman wrote:
> On 10/14/10 3:51 PM, Jonathan S. Shapiro wrote:
>> In further practice, the number of strands tends to be small, so the
>> difference between O(log n) and O(1) is negligible.
>
> I'm not sure this is true. For example, in all languages “<”, “.”, “>”,
> point and numbers are ASCII. In Chinese, Y-M-D is mixed Chinese and ASCII
> numerics; in fact, in nearly all languages you have UCS-2 codes but
> interspersed ASCII numbers and punctuation. So you would need some sort of
> complex encoding such that sequences of length < n stay in the higher
> encoding form. This is also good because short strings would not need a
> tree and hence incur no cost.
This is definitely an issue with the proposal. But if it can be surmounted, I think the stranded-string proposal is a nice one, certainly better than settling on any particular UTF-N for everything. UTF-8 works well for European languages and about half the content of Asian languages, but that doesn't convince me that the other half of Asian languages should get screwed, or that UTF-8 is the best internal representation in the world.

Solving this issue may take a bit of sufficient smartness, however. If we're trying to avoid that, then the API should have ways of tweaking when we switch encodings. At the very least, it should have some way of saying when a (short) string should be forced to be a single strand, using whatever strand width is necessary. Working out the details of the rest of the API could be tricky, though.

-- 
Live well,
~wren
_______________________________________________
bitc-dev mailing list
http://www.coyotos.org/mailman/listinfo/bitc-dev
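
To make the two knobs discussed above concrete, here is a minimal sketch in Python. It is not BitC code and not the proposal's actual design: the names `StrandedString`, `unit_width`, `min_run`, and `force_single` are all hypothetical, the widths 1/2/4 are an assumed set of strand encodings, and a Python `str` stands in for each packed fixed-width array. It only illustrates the idea that short narrow runs stay in the wider neighbouring strand, that a string can be forced into one strand, and that indexing is a binary search over strand start offsets.

```python
from bisect import bisect_right
from itertools import groupby


def unit_width(ch):
    """Smallest assumed code-unit width (in bytes) that can hold ch."""
    cp = ord(ch)
    return 1 if cp < 0x80 else 2 if cp < 0x10000 else 4


class StrandedString:
    """Hypothetical stranded string: a list of (width, chunk) strands."""

    def __init__(self, text, min_run=4, force_single=False):
        if force_single:
            # One strand, using whatever width the widest character needs.
            runs = [(max(map(unit_width, text)), text)] if text else []
        else:
            # Maximal runs of characters that share the same minimal width.
            runs = [(w, "".join(g)) for w, g in groupby(text, key=unit_width)]
            runs = self._absorb_short_runs(runs, min_run)
        self.strands = runs
        self.starts = []          # start offset of each strand
        pos = 0
        for _, chunk in runs:
            self.starts.append(pos)
            pos += len(chunk)
        self.length = pos

    @staticmethod
    def _absorb_short_runs(runs, min_run):
        # Single pass: a run shorter than min_run is widened to match its
        # narrowest wider neighbour; equal-width neighbours are then merged.
        widths = [w for w, _ in runs]
        promoted = []
        for i, (w, chunk) in enumerate(runs):
            if len(chunk) < min_run:
                wider = [widths[j] for j in (i - 1, i + 1)
                         if 0 <= j < len(runs) and widths[j] > w]
                if wider:
                    w = min(wider)
            promoted.append((w, chunk))
        merged = []
        for w, chunk in promoted:
            if merged and merged[-1][0] == w:
                merged[-1] = (w, merged[-1][1] + chunk)
            else:
                merged.append((w, chunk))
        return merged

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Binary search over strand starts: O(log n) in the strand count.
        k = bisect_right(self.starts, i) - 1
        return self.strands[k][1][i - self.starts[k]]

    def storage_bytes(self):
        return sum(w * len(chunk) for w, chunk in self.strands)


date = "2010年10月14日"
print(StrandedString(date).strands)
# [(1, '2010'), (2, '年10月14日')]
print(len(StrandedString(date, min_run=1).strands))            # 6
print(StrandedString(date, force_single=True).strands)
# [(2, '2010年10月14日')]
```

With the assumed default threshold of 4, the mixed Chinese/ASCII date collapses from six strands to two, at a cost of a few extra bytes for the digits stored at the wider width. Whether a threshold like this, or something smarter, is the right policy is exactly the open API question.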
