On Wed, Oct 15, 2014 at 10:00 PM, Ben Kloosterman <[email protected]> wrote:
> 2. Does the following set of rules for strings make sense? If no, why
> not?
>
>   - Strings are normalized via NFC
>   - String operations preserve NFC encoding
>
> BK> Not sure if you can treat this as granted .. say web sites' UTF8
> and XML messages: do you really want to parse and rearrange all that
> data to ensure NFC compliance?

I claim that if you have to re-normalize it, then the incoming XML data
isn't text. That doesn't stop you from dealing with it as byte data via a
byte vector, and I can definitely see a case for implementing many string
operations over a byte vector (or a wrapper around one). (First sketch in
the P.S. below.)

Here's the thing: Unicode makes quite a mess of things by permitting
valid text to be un-normalized in the first place. I'm not at all sure
why they did that; it seems to me that they could easily have put
well-formedness rules in place. Though perhaps once you had both NFC and
NFD, that ship had sailed.

What really has me in a twist here is that you want things like string
comparison and search to work sensibly without huge complications, and on
longer strings you would really rather not be forced to copy them just to
normalize them for comparison and search. Has anybody looked into
algorithms for search and compare that normalize on the fly? If the
penalty isn't too great, maybe that's the right way to resolve this
(second sketch in the P.S.). Hmm. Though that implies carrying megabytes
of codepoint normalization rules. Yuck.

> In terms of performance / compatibility of old algorithms / benchmarks,
> ASCII is still important. For this reason it's important to know that
> the UTF8 is ASCII.

That's a very unfortunate thing to have as important, since it isn't
correct. Perhaps you mean that ASCII is valid UTF-8? (Third sketch in the
P.S.)

shap
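P.S. A few rough sketches, in Rust purely for concreteness, since BitC
would need its own machinery. The third-party unicode-normalization crate
and the names Incoming and classify below are illustrative assumptions,
not a proposal. First, the boundary check behind the "it isn't text"
claim: bytes get blessed as text only if they are valid UTF-8 *and*
already NFC; everything else stays an opaque byte vector.

    // Sketch only: assumes the unicode-normalization crate.
    use unicode_normalization::is_nfc;

    enum Incoming {
        Text(String),   // valid UTF-8, already in NFC
        Bytes(Vec<u8>), // everything else: opaque data, not text
    }

    fn classify(raw: Vec<u8>) -> Incoming {
        match String::from_utf8(raw) {
            Ok(s) if is_nfc(&s) => Incoming::Text(s),
            Ok(s) => Incoming::Bytes(s.into_bytes()),
            Err(e) => Incoming::Bytes(e.into_bytes()),
        }
    }

    fn main() {
        match classify(b"plain ascii xml".to_vec()) {
            Incoming::Text(s) => println!("text: {}", s),
            Incoming::Bytes(b) => println!("opaque: {} bytes", b.len()),
        }
    }

Nothing here re-normalizes; non-NFC input simply never acquires the type
"string", which is the whole point.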
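Second, one answer to the normalize-on-the-fly question, assuming the
same crate's lazy NFC iterator: neither input is copied, and each side is
normalized codepoint by codepoint as the comparison walks along.

    use unicode_normalization::UnicodeNormalization;

    // Compare under NFC without materializing normalized copies.
    fn nfc_eq(a: &str, b: &str) -> bool {
        a.chars().nfc().eq(b.chars().nfc())
    }

    fn main() {
        let composed = "caf\u{e9}";     // U+00E9 precomposed (NFC)
        let decomposed = "cafe\u{301}"; // 'e' + U+0301 combining acute
        assert_ne!(composed, decomposed);      // raw bytes differ
        assert!(nfc_eq(composed, decomposed)); // equal under NFC
        println!("equal under NFC");
    }

This doesn't dodge the table problem (the normalization data still has to
ship with the runtime), and search is messier than equality, since
normalization can compose or reorder codepoints across a match boundary.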
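Third, what I mean by "ASCII is valid UTF-8": UTF-8 was designed so that
every 7-bit ASCII byte encodes the same character it does in ASCII, so a
pure-ASCII byte string is already well-formed UTF-8 with no transcoding.
Plain std Rust this time, no assumptions:

    fn main() {
        let ascii: &[u8] = b"hello, world";
        // Every byte of 7-bit ASCII is below 0x80 ...
        assert!(ascii.iter().all(|&b| b < 0x80));
        // ... and such a sequence is already valid UTF-8, byte for byte.
        let s = std::str::from_utf8(ascii).expect("pure ASCII is UTF-8");
        assert_eq!(s, "hello, world");
        println!("{}", s);
    }

The converse, which is what BK's phrasing suggested, fails on the first
multi-byte sequence.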
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
