On Wed, Oct 15, 2014 at 10:00 PM, Ben Kloosterman <[email protected]>
wrote:

> 2. Does following set of rules for strings make sense? If no, why not?
>
>    - Strings are normalized via NFC
>    - String operations preserve NFC encoding
>
> BK> Not sure you can treat this as a given. Say web sites' UTF-8 and
> XML messages: do you really want to parse and rearrange all that data to
> ensure NFC compliance?
>

I claim that if you have to re-normalize it, then the incoming XML data
isn't text.

That doesn't stop you from dealing with it as byte data via a byte vector,
and I can definitely see a case for implementing many string operations
over either a raw byte vector or a wrapper around one.
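
To be concrete, here's a quick sketch of what I mean by the wrapper route,
in Rust. ByteText is a made-up name, not anything from an existing
library; the point is string-like operations over raw bytes with no
normalization claim attached:

    /// Hypothetical wrapper over a byte vector offering string-like
    /// operations without asserting any normalization form.
    struct ByteText(Vec<u8>);

    impl ByteText {
        fn len(&self) -> usize {
            self.0.len()
        }

        /// Byte-wise substring search. This is codepoint-correct for
        /// UTF-8 needles because UTF-8 is self-synchronizing, but it
        /// only finds byte-identical matches; canonical equivalence
        /// is deliberately out of scope here.
        fn find(&self, needle: &[u8]) -> Option<usize> {
            if needle.is_empty() {
                return Some(0); // empty needle matches at the start
            }
            self.0.windows(needle.len()).position(|w| w == needle)
        }
    }

    fn main() {
        let t = ByteText(b"hello world".to_vec());
        assert_eq!(t.find(b"world"), Some(6));
        assert_eq!(t.len(), 11);
    }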

Here's the thing: Unicode makes quite a mess of things by permitting valid
text to be un-normalized in the first place. I'm not at all sure why they
did that; it seems to me that they could easily have put well-formedness
rules in place. Though perhaps once you had both NFC and NFD that ship had
sailed.

What really has me in a twist here is that you want things like string
comparison and search to work sensibly without huge complications, yet on
longer strings you would really rather not be forced to copy them just to
normalize them first.

Has anybody looked into algorithms for search and compare that normalize on
the fly? If the penalty isn't too great, maybe that's the right way to
resolve this.
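
For what it's worth, here's a minimal sketch of the kind of thing I'm
picturing, in Rust, assuming the unicode-normalization crate. Its nfc()
adaptor normalizes lazily, buffering only one combining sequence at a
time, so neither input gets copied wholesale:

    use unicode_normalization::UnicodeNormalization;

    /// Canonical-equivalence comparison that normalizes both sides
    /// to NFC on the fly, without allocating normalized copies.
    fn nfc_eq(a: &str, b: &str) -> bool {
        a.chars().nfc().eq(b.chars().nfc())
    }

    fn main() {
        let composed = "\u{00e9}";    // "é" as a single codepoint
        let decomposed = "e\u{0301}"; // "e" plus combining acute
        assert!(nfc_eq(composed, decomposed));
    }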

Hmm. Though that implies carrying around megabytes of codepoint
normalization rules. Yuck.



> In terms of performance / compatibility with old algorithms / benchmarks,
> ASCII is still important. For this reason it's important to know that
> UTF-8 is ASCII.
>

That's a very unfortunate thing to treat as important, because it isn't
correct. Perhaps you mean that ASCII is valid UTF-8?
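
The direction matters in practice; a two-assert check in Rust makes the
subset relation concrete:

    fn main() {
        // Any ASCII byte sequence is already valid UTF-8...
        let ascii: &[u8] = b"plain old text";
        assert!(ascii.is_ascii());
        assert!(std::str::from_utf8(ascii).is_ok());

        // ...but valid UTF-8 is not, in general, ASCII:
        // U+00E9 encodes as the two bytes 0xC3 0xA9.
        let utf8 = "caf\u{00e9}";
        assert!(!utf8.is_ascii());
    }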


shap