On Mon, Oct 20, 2014 at 7:10 PM, Ben Kloosterman <[email protected]> wrote:

> On Tue, Oct 21, 2014 at 11:38 AM, Jonathan S. Shapiro <[email protected]>
> wrote:
>


> Here's the thing: Unicode makes quite a mess of things by permitting valid
>> text to be un-normalized in the first place. I'm not at all sure why they
>> did that; it seems to me that they could easily have put well-formedness
>> rules in place. Though perhaps once you had both NFC and NFD that ship had
>> sailed.
>>
>
> Agreed, searching should not have this complication.
>

Well, sure. But it does, so we have to deal with it somehow.


> What really has me in a twist here is that you want things like string
>> comparison and search to work sensibly without huge complications, and on
>> longer strings you would really rather not be forced to copy them in order
>> to normalize them for comparison and search.
>>
>> Has anybody looked into algorithms for search and compare that normalize
>> on the fly? If the penalty isn't too great, maybe that's the right way to
>> resolve this.
>>
>
> I assume this either will not work behind the scenes with immutable
> strings, or will be expensive and force other data types.
>

I don't think immutability is an issue here. Try thinking about it this way:

At bottom, a comparison or a search is implemented by comparing characters
to characters. When you get down to the level of comparing two characters,
you need to consider a base code point and some number of modifier code
points. The thing is: the sequences involved aren't that long, and at worst
you need to reorder some of them. So it's not insane to think that you
could do the normalization on the fly. And I wonder whether there are
algorithms that can avoid the reordering altogether (I think there may be).
Actually, I'm thinking a string compare can be done in O(n) in the string
length, even with the normalization issue.
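
To make that concrete, here is a minimal sketch in Rust rather than BitC
(an assumption on my part: the unicode-normalization crate's lazy nfc()
adapter stands in for whatever composition tables we would build
ourselves). The normalization runs as an iterator adapter over the
comparison walk, so neither string is copied into normalized form:

    // Compare two strings under NFC without materializing normalized
    // copies: nfc() composes combining sequences lazily as the
    // comparison advances through both strings.
    use unicode_normalization::UnicodeNormalization;

    fn nfc_eq(a: &str, b: &str) -> bool {
        // Element-wise comparison of the two lazily normalized streams.
        a.nfc().eq(b.nfc())
    }

    fn main() {
        let composed = "caf\u{00e9}";    // "café", precomposed U+00E9
        let decomposed = "cafe\u{0301}"; // "cafe" + combining acute accent
        assert_ne!(composed, decomposed);      // bytewise they differ...
        assert!(nfc_eq(composed, decomposed)); // ...yet canonically equal
    }

The reordering worry is confined to a single combining sequence, and
those are short; UAX #15's stream-safe format even caps them at 30
marks, so the lookahead buffer is constant and the compare stays O(n).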


> I think it's best to assume strings are compliant but not enforce it
> (from the above I think we are on the same page - earlier the language
> was a bit strong)
>

If it's not compliant, how is it a string?

We clearly have to work with denormalized strings. That's not the issue
here. The issue is: what does it *mean* (as a matter of semantics) to be a
string?
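
One way to pin that down is to make compliance part of the string type's
invariant, enforced at construction. A rough sketch, again in Rust and
with hypothetical naming:

    // A string whose invariant is "valid UTF-8, canonically normalized",
    // established once when the value is built.
    use unicode_normalization::UnicodeNormalization;

    pub struct NfcString(String);

    impl NfcString {
        pub fn from_bytes(bytes: &[u8]) -> Result<NfcString, std::str::Utf8Error> {
            let s = std::str::from_utf8(bytes)?; // reject ill-formed UTF-8
            Ok(NfcString(s.nfc().collect()))     // normalize once, at import
        }

        pub fn as_str(&self) -> &str {
            &self.0
        }
    }

On that reading, a denormalized byte sequence isn't a string at all;
it's input that hasn't been imported yet.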


>  .. users should convert on import if they are not sure, and the default
> encoders should do this.
> I.e., when importing a UTF-8 web page or XML document, you can directly
> wrap the byte data to make a string if you are sure of your data, and the
> user should convert otherwise; but if you run an encoder, it does
> normalize, which includes a UTF-8 to UTF-8/NFC encoder.
>

No. You can't just directly wrap the byte data. Either you validate or you
lie to yourself.
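
In Rust terms again (my assumption about how an import layer might be
shaped): the honest path validates the bytes and reports where they
break, while the "just wrap it" path exists only behind an unsafe
escape hatch, which is exactly the lie:

    // from_utf8 is an O(n) validation scan with no copy; the unchecked
    // variant wraps the bytes directly and defers the damage downstream.
    fn import_text(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
        std::str::from_utf8(bytes)
    }

    fn main() {
        let good: &[u8] = b"caf\xc3\xa9"; // valid UTF-8 for "café"
        let bad: &[u8] = b"caf\xe9";      // Latin-1 é; not valid UTF-8
        assert!(import_text(good).is_ok());
        assert!(import_text(bad).is_err());
        // The lie: unsafe { std::str::from_utf8_unchecked(bad) }
    }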


shap