Re: Full Unicode based on UTF-16 proposal

Norbert Lindenberg Sun, 25 Mar 2012 23:11:58 -0700

Perfectly valid concerns.

My thinking here is that applications normally want to deal with code points, 
but we force them to deal with UTF-16 code units and additional flags because 
compatibility requires it. Within modules, where we know that compatibility is 
not an issue, I'd rather give applications what they need by default.

Looking back at Java, supporting supplementary characters was fairly painless 
for many applications despite UTF-16 because Java already had a rich API 
performing all kinds of operations on strings, so many applications had little 
need to look at individual characters in the first place. We went through the 
entire Java SE API and fixed all those operations to use code point semantics 
(look for "under the hood" at [1] for details). We were also able to switch 
regular expressions to code point semantics without any flags because regular 
expressions never worked on binary data and developers hadn't created funky 
workarounds to support supplementary characters yet. JavaScript today has more 
constraints, but for new development it would still be good to get as close as 
possible to that experience.
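
To make the difference concrete, here is a minimal sketch in JavaScript. It 
assumes an engine that implements the /u flag discussed in this thread with 
code point semantics; the variable names are illustrative only.

```javascript
// U+1D400 MATHEMATICAL BOLD CAPITAL A, a supplementary (non-BMP) character,
// written as its UTF-16 surrogate pair.
const boldA = "\uD835\uDC00";

// Strings stay sequences of 16-bit code units: one character, length 2.
const codeUnitCount = boldA.length;

// Without /u, "." matches a single code unit, so it cannot span the pair.
const matchesAsCodeUnits = /^.$/.test(boldA);

// With /u (assumed to be supported), "." matches a whole code point.
const matchesAsCodePoints = /^.$/u.test(boldA);

console.log(codeUnitCount, matchesAsCodeUnits, matchesAsCodePoints);
```

The same pattern gives different answers depending on the flag, which is why 
the choice of default matters for code that never opts in explicitly.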

Norbert

[1] http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

On Mar 24, 2012, at 23:56 , David Herman wrote:

> On Mar 24, 2012, at 4:32 PM, Norbert Lindenberg wrote:
> 
>> One concern: I think code point based matching should be the default for 
>> regex literals within modules (where we know the code is written for 
>> Harmony).
> 
> This idea makes me nervous. Partly because I think we should keep the set of 
> semantic changes between non-module code and module code reasonably small, 
> and partly because the idea of your proposal is to continue to treat strings 
> as sequences of 16-bit code units, not Unicode code points-- which means that 
> quietly switching regexps to be closer to operating at the level of code 
> points seems like it creates a kind of impedance mismatch. It feels more 
> appropriate to me to require programmers to declare explicitly that they're 
> dealing with a string at the level of code points, using the (quite concise) 
> /u flag. That way they're saying "yes, I know this string is just a sequence 
> of 16-bit code units, but it may contain non-BMP data, and I would like to 
> match its contents with a regexp that deals with code points."
> 
> (Again, I'm still new to the finer points of Unicode, so I'm prepared to be 
> shown I'm thinking about it wrong.)
> 
> Dave
> 

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
