Re: [Rails-core] Investigating Unicode. Take 2, with nastities and allegations.

Thijs Van Der Vossen Fri, 23 Dec 2005 03:22:02 -0800

On 23 Dec 2005, at 11:10 , Michael Koziarski wrote:

Simply put, a _character_ is no longer _one byte long_ when you get
beyond the characters you can see printed on your keyboard. Even
simple punctuation like these “double quotation marks” take up
_three bytes_ each in UTF-8, and stuff like ⾦ is _three bytes_ too.
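
As a rough illustration of the byte counts involved, here is a minimal Ruby sketch. It assumes a Ruby whose strings carry an encoding (1.9 or later, which postdates this thread), and the helper name `utf8_byte_counts` is made up for the example.

```ruby
# Hypothetical helper: map each character to its size in bytes
# when encoded as UTF-8.
def utf8_byte_counts(chars)
  chars.each_with_object({}) do |ch, counts|
    counts[ch] = ch.encode("UTF-8").bytesize
  end
end

samples = ["a", "\u00E9", "\u2FA6"] # "a", "é", "⾦"
utf8_byte_counts(samples).each do |ch, bytes|
  puts "#{ch}: #{ch.length} character, #{bytes} byte(s)"
end
```

Each sample is a single character, yet the byte count grows from one (ASCII) to two and three as the code point gets larger.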


The problem with UTF-8 is that the length of characters varies.  So
something like this:

a_string[434..2443]

is no longer O(1). This is why things are often stored as UCS-2
internally and converted at the boundaries. I believe this is how
the JVM handles things, but I could be completely wrong.
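
To make the cost of character indexing concrete, the sketch below finds where the n-th character starts in a UTF-8 string by walking the bytes from the beginning. It is only an illustration under the assumption of a modern Ruby (1.9 or later); `byte_offset_for_char` is a hypothetical helper, not something Ruby or Rails provides.

```ruby
# Hypothetical helper: return the byte offset at which the character with
# index char_index starts, scanning the UTF-8 bytes from the beginning.
def byte_offset_for_char(str, char_index)
  seen = 0
  str.each_byte.with_index do |byte, offset|
    # Continuation bytes look like 10xxxxxx; any other byte starts a character.
    next if (byte & 0xC0) == 0x80
    return offset if seen == char_index
    seen += 1
  end
  str.bytesize # char_index is at or past the end of the string
end

text = "a\u00E9\u2FA6b" # "aé⾦b": characters of 1, 2, 3 and 1 bytes
puts byte_offset_for_char(text, 3)
```

Because character widths vary, the offset cannot be computed from the index alone, so the scan has to touch every byte before the target. With a fixed-width encoding the same lookup would be a single multiplication.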

You're right, this is how the String class in Java stores Unicode
data internally. The problem with UCS-2 is that it only allows you to
encode the 'Basic Multilingual Plane' because you can only use 16
bits for each character. Don't confuse UCS-2 with UTF-16, where each
character can take up 2 or 4 bytes.


See http://en.wikipedia.org/wiki/UCS-2 for more on this.
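
The difference between the two can be sketched in Ruby by looking at the 16-bit code units UTF-16 produces. This assumes a modern Ruby (1.9 or later) with `String#encode`; the helper name `utf16_code_units` and the choice of U+1D11E as a character outside the Basic Multilingual Plane are mine, not from the thread.

```ruby
# Hypothetical helper: list the 16-bit code units of a string in UTF-16.
def utf16_code_units(str)
  str.encode("UTF-16BE").unpack("n*")
end

bmp           = "\u2FA6"    # inside the Basic Multilingual Plane
supplementary = "\u{1D11E}" # outside it (musical G clef)

[bmp, supplementary].each do |ch|
  units = utf16_code_units(ch).map { |u| format("%04X", u) }
  puts "U+#{format('%04X', ch.ord)}: #{units.join(' ')}"
end
```

The first character fits in one 16-bit unit, which UCS-2 can also represent. The second needs a surrogate pair, two units and four bytes, which is exactly what UCS-2 has no way to express.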

The reason we are talking about UTF-8 is that everyone is already
using this encoding in their Rails apps and that it allows you to
handle ASCII data without ever thinking about it.

See http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF for why
UTF-8 might actually be a good idea.
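
The ASCII point can be checked with a small Ruby sketch: pure ASCII text has the same bytes whether you label it US-ASCII or UTF-8. This assumes a modern Ruby (1.9 or later), and `ascii_bytes_unchanged?` is a hypothetical helper written for the example.

```ruby
# Hypothetical helper: true when the string is pure ASCII and its bytes are
# identical in US-ASCII and in UTF-8.
def ascii_bytes_unchanged?(str)
  return false unless str.ascii_only?
  str.encode("UTF-8").bytes == str.encode("US-ASCII").bytes
end

puts ascii_bytes_unchanged?("Rails")     # true
puts ascii_bytes_unchanged?("caf\u00E9") # false, "é" is not ASCII
```

So existing ASCII data needs no conversion at all, and only the non-ASCII characters use the multi-byte forms.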

But Jamis' point is a valid one. I think one of the key reasons that
Rails has been successful is that we haven't just gone mad adding
features left, right and center. Everything which gets in is taken
from an application where it's been proven. In other frameworks
where this hasn't happened you get annoying bugs and sub-par APIs.

This is a valid point, but it does not apply to this issue. Rails is
currently annoyingly buggy when you need to handle Unicode data.

i18n is something I care about, but it's not something I need for my
paid work.   I think the ideal way to get it into core is for people
who are experts *and* need it in their paid work to produce a plugin.

I think Julian might be our expert and he's currently working on a
solution. Please see his previous email for details.

Then once the plugin has been in use by the community, we can roll it
in.   I18n is extremely important,   i18n needs to end up in the core
distribution.   But we need to do it the 'rails way'.

Although you can't have proper i18n without good Unicode support,
good Unicode support is _not_ about i18n. Even if your app will never
ever handle anything but English text, you still need to handle stuff
like punctuation in text your users are copying and pasting from Word.
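
A small Ruby sketch of what goes wrong with pasted punctuation when text is cut by bytes instead of characters. It assumes a modern Ruby (1.9 or later) and is only an illustration, not a reproduction of any particular Rails bug.

```ruby
pasted = "\u201CHi\u201D" # "Hi" in curly quotes, as pasted from a word processor

by_chars = pasted[0, 2]           # first two characters
by_bytes = pasted.byteslice(0, 2) # first two bytes, cutting the opening quote apart

puts by_chars.valid_encoding? # true
puts by_bytes.valid_encoding? # false
```

The text is plain English, but truncating it at a byte boundary leaves half a character behind, which then shows up as garbage in the rendered page.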

Please, please, don't ignore this issue because David said that i18n
should be handled at the application level.


Kind regards,
Thijs van der Vossen

--
Fingertips - http://www.fngtps.com
+31 (0)6 24204845
[EMAIL PROTECTED]



_______________________________________________
Rails-core mailing list
Rails-core@lists.rubyonrails.org
http://lists.rubyonrails.org/mailman/listinfo/rails-core
