On 23 Dec 2005, at 11:10, Michael Koziarski wrote:
Simply put, a _character_ is no longer _one byte long_ when you get
beyond the characters you can see printed on your keyboard. Even
simple punctuation like these “double quotation marks” takes up
_three bytes_ each in UTF-8, and so does stuff like ⾦.
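To make the byte counts concrete, here is a minimal sketch. It assumes a Ruby where strings carry an encoding (1.9 or later, so not the Ruby this thread was written against); `SAMPLES` and `byte_lengths` are names made up for the illustration.

```ruby
# encoding: utf-8
# Assumption: Ruby 1.9+ with UTF-8 source encoding.
# SAMPLES and byte_lengths are hypothetical names for this sketch.
SAMPLES = ["a", "\u00E9", "⾦"]

# Returns [string, length in characters, length in bytes] for each sample.
def byte_lengths(strings)
  strings.map { |s| [s, s.length, s.bytesize] }
end

byte_lengths(SAMPLES).each do |str, chars, bytes|
  puts "#{str}: #{chars} character, #{bytes} byte(s)"
end
```

Each sample is one character long, but the plain ASCII letter is one byte, the accented letter is two, and ⾦ is three.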
The problem with UTF-8 is that the length of characters varies. So
something like this:
a_string[434..2443]
is no longer O(1). This is why things are often stored as UCS-2
internally, and converted at the boundaries. I believe this is how
the JVM handles things, but I could be completely wrong.
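A rough sketch of why the slice stops being O(1): to find where character N starts in a UTF-8 string you have to walk over every character before it. This assumes Ruby 1.9 or later, and `byte_offset_of` is a hypothetical helper written for this example, not anything in Ruby or Rails.

```ruby
# Assumption: Ruby 1.9+ with encoding-aware strings.
# byte_offset_of is a hypothetical helper, not a Ruby or Rails method.
# It scans from the start of the string, so the cost grows with char_index.
def byte_offset_of(str, char_index)
  offset = 0
  str.each_char.with_index do |ch, i|
    return offset if i == char_index
    offset += ch.bytesize
  end
  offset
end

text = "na\u00EFve caf\u00E9"           # "naive cafe" with i-diaeresis and e-acute
start = byte_offset_of(text, 3)         # byte position of the 4th character
puts start
puts text[3..4]                         # slice by character index
puts text.byteslice(start, 2)           # the same slice, located by byte offset
```

With a fixed-width encoding the offset would simply be the index times the character width, which is the appeal of a 16-bit internal representation.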
You're right, this is how the String class in Java stores Unicode
data internally. The problem with UCS-2 is that it only allows you to
encode the 'Basic Multilingual Plane' because you can only use 16
bits for each character. Don't confuse UCS-2 with UTF-16, where each
character can take up 2 or 4 bytes.
See http://en.wikipedia.org/wiki/UCS-2 for more on this.
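As an illustration of the difference, the sketch below counts 16-bit code units for a character inside the Basic Multilingual Plane and for one outside it. It assumes Ruby 1.9 or later; `utf16_code_units` is a hypothetical helper, and U+1D11E (MUSICAL SYMBOL G CLEF) is simply a convenient example of a character beyond the BMP.

```ruby
# Assumption: Ruby 1.9+ with encoding-aware strings.
# utf16_code_units is a hypothetical helper for this sketch.
# "UTF-16BE" is used because it does not prepend a byte order mark.
def utf16_code_units(str)
  str.encode("UTF-16BE").bytesize / 2
end

bmp_char    = "\u2FA6"     # inside the Basic Multilingual Plane
astral_char = "\u{1D11E}"  # outside the BMP, needs a surrogate pair in UTF-16

puts utf16_code_units(bmp_char)
puts utf16_code_units(astral_char)

# The two 16-bit units that UTF-16 uses for the character outside the BMP.
puts astral_char.encode("UTF-16BE").unpack("n*").map { |u| format("%04X", u) }.join(" ")
```

UCS-2 has no surrogate pairs, so the second character cannot be represented in it at all.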
The reason we are talking about UTF-8 is that everyone is already
using this encoding in their Rails apps and that it allows you to
handle ASCII data without ever thinking about it.
See http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF for why
UTF-8 might actually be a good idea.
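The ASCII point can be checked directly: a pure ASCII string has exactly the same bytes whether you label it US-ASCII or UTF-8. A small sketch, again assuming Ruby 1.9 or later, with `same_bytes_in_utf8?` as a hypothetical helper name:

```ruby
# Assumption: Ruby 1.9+ with encoding-aware strings.
# same_bytes_in_utf8? is a hypothetical helper for this sketch.
# It only makes sense for strings that contain nothing but ASCII.
def same_bytes_in_utf8?(str)
  str.encode("US-ASCII").bytes == str.encode("UTF-8").bytes
end

ascii_text = "Rails"
puts ascii_text.ascii_only?
puts same_bytes_in_utf8?(ascii_text)
```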
But Jamis' point is a valid one. I think one of the key reasons that
Rails has been successful is that we haven't just gone mad adding
features left, right and center. Everything that gets in is taken
from an application where it's been proven. In other frameworks
where this hasn't happened you get annoying bugs and sub-par APIs.
This is a valid point, but it does not apply to this issue. Rails is
currently annoyingly buggy when you need to handle Unicode data.
i18n is something I care about, but it's not something I need for my
paid work. I think the ideal way to get it into core is for people
who are experts *and* need it in their paid work to produce a plugin.
I think Julian might be our expert and he's currently working on a
solution. Please see his previous email for details.
Then once the plugin has been in use by the community, we can roll it
in. I18n is extremely important, and it needs to end up in the core
distribution. But we need to do it the 'Rails way'.
Although you can't have proper i18n without good Unicode support,
good Unicode support is _not_ about i18n. Even if your app will never
ever handle anything but English text, you still need to handle stuff
like punctuation in text your users are copying and pasting from Word.
Please, please, don't ignore this issue because David said that i18n
should be handled at the application level.
Kind regards,
Thijs van der Vossen
--
Fingertips - http://www.fngtps.com
+31 (0)6 24204845