[Rails-core] Investigating i10n/i18n issues

Julian 'Julik' Tarkhanov Sun, 18 Dec 2005 11:23:36 -0800

Hello to everyone on the Core.

Recently I promised Joshua Harvey (of Globalize plugin fame) to investigate the Rails code for possible multibyte issues. Pity that I didn't have time to do it sooner, but my findings are sad (although productive). Not long ago I filed bug #2103, which got a prompt fix from Jamis.


The name of the bug reads: truncate() helper is not multibyte-safe

The actual name of it should have been: String#[] method is broken for multibyte strings

Yes, this is not a Rails problem. Most of the String methods in Ruby are not mb-safe, although String implies working with characters instead of bytes. To fix the bug I filed, Jamis needed to introduce ALL THAT (http://dev.rubyonrails.org/changeset/2265) for the fix and the test, including a special "sandbox" mode to test the effects of the helper. I assume that for every situation where a bug like this is found, just as many lines are going to be needed (sandboxed test + code fork at the end-user API). So I investigated how many such places there might be.

The answer is the following: all of Rails. Take a look, for example, at this file within ActiveSupport.

http://dev.rubyonrails.org/browser/trunk/activesupport/lib/active_support/core_ext/string/access.rb

Let me tell you, all of this is broken. It's broken in Ruby and it stays broken in Rails. Because when you feed these methods multibyte strings you had better be lucky that your Range covers complete codepoints - otherwise you invalidate your output for ANY meaningful use (XML, conversion to another encoding etc.) - you can slice "into" a character and you will. And there is a very big problem which adds insult to injury.
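A minimal sketch of what slicing "into" a character looks like. It assumes a modern Ruby, where String#[] is already character-aware, so byteslice stands in for the byte-oriented String#[] described above; the sample word is made up for illustration.

```ruby
# Assumption: modern Ruby. byteslice emulates the byte-oriented String#[]
# discussed in this mail; it is not what Rails itself calls.
word = "f\u00FCr"             # "für": the umlaut takes two bytes in UTF-8

cut = word.byteslice(0, 2)    # the byte range ends in the middle of the umlaut

puts word.length              # number of characters
puts word.bytesize            # number of bytes
puts cut.valid_encoding?      # false: the slice is no longer valid UTF-8
```

Once the slice is invalid, anything downstream that expects well-formed UTF-8 (an XML serializer, an encoding converter) has every right to reject it.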

_Most Rails developers will never notice_. Why, you ask? Well, here's the answer.

With $KCODE set to its Unicode mode ('u'), Ruby treats strings as UTF-8. In UTF-8, all ASCII characters (the unaccented Latin letters) stay single-byte, so you would never damage them by using "foobar"[0..2]. And you would always get a correct "reversed" string.

But as soon as you pop ONE umlaut in there, as soon as you enter ONE character which is not single-byte, you introduce an error. Recently, I read this entry on the blog of Lucas Carlson.


http://tech.rufy.com/entry/93

Guess WHY he is advising me to use "require 'jcode'"? Because he never notices that his handling is broken until both of these happen:

a) he actually enters a multibyte character into his string

b) this multibyte character happens to sit right at the "slice" boundary of the Range
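The two conditions can be illustrated with a small sketch. byte_prefix is a hypothetical helper that emulates byte-oriented slicing on a modern Ruby via byteslice; the strings are invented for the example.

```ruby
# Hypothetical helper: emulates a byte-oriented prefix slice.
def byte_prefix(str, nbytes)
  str.byteslice(0, nbytes)
end

ascii = "foobar"
mixed = "fo\u00F6bar"   # "foöbar": the o-umlaut occupies bytes 2 and 3

# ASCII only: every cut is safe, so the bug never shows.
puts byte_prefix(ascii, 3).valid_encoding?   # true

# Multibyte character present, but the cut misses it: still looks fine.
puts byte_prefix(mixed, 2).valid_encoding?   # true
puts byte_prefix(mixed, 4).valid_encoding?   # true

# Multibyte character present AND the cut lands inside it: broken output.
puts byte_prefix(mixed, 3).valid_encoding?   # false
```

Only the last case fails, which is why code like this can sit in production for a long time before anyone notices.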

Same with ActiveSupport. Essentially speaking, all of the problems that Ruby has with regard to multibyte handling persist well into Rails, up to its uppermost layers (such as RJS). Moreover - this is actually the tip of the iceberg. If we try to discover and file EVERY bug that appears in Rails with regard to multibyte handling, hundreds of lines of code will appear to fix the issue at the wrong level of the stack.

Let's see. To handle Unicode properly in a web app, we actually need it correctly and transparently handled across the following stack:


[ database ]        -- should normalize, store and sort
[ database driver ] -- should set the right client encoding
[ ruby ]            -- should operate on strings properly <<BROKEN>>
[ rails ]           -- should set the right headers and coordinate input and output
--------
[ web-server ]      -- should not do any implicit reencoding (some do, too long to explain here)
[ proxy etc. ]      -- same as above
[ browser ]         -- should display and accept multibyte characters properly

Now the problem is that fixes such as the one for truncate() are NOT the solution, because they fix what has to be fixed in Ruby itself. If we look at this part of the stack more closely, we will see (pardon my ASCII):


[ Ruby ]
[           Rails
[   [ ActiveSupport]
[     [ AR], [AP], [AWS], .....

Which means that while we are working within Rails, we can always expect ActiveSupport to be available! Otherwise we wouldn't have things such as symbolize_keys!, 20.days.from_now etc.

Now, Matz is promising proper multibyte Strings for Ruby 2.0. The trouble with this is that we never know WHEN it's coming - it has been promised for years, and the emails on "broken Unicode" in Ruby just keep coming on ruby.lang.

So instead of reviewing ALL the (already immense) Rails codebase, I have a simple question.

We have a number of dependencies. We know that String IS BROKEN and it needs to be rewired. We know that most multibyte-aware code is not using String#methods, but Rails does use them. And we know that ActiveSupport is implied.


OTOH, we know the following:

a) most Rails developers are not using EUC-JP or JIS

b) the ones that NEED multibyte strings are using UTF-8

c) the ones that THINK they DON'T need them have a BIG problem and need a slap on the head

d) the regex engine we have now is already much more multibyte-aware than the String methods

e) jcode.rb does something, but it's NOT enough

The people in c) have that problem because they stand a chance of being bitten by the issue as soon as a First User Types The First Double-Byte Character Into One Of Their Forms. After that you can expect many, many nasty things to happen.

And on the other hand we have the Unicode gem. While Ruby 2.0 is far from finished, we already have Unicode-aware case conversions, Unicode-aware normalization and decomposition. All of these can be easily wired into the String class itself to provide _out_of_the_box_ fixes to multibyte issues for EVERYONE who:


a) has the Unicode gem (I don't know how to get it running on Win32)

b) uses UTF8 as his KCODE (which right now is a Rails requirement for using multibyte strings)

c) is running with ActiveSupport loaded

This can also be made optional (with a pragma-like statement such as ActiveSupport::use_utf8()). For people using EUC-JP and other Kanji systems we really have to step out of the way (I don't have enough understanding of their languages to make judgements, but I suspect that most of what they might need from a Rails app is supported with UTF-8 - it would just require transcoding because of the enormous amount of other Kanji data already in the wild).

It is really that simple. Some 60 lines of String rewiring get you very far: they free you from slicing into characters, they get you normal reverse() and index() mechanics etc. But - this is not "really" ActiveSupport's piece of the pie, because it overrides and rewires a substantial CORE language feature. And if one would say "it's nasty to override the core language" I would agree - but not in the case of Rails. A rewired String class would provide _exactly_ the same functionality as the default String class outlined for Ruby 2.0 by Matz (character-oriented vs. byte-oriented - and that's how it already works now for ASCII).
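To give a rough idea of the rewiring, here is a small sketch. Utf8Chars and its methods are hypothetical names, not part of ActiveSupport or the Unicode gem; it is written as a standalone module rather than a String override to keep the example harmless, it assumes well-formed UTF-8 input (unpack("U*") raises on malformed bytes), and it works per codepoint, so combining sequences are not kept together.

```ruby
# Hypothetical module illustrating character-oriented (codepoint-oriented)
# versions of a few String operations for UTF-8 data.
module Utf8Chars
  # Decode a UTF-8 string into an array of codepoints.
  def self.codepoints(str)
    str.unpack("U*")
  end

  # Encode an array of codepoints back into a UTF-8 string.
  def self.encode(points)
    points.pack("U*")
  end

  # Number of characters, not bytes.
  def self.length(str)
    codepoints(str).length
  end

  # Slice by character positions; never cuts inside a character.
  def self.slice(str, range)
    encode(codepoints(str)[range] || [])
  end

  # Reverse character by character instead of byte by byte.
  def self.reverse(str)
    encode(codepoints(str).reverse)
  end
end

word = "\u043F\u0440\u0438\u0432\u0435\u0442"   # Russian "privet", 12 bytes

puts Utf8Chars.length(word)                        # character count
puts Utf8Chars.slice(word, 0..2).valid_encoding?   # slice stays valid UTF-8
puts Utf8Chars.reverse(word).valid_encoding?       # reverse stays valid UTF-8
```

The proposal in this mail goes one step further and wires such behaviour into String itself, so that existing calls like string[0..2] pick it up without any change at the call site.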


So the question is quite simple.

Is this a viable path? Do we fix String for UTF-8 users once and for all and get a substantial part of Rails to be multibyte-safe actually _for free_, or do we go on sticking our heads in the sand, finding bugs in Rails itself and (temporarily) healing the symptoms instead of the malady?

This brings in another issue of Unicode support. The Python and Perl ways of doing it are to distinguish between a "bytestring" and a "unicode string". This is the road to the apocalypse. It implies that every developer, in every function, in every subroutine and every block call, must explicitly cast one into the other (because you can never be sure which one you are getting). MovableType circumvents this by processing EVERYTHING as bytestrings (doing the unpack+pack voodoo to shake "off" the UTF flag), other packages do other things - but the problem STICKS, because all of the developers prefer to output "normal" bytestrings and get them in as well. Which has led me to a simple realisation:

* * * * As long as multibyte support is optional, nobody gives a sh..t if it works.

Let's take a simple example. Someone makes a helper that automatically truncates the excerpt of an entry to N characters. Let's ask ourselves: if he wanted to do it properly, would he look into the "ActiveSupport" library, which would add "safe_truncate" to String, or would he just call string[0..len]? What would you do?
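For comparison, the opt-in route might look roughly like the sketch below. The name safe_truncate is taken from the question above; the body, the omission parameter and the sample text are assumptions made for illustration, and this is not the fix that went into Rails. It leans on the regex engine, which is already UTF-8 aware with the /u flag.

```ruby
# Hypothetical opt-in helper: truncates to a number of characters, not bytes.
def safe_truncate(text, length, omission = "...")
  chars = text.scan(/./mu)          # one array element per UTF-8 character
  return text if chars.length <= length
  chars[0, length].join + omission  # cut only on character boundaries
end

excerpt = "na\u00EFve caf\u00E9"    # "naïve café": 10 characters, 12 bytes

puts safe_truncate(excerpt, 4).valid_encoding?   # the cut stays valid UTF-8
puts safe_truncate("abc", 5)                     # short input comes back as-is
```

The helper works, but only for the developers who know it exists and remember to call it, which is exactly the objection raised here.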

ActiveSupport is a very good and vast Ruby extension module. Why couldn't we add something _really_ important to it instead of syntactic sugar only? Something that really many people need? Something that would fix the whole stack UNDER the Rails components so that nobody even has to THINK about bugs like #2103, if only because of the ignorance of the developer (like in the post by Lucas I've linked to)?

What do you think? Please note that I am heavily biased because every single piece of software I have used since I was 12 had problems with Russian letters, and Rails is no exception 10 years later, on a fully Unicode-capable Unix box. If the core language has to be bent INTO shape (I call this "into" rather than "out of") to make things Just Work, why not?


--
Julian 'Julik' Tarkhanov
me at julik.nl



