Hello to everyone on the Core.

Recently I promised Joshua Harvey (of Globalize plugin fame) to investigate the Rails code for possible multibyte issues. It's a pity I didn't have the time to do it sooner, and my findings are sad (although productive). Not long ago I filed bug #2103, which got a prompt fix from Jamis.

The name of the bug reads: truncate() helper is not multibyte-safe

The actual name of it should have been: String#[] method is broken for multibyte strings

Yes, this is not a Rails problem. Most of the String methods in Ruby are not mb-safe, although String implies working with characters rather than bytes. To fix the bug I filed, Jamis needed to introduce ALL THAT (http://dev.rubyonrails.org/changeset/2265) for the fix and the test, including a special "sandbox" mode to test the effects of the helper. I assume that for every situation where a bug like this is found, just as many lines are going to be needed (sandboxed test + code fork at the end-user API). So I investigated how many such situations there might be.

The answer is: all of Rails. Take a look, for example, at this file within ActiveSupport.

http://dev.rubyonrails.org/browser/trunk/activesupport/lib/active_support/core_ext/string/access.rb

Let me tell you, all of this is broken. It's broken in Ruby and it stays broken in Rails. Because when you feed these methods multibyte strings you had better be lucky that your Range covers complete codepoints - otherwise you invalidate your output for ANY meaningful use (XML, conversion to another encoding etc.) - you can slice "into" a character and you will. And there is a very big problem which adds insult to injury.
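To make "slicing into a character" concrete, here is a minimal sketch. It assumes a current Ruby, where String#[] already counts characters, so String#byteslice stands in for the byte-oriented String#[] discussed above. The helper name byte_cut is made up for this illustration.

```ruby
# Assumption: byteslice emulates the byte-oriented String#[] described above.
# byte_cut is a hypothetical helper name, not part of Ruby or Rails.
def byte_cut(str, nbytes)
  str.byteslice(0, nbytes)
end

text = "f\u00FCr"            # "fur" with a u-umlaut, which takes 2 bytes in UTF-8

whole = byte_cut(text, 3)    # stops on a character boundary
half  = byte_cut(text, 2)    # stops in the middle of the u-umlaut

puts whole.valid_encoding?   # the cut respected the boundary
puts half.valid_encoding?    # the cut left half a character behind
p half.bytes                 # the dangling lead byte is still in the string
```

The second result is no longer valid UTF-8, which is exactly the kind of output that breaks XML generation or a later conversion to another encoding.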

_Most Rails developers will never notice_. Why, you ask? Well, here's the answer.

By default, Ruby uses UTF-8 for the "unicoded" $KCODE setting. In UTF-8, all ASCII characters stay single-byte, so you would never damage them by using "foobar"[0..2]. And you would always get a correct "reversed" string.

But as soon as you pop ONE umlaut in there, as soon as you enter ONE character which is not single-byte you introduce an error. Recently, I read this entry on the blog of Lucas Carlson.

http://tech.rufy.com/entry/93

Guess WHY he is advising me to use "require 'jcode'"? Because he will never notice that his handling is broken until both:
a) he actually enters a multibyte character into his string
b) this multibyte character happens to sit right at the "slice" of the Range
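The two conditions can be checked mechanically. The sketch below, again assuming a current Ruby with byteslice emulating byte-oriented slicing, lists every byte length at which cutting a string would split a character. broken_cut_lengths is a hypothetical helper name.

```ruby
# Hypothetical helper: returns every byte length at which a byte-oriented
# truncation of str would leave an incomplete UTF-8 character at the end.
def broken_cut_lengths(str)
  (1..str.bytesize).reject { |n| str.byteslice(0, n).valid_encoding? }
end

p broken_cut_lengths("foobar")             # => []
p broken_cut_lengths("caf\u00E9 au lait")  # => [4]
```

A pure ASCII string has no dangerous cut points at all, and a string with a single accented letter has exactly one, which is why this kind of bug can sit unnoticed for a long time.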

Same with ActiveSupport. Essentially, all of the problems that Ruby has with regard to multibyte handling persist well into Rails, up to its uppermost layers (such as RJS). Moreover, this is just the tip of the iceberg. If we try to discover and file EVERY bug that appears in Rails with regard to multibyte handling, hundreds of lines of code will appear to fix the issue at the wrong level of the stack.

Let's see. To handle Unicode properly in a web app, we actually need it correctly and transparently handled across the following stacks:

[  database ] -- should normalize, store and sort
[  database driver ] -- should set the right client encoding
[  ruby ] -- should operate on strings properly <<BROKEN>>
[ rails ] -- should set the right headers and coordinate input and output
--------
[ web-server ] -- should not do any implicit reencoding (some do, too long to explain here)
[ proxy etc. ] -- same as above
[ browser ] -- should display and accept multibyte characters properly


Now the problem is, that fixes such as the one for truncate() are NOT the solution, because they fix what has to be fixed in Ruby itself. If we look at this part of the stack more closely, we will see (pardon my ASCII):

[ Ruby ]
[           Rails
[   [ ActiveSupport]
[     [ AR], [AP], [AWS], .....

Which means that while we are working within Rails, we can always expect ActiveSupport to be available! Otherwise we wouldn't have things such as symbolize_keys!, 20.days.from_now etc.

Now, Matz is promising proper multibyte Strings for Ruby 2.0. The trouble with this is that we never know WHEN it's coming - it has been promised for years, and the emails on "broken Unicode" in Ruby just keep coming on ruby.lang.

So instead of reviewing ALL the (already immense) Rails codebase, I have a simple question.

We have a number of dependencies. We know that String IS BROKEN and it needs to be rewired. We know that most Multibyte-aware code is not using String#methods, but Rails does use them. And we know that ActiveSupport is implied.

OTOH, we know the following:
a) most Rails developers are not using EUC-JP or JIS
b) the ones that NEED multibyte strings are using UTF-8
c) the ones that THINK they DON'T need them have a BIG problem and need a slap on the head, because they stand a chance of being bitten by the issue as soon as the First User Types The First Double-Byte Character Into One Of Their Forms. After that you can expect many, many nasty things to happen.
d) the regex engine we have now is already much more multibyte-aware than the String methods
e) jcode.rb does something, but it's NOT enough
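Point d) in the list above is easy to see in practice. A minimal sketch, assuming UTF-8 input: a regex with the /u flag matches whole characters, so scanning with it yields an array of characters rather than bytes. utf8_chars is a hypothetical helper name.

```ruby
# Hypothetical helper: splits a UTF-8 string into characters using the
# regex engine. /u selects UTF-8, /m lets "." match newlines as well.
def utf8_chars(str)
  str.scan(/./mu)
end

word = "na\u00EFve"              # 5 characters, 6 bytes in UTF-8

p word.bytesize                  # => 6
p utf8_chars(word).length        # => 5
puts utf8_chars(word)[0, 3].join # first three characters, none cut in half
```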

And on the other hand we have the Unicode gem. While Ruby 2.0 is far from finished, we already have Unicode-aware case conversions, Unicode-aware normalization and decomposition. All of these can be easily wired into the String class itself to provide _out_of_the_box_ fixes to multibyte issues for EVERYONE who:

a) has the Unicode gem (I don't know how to get it running on Win32)
b) uses UTF8 as his KCODE (which right now is a Rails requirement for using multibyte strings)
c) is running with ActiveSupport loaded

This can also be made optional (with a pragma-like statement such as ActiveSupport::use_utf8()). For people using EUC-JP and other Kanji systems we really have to step out of the way (I don't have any understanding of their languages to make judgements, but I suspect that most of what they might need from a Rails app is supported with UTF-8 - it would just require transcoding because of the enormous amount of other Kanji data already in the wild).

It is really that simple. Some 60 lines of String rewiring get you very far: they free you from slicing characters in half, they get you normal reverse() and index() mechanics etc. But this is not "really" ActiveSupport's territory, because it overrides and rewires a substantial CORE language feature. And if one would say "it's nasty to override the core language" I would agree - but not in the case of Rails. Currently, a rewired String class would provide _exactly_ the same functionality as the default String class outlined for Ruby 2.0 by Matz (character oriented vs. byte oriented - and that's how it works now for ASCII).
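As a rough idea of what such rewiring involves, here is a minimal sketch of codepoint-based operations for UTF-8 strings. It is an illustration under assumptions, not the proposed patch: the module name Utf8Chars and its methods are hypothetical, it lives in a separate module instead of overriding String, and it relies only on unpack("U*") and pack("U*") to go between a string and its codepoints.

```ruby
# Hypothetical module: character-oriented operations built on codepoints.
# unpack("U*") raises ArgumentError on malformed UTF-8 input.
module Utf8Chars
  module_function

  def codepoints(str)
    str.unpack("U*")
  end

  def length(str)
    codepoints(str).length
  end

  # Slices by character positions instead of byte positions.
  def slice(str, range)
    (codepoints(str)[range] || []).pack("U*")
  end

  def reverse(str)
    codepoints(str).reverse.pack("U*")
  end

  # Returns the character offset of needle, or nil if it is absent.
  def index(str, needle)
    hay = codepoints(str)
    pin = codepoints(needle)
    (0..hay.length - pin.length).find { |i| hay[i, pin.length] == pin }
  end
end

puts Utf8Chars.slice("f\u00FCr die", 0..1)  # two characters, three bytes
puts Utf8Chars.reverse("\u00FCber")
p Utf8Chars.index("f\u00FCr", "r")          # => 2
```

Wiring methods like these into String itself is the step that needs a decision, since it changes core behaviour for everything loaded after ActiveSupport.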

So the question is quite simple.

Is this a viable path? Fix String for UTF-8 users once and for all and get a substantial part of Rails to be multibyte-safe actually _for free_, or go on, sticking our heads in the sand, finding bugs in Rails itself and (temporarily) healing the symptoms instead of the malady?

This brings in another issue of Unicode support. The Python and Perl ways of doing it are to distinguish between a "bytestring" and a "unicode string". This way lies the apocalypse. It implies that every developer, in every function, in every subroutine and every block call must explicitly cast one into the other (because you can never be sure which one you are getting). MovableType circumvents this by processing ALL as bytestrings (doing the unpack+pack voodoo to shake "off" the UTF flag), other packages do other things - but the problem STICKS, because all of the developers prefer to output "normal" bytestrings and get them in as well. Which has led me to a simple realisation:

* * * * As long as multibyte support is optional, nobody gives a sh..t if it works.

Let's take a simple example. Someone makes a helper that truncates the excerpt of the entry automatically to N characters. Let's ask ourselves: if he wanted to do it properly, would he look into the library "ActiveSupport" which would add "safe_truncate" to String or would he just call string[0..len] ? What would you do?
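For comparison, this is roughly what the careful version of that helper looks like. It is a sketch only: safe_truncate is the hypothetical name used in the paragraph above, the signature is an assumption, and it is not the fix that went into Rails for #2103.

```ruby
# Hypothetical helper: truncates by characters, never by bytes, so the
# result is always valid UTF-8. The total result is at most `length`
# characters, including the omission marker.
def safe_truncate(text, length = 30, omission = "...")
  chars = text.scan(/./mu)                  # whole characters, via the regex engine
  return text if chars.length <= length

  keep = [length - omission.length, 0].max  # characters left after the marker
  chars.first(keep).join + omission
end

puts safe_truncate("\u00FCberm\u00E4\u00DFig lang", 8)
puts safe_truncate("f\u00FCr", 8)
```

Nobody writes these extra lines when string[0..len] appears to work, which is the whole point of the question.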

ActiveSupport is a very good and vast Ruby extension module. Why couldn't we add something _really_ important to it instead of syntactic sugar only? Something that really many people need? Something that would fix the whole stack UNDER the Rails components so that nobody even has to THINK about bugs like #2103, even when they arise from nothing more than a developer's ignorance (like in the post by Lucas I've linked to)?

What do you think? Please note that I am heavily biased, because every single piece of software I have used since I was 12 has had problems with Russian letters, and Rails is no exception 10 years later, on a fully Unicode-capable Unix box. If the core language has to be bent INTO shape (I call this "into" rather than "out of") to make things Just Work, why not?

--
Julian 'Julik' Tarkhanov
me at julik.nl



_______________________________________________
Rails-core mailing list
Rails-core@lists.rubyonrails.org
http://lists.rubyonrails.org/mailman/listinfo/rails-core
