Hello to everyone on the Core.
Recently I promised Joshua Harvey (of the Globalize plugin fame) to
investigate the Rails code for possible multibyte issues. It is a
pity that I didn't have time to do it sooner, and my findings are sad
(although productive). Not long ago I filed bug #2103, which got a
prompt fix from Jamis.
The name of the bug reads: truncate() helper is not multibyte-safe
The actual name of it should have been: String#[] method is broken
for multibyte strings
Yes, this is not a Rails problem. Most of the String methods in Ruby
are not mb-safe, although String implies working with characters
instead of bytes. To fix the bug I filed, Jamis needed to introduce
ALL THAT (http://dev.rubyonrails.org/changeset/2265) for the fix and
the test, including a special "sandbox" mode to test the effects of
the helper. I assume that for every situation where a bug like this
is found, just as many lines are going to be needed (sandboxed test +
code fork at the end-user API). So I investigated how many such
places there might be.
The answer is the following: all of Rails. Take a look, for
example, at this file within ActiveSupport.
http://dev.rubyonrails.org/browser/trunk/activesupport/lib/
active_support/core_ext/string/access.rb
Let me tell you, all of this is broken. It's broken in Ruby and it
stays broken in Rails. When you feed these methods multibyte strings,
you had better be lucky that your Range covers complete codepoints -
otherwise you invalidate your output for ANY meaningful use (XML,
conversion to another encoding etc.) - you can slice "into" a
character and you will. And there is a very big problem which adds
insult to injury.
_Most Rails developers will never notice_. Why, you ask? Well,
here's the answer.
With $KCODE set to its "Unicode" value, Ruby assumes UTF-8. In UTF-8,
all ASCII characters stay single-byte, so you would never damage them
by using "foobar"[0..2]. And you would always get a correct
"reversed" string.
But as soon as you pop ONE umlaut in there, as soon as you enter ONE
character which is not single-byte you introduce an error. Recently,
I read this entry on the blog of Lucas Carlson.
http://tech.rufy.com/entry/93
Guess WHY he is advising me to use "require 'jcode'"? Because he
never notices that his handling is broken until both of these happen:
a) he actually enters a multibyte character into his string
b) this multibyte character happens to sit right at the "slice"
boundary of the Range
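
A minimal sketch of the two conditions above, assuming a current Ruby
(1.9 or later), where String#[] is already character-aware. There,
byteslice reproduces the byte-indexed slicing that String#[] performs
on Ruby 1.8. The variable names are only for illustration.

```ruby
# Assumption: Ruby 1.9+; byteslice stands in for the byte-indexed
# String#[] of Ruby 1.8.
ascii  = "foobar"
umlaut = "f\u00FCr Elise"  # "fur Elise" with a u-umlaut: 2 bytes in UTF-8

ascii_cut = ascii.byteslice(0, 3)   # no multibyte character at all
lucky     = umlaut.byteslice(0, 3)  # cut falls right after the umlaut
unlucky   = umlaut.byteslice(0, 2)  # cut falls between the umlaut's bytes

p ascii_cut.valid_encoding?  # true
p lucky.valid_encoding?      # true
p unlucky.valid_encoding?    # false: the last byte is half a character
```

Only the third slice is damaged, which is exactly why the problem
stays invisible for so long.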
Same with ActiveSupport. Essentially, all of the problems that Ruby
has with multibyte handling persist well into Rails, up to its
uppermost layers (such as RJS). Moreover, this is actually just the
tip of the iceberg. If we try to discover and file EVERY bug that
appears in Rails with regard to multibyte handling, hundreds of lines
of code will appear to fix the issue at the wrong level of the stack.
Let's see. To handle Unicode properly in a web app, we need it
handled correctly and transparently across the following layers:
[ database ] -- should normalize, store and sort
[ database driver ] -- should set the right client encoding
[ ruby ] -- should operate on strings properly <<BROKEN>>
[ rails ] -- should set the right headers and coordinate input and
output
--------
[ web-server ] -- should not do any implicit reencoding (some do, too
long to explain here)
[ proxy etc. ] -- same as above
[ browser ] -- should display and accept multibyte characters properly
Now the problem is that fixes such as the one for truncate() are NOT
the solution, because they fix what has to be fixed in Ruby itself.
If we look at this part of the stack more closely, we will see
(pardon my ASCII):
[ Ruby ]
[ Rails
[ [ ActiveSupport]
[ [ AR], [AP], [AWS], .....
Which means that while we are working within Rails, we can always
expect ActiveSupport to be available! Otherwise we wouldn't have
things such as symbolize_keys!, 20.days.from_now etc.
Now, Matz is promising proper multibyte Strings for Ruby 2.0. The
trouble with this is that we never know WHEN it's coming - it has
been promised for years, and the emails on "broken Unicode" in Ruby
just keep coming on ruby.lang.
So instead of reviewing ALL the (already immense) Rails codebase, I
have a simple question.
We have a number of dependencies. We know that String IS BROKEN and
it needs to be rewired. We know that most Multibyte-aware code is not
using String#methods, but Rails does use them. And we know that
ActiveSupport is implied.
OTOH, we know the following:
a) most Rails developers are not using EUC-JP or JIS
b) the ones that NEED multibyte strings are using UTF-8
c) the ones that THINK they DON'T need them have a BIG problem and
need a slap on the head
d) the regex engine we have now is already much more multibyte-aware
than the String methods
e) jcode.rb does something, but it's NOT enough
The developers in group c) stand a chance of being bitten by the
issue as soon as the First User Types The First Double-Byte Character
Into One Of Their Forms. After that you can expect many, many nasty
things to happen.
And on the other hand we have the Unicode gem. While Ruby 2.0 is far
from finished, we already have Unicode-aware case conversions,
Unicode-aware normalization and decomposition. All of these can be
easily wired into the String class itself to provide _out_of_the_box_
fixes to multibyte issues for EVERYONE who:
a) has the Unicode gem (I don't know how to get it running on Win32)
b) uses UTF-8 as the KCODE (which right now is a Rails requirement
for using multibyte strings)
c) is running with ActiveSupport loaded
This could also be made optional (with an ActiveSupport::use_utf8()
pragma-like statement). For people using EUC-JP and other Kanji
systems we really have to step out of the way (I don't understand
their languages well enough to make judgements, but I suspect that
most of what they might need from a Rails app is supported by UTF-8 -
it would just require transcoding because of the enormous amount of
other Kanji data already in the wild).
It is really that simple. Some 60 lines of String rewiring get you
very far: they free you from slicing through characters, they get you
normal reverse() and index() mechanics etc. But this is not "really"
ActiveSupport's piece of the pie, because it overrides and rewires a
substantial CORE language feature. And if someone said "it's nasty to
override the core language" I would agree - but not in the case of
Rails.
Currently, a rewired String class would provide _exactly_ the same
functionality as the default String class outlined for Ruby 2.0 by Matz
(character oriented vs. byte oriented - and that's how it works now
for ASCII).
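
To give an idea of the kind of rewiring meant here, below is a rough
sketch under stated assumptions. It is NOT the 60 lines mentioned
above and not an existing ActiveSupport API: Utf8Chars is a
hypothetical wrapper class that decodes UTF-8 into codepoints with
unpack("U*") and offers character-based length, slice, reverse and
index. It wraps rather than overrides String so that it runs
unchanged on a current Ruby.

```ruby
# Hypothetical wrapper, for illustration only.
class Utf8Chars
  def initialize(string)
    # unpack("U*") decodes UTF-8 bytes into an array of integer codepoints
    @codepoints = string.unpack("U*")
  end

  # Number of characters (codepoints), not bytes
  def length
    @codepoints.length
  end

  # start and count are measured in characters; returns nil when
  # start lies beyond the end, like Array#[]
  def slice(start, count)
    part = @codepoints[start, count]
    part && part.pack("U*")
  end

  def reverse
    @codepoints.reverse.pack("U*")
  end

  # Character offset of the first occurrence of substring, or nil
  def index(substring)
    needle = substring.unpack("U*")
    (0..(@codepoints.length - needle.length)).find do |i|
      @codepoints[i, needle.length] == needle
    end
  end
end

word = Utf8Chars.new("\u041F\u0440\u0438\u0432\u0435\u0442")  # Russian "Privet"
p word.length               # 6 (the string is 12 bytes long)
p word.slice(0, 3)          # the first three letters, intact
p word.reverse              # the same letters in reverse order
p word.index("\u0432\u0435")  # 3
```

Note that this works on codepoints, so combining sequences would
still be split apart unless the input is normalized first.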
So the question is quite simple.
Is this a viable path? Fix String for UTF-8 users once and for all
and get a substantial part of Rails to be multibyte-safe actually
_for free_, or go on, sticking our heads in the sand, finding bugs in
Rails itself and (temporarily) healing the symptoms instead of the
malady?
This brings in another issue of Unicode support. The Python and Perl
ways of doing it are to distinguish between a "bytestring" and a
"unicode string". This is the road to the apocalypse. It implies that
every developer, in every function, in every subroutine and every
block call must explicitly cast one into the other (because you never
can be sure which one you are getting). MovableType circumvents this
by processing ALL as bytestrings (doing the unpack+pack voodoo to
shake "off" the UTF flag), other packages do other things - but the
problem STICKS, because all of the developers prefer to output
"normal" bytestrings and get them in as well. Which has led me to a
simple realisation:
* * * * As long as multibyte support is optional, nobody gives a
sh..t if it works.
Let's take a simple example. Someone makes a helper that truncates
the excerpt of the entry automatically to N characters. Let's ask
ourselves: if he wanted to do it properly, would he look into the
library "ActiveSupport" which would add "safe_truncate" to String or
would he just call string[0..len] ? What would you do?
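
For comparison, a minimal sketch of both choices. The helper name
safe_truncate is hypothetical (taken from the question above, not
from Rails), and the approach of splitting with a /u regex is only
one possible way to do it. byteslice again stands in for the
byte-indexed string[0..len] of Ruby 1.8.

```ruby
# Hypothetical helper, not a Rails or ActiveSupport API.
def safe_truncate(text, max_chars)
  # With the u flag the regex engine matches whole UTF-8 characters;
  # the m flag lets "." match newlines as well.
  text.scan(/./mu).first(max_chars).join
end

# Russian "Privet, mir": every Cyrillic letter takes 2 bytes in UTF-8
excerpt = "\u041F\u0440\u0438\u0432\u0435\u0442, \u043C\u0438\u0440"

by_chars = safe_truncate(excerpt, 6)  # the six letters of the first word
by_bytes = excerpt.byteslice(0, 5)    # two letters plus half of a third

p by_chars.valid_encoding?  # true
p by_bytes.valid_encoding?  # false
```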
ActiveSupport is a very good and vast Ruby extension module. Why
couldn't we add something _really_ important to it instead of
syntactic sugar only? Something that a great many people need?
Something that would fix the whole stack UNDER the Rails components
so that nobody even has to THINK about bugs like #2103, which arise
from nothing more than the developer being unaware of the problem (as
in the post by Lucas I've linked to)?
What do you think? Please note that I am heavily biased because every
single piece of software I have used since I was 12 had problems with
Russian letters, and Rails is no exception 10 years later, on a fully
Unicode-capable Unix box. If the core language has to be bent INTO
shape (I call this "into" rather than "out of") to make things Just
Work, why not?
--
Julian 'Julik' Tarkhanov
me at julik.nl