Well, I see that my last email hasn't generated any reaction from the
Rails core team. It looks like all of them are happy users of
"plain text" (which, as we know by now, doesn't exist, but still).
I apologize in advance for the sore bitterness of this message but I
see that the Rails-core STILL, despite all of the efforts, sees these
issues as something you can YAGNI away, something "optional",
"additional" or "plugin-able".
What I will try to prove in this message is that it's not
"additional" - and more, it's got poisonous teeth and it bites
painfully. You can forgive Matz, because he has to stay above the
controversy and cater to the Japanese and Chinese users, and he
dislikes Unicode (like most of the enlightened Japanese do). But you
can't forgive _yourself_ because these are _your_ applications. As a
developer, you are accountable.
In my first email I was talking about the low-level mechanics of this
stuff. They are interesting to a Ruby internals developer or to a
deep-down Ruby hacker (like Jamis), but I haven't touched the
consequences that Rails gets from this (because I thought I wouldn't
need to draw this knife if there was interest). Turns out we (as
per David and others) are still in the cozy world of "Plain Text"
though, so it's time I opened this can of worms.
Let's skip for a second over all these nasty, disgusting "question mark
in a rectangle" characters your users will see when you truncate
their text improperly - this is, after all, the temple of Output, the
browser domain - you sent it out, and then the browser has to cope
with it according to Postel's law. And besides there are not many of
them, right? Just some lousy 5.5 billion potential customers,
right? Uhm, sorry, got a little off-topic here.
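
To make the truncation problem concrete, here is a minimal sketch. It assumes a current Ruby (where String#byteslice and String#valid_encoding? exist), and truncate_chars is a hypothetical helper of mine, not something Rails ships. Cutting a UTF-8 string at a byte offset can split a character in half, and that is roughly where the boxed question marks come from:

```ruby
# "Юлик" built from code points, so the example does not depend on
# the encoding of this source file.
text = [1070, 1083, 1080, 1082].pack("U*")

# Naive truncation at a byte offset: every letter here takes two
# bytes in UTF-8, so 3 bytes means one and a half characters.
cut_by_bytes = text.byteslice(0, 3)
broken = !cut_by_bytes.valid_encoding?

# Hypothetical helper: truncate by code points instead of bytes.
def truncate_chars(str, max_chars)
  str.unpack("U*")[0, max_chars].pack("U*")
end

cut_by_chars = truncate_chars(text, 2)

puts "byte cut is broken UTF-8: #{broken}"
puts "char cut is valid UTF-8:  #{cut_by_chars.valid_encoding?}"
```

Note that this counts code points, which is still not the same thing as what a user perceives as one character once combining marks come into play.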
Let's move somewhat up the stack from my previous message, into a
different domain - the one you care about. The one you foster and
cherish. The domain of Data.
Paul Battley gave a good talk at the recent EuRuKo conference in
Munich about Unicode in Ruby. Among his other slides he had "Doing
mischief with Unicode". Unfortunately I couldn't attend because
EuRuKo effectively was on my birthday, so I found other fish to fry
on that day - but you can find Paul's presentation in its gory MPEG4
here:
http://www.futurometer.com/320x240x15fps/Battley.Unicode.mov.gz
I will merely expand his presentation into Rails - that's right, we
will exploit Rails with Unicode. Let's say you are storing your data
in Unicode (because if you don't you must spend the rest of your days
in Hell writing Sanskrit in octets on a concrete plate with a dinner
fork). You think your bases are covered and you did require 'jcode'.
Except that 'jcode' won't help.
Let's have a look at this nice little snippet.
class User < ActiveRecord::Base
  validates_presence_of :login
end
Looks bulletproof, doesn't it? If a user enters spaces into the form
they are going to get String#strip'ped, and then the text in the
field is going to be String#blank?, right? So entering all spaces
into the Login field won't work, right?
Well, it will. The Unicode standard, as of now, comprises 26 (!)
characters which can be considered "whitespace". 26, that is, when
used inside a string - when it's at the boundaries it gets 27
(including the zero-width no-break space AKA BOM).
So let's try:
kinda_lovely_login = [
0x0020, # White_Space # Zs SPACE
0x00A0, # White_Space # Zs NO-BREAK SPACE
0x202F, # White_Space # Zs NARROW NO-BREAK SPACE
].pack("U*")
And lo and behold...
User.new(:login=>kinda_lovely_login).save!
Nice, isn't it? If you wonder - yes, this is an exploit existing in
YOUR Rails application RIGHT NOW (albeit a mild one). That one
application that is sooo-web 2.0, with Ajax and stuff. If you like
it, you'd better switch to 7-bit ASCII right away before selling it to
anyone (not that you will be successful unless you only sell to
British and American customers, and as we all know, the Web ends
there). And "just using UTF-8" won't help, because Unicode is hard.
You wonder WHY that happens? Well... String#strip is Unicode-unaware.
As are String#empty? and (thus) String#blank?. But don't reach out
for your fixtures just yet! Because I'm far from finished...
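
For the curious, a minimal sketch of what is going on and of what a Unicode-aware check could look like. Two assumptions, stated loudly: it is written for a current Ruby, where [[:space:]] in a regexp matches Unicode whitespace in UTF-8 strings, and unicode_blank? is a hypothetical helper name of mine, not an existing Rails method.

```ruby
# The same three "spaces" as in the login above.
login = [0x0020, 0x00A0, 0x202F].pack("U*")

# String#strip only removes ASCII whitespace, so the two
# non-ASCII spaces survive and the result is not empty.
survives_strip = !login.strip.empty?

# Hypothetical helper: a string is blank when it contains no
# character outside the [[:space:]] class.
def unicode_blank?(str)
  str !~ /[^[:space:]]/
end

puts "still non-empty after strip: #{survives_strip}"
puts "unicode_blank?(login):       #{unicode_blank?(login)}"
puts "unicode_blank?('julik'):     #{unicode_blank?('julik')}"
```

A presence validation built on a check like this would reject the login above. Whether zero-width characters should also count as blank is a separate decision this sketch does not make.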
Let's move on:
class User < ActiveRecord::Base
  validates_size_of :name, :maximum=>5
end
Ok, this is our User. Now let's see if I can use this application:
my_name = [1070, 1083, 1080, 1082].pack("U*")
In case you wonder - this is my name in Russian, spelled like "Юлик".
The one my mother gave to me.
User.new(:login=>'julik', :name=>my_name).save!
/usr/local/lib/ruby/gems/1.8/gems/activerecord-1.13.2/lib/active_record/validations.rb:711:in `save!':
ActiveRecord::RecordInvalid (ActiveRecord::RecordInvalid)
Ahem, wait a minute. You said it was 5, right? And of course you show
it to me in a nice little error message? But I gather that my name is
as many as 4 letters, and it fits the boundaries quite nicely. Well,
no. String#size is not Unicode-aware, as we know - so AR just sticks
to that. And my name turns out to be quite a bit longer than what I
thought it might be:
my_name.size
=> 8
Well, sure: two bytes per character. David can stick some of his nice
Danish diacritics in there as well, because they ought to be
double-byte too. And yes, the fact that UTF-8 stores ASCII as single
bytes will nicely conceal this from you as long as you stay in your
cozy "plain-text" land. If you like it THAT way you'd better stick the
following into the form:
"The length of your name decomposed into bytes should be less than,
or equal to 5".
I bet your users will love that.
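
A minimal sketch of the difference between counting bytes and counting characters. It uses only String#unpack, so the counts do not depend on how a given Ruby version defines String#size, and within_char_limit? is a hypothetical helper, not how ActiveRecord implements validates_size_of.

```ruby
my_name = [1070, 1083, 1080, 1082].pack("U*")  # "Юлик"

# "C*" unpacks one integer per byte, "U*" one integer per
# UTF-8 code point.
byte_count = my_name.unpack("C*").size
char_count = my_name.unpack("U*").size

# Hypothetical helper: enforce a maximum in code points, which is
# what a user filling in a "Name" field would expect.
def within_char_limit?(str, max)
  str.unpack("U*").size <= max
end

puts "bytes: #{byte_count}, characters: #{char_count}"
puts "fits into 5 characters: #{within_char_limit?(my_name, 5)}"
```

Counting code points is the simplest honest answer here. A name written with combining marks would need further care, which this sketch leaves out.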
Now just do a grep on Rails sources for string.size (and friends).
Enjoy the mess.
This is not "localization of dates and times", gentlemen, this is
serious BAD. And if you still think these things are not serious and
Rails can stay plain text, if you still think this can be outsourced
and YAGNI'ed away, if you think it doesn't "touch me because most of
my customers are American anyways", if you think you can sell THIS to
the pointy-haired bosses, or if you think Matz (and other Japanese)
will take care of it for you -- I admire you. Keep countin' 'em bytes.
--
Julian 'Julik' Tarkhanov
me at julik.nl
_______________________________________________
Rails-core mailing list
Rails-core@lists.rubyonrails.org
http://lists.rubyonrails.org/mailman/listinfo/rails-core