Well, I see that my last email hasn't generated any reaction from the Rails core team. It looks like all of them are happy users of "plain text" (which, as we know by now, doesn't exist, but still).

I apologize in advance for the sore bitterness of this message but I see that the Rails-core STILL, despite all of the efforts, sees these issues as something you can YAGNI away, something "optional", "additional" or "plugin-able".

What I will try to prove in this message is that it's not "additional" - and more, it's got poisonous teeth and it bites painfully. You can forgive Matz, because he has to stay above the controversy and cater to the Japanese and Chinese users, and he dislikes Unicode (like most of the enlightened Japanese do). But you can't forgive _yourself_ because these are _your_ applications. As a developer, you are accountable.

In my first email I was talking about the low-level mechanics of this stuff. They are interesting to a Ruby internals developer or to a deep-down Ruby hacker (like Jamis), but I haven't touched on the consequences that Rails gets from this (because I thought I wouldn't need to draw out this knife if there was interest). Turns out we (as per David and others) are still in the cozy world of "Plain Text" though, so it's time I opened this can of worms.

Let's skip for a second all these nasty, disgusting "question mark in a rectangle" characters your users will see when you truncate their text improperly - this is, after all, the temple of Output, the browser domain - you sent it out, and then the browser has to cope with it according to Postel's law. And besides there are not many of them, right? Just some lousy 5.5 billion potential customers, right? Uhm, sorry, got a little off-topic here.
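For anyone who has not seen where those broken characters come from, here is a minimal sketch. It assumes UTF-8 data, and truncate_codepoints is a made-up helper name, not something Rails ships. It cuts the same string once by bytes and once by code points:

```ruby
# Sketch only: byte-wise versus code-point-wise truncation of UTF-8 text.
# truncate_codepoints is a hypothetical helper, not a Rails method.
text = [1070, 1083, 1080, 1082].pack("U*")  # 4 Cyrillic letters, 8 bytes in UTF-8

# Cutting after 3 bytes keeps the first letter plus half of the second one.
broken = text.unpack("C*").first(3).pack("C*")

# Cutting after 3 code points keeps every character whole.
def truncate_codepoints(str, limit)
  str.unpack("U*").first(limit).pack("U*")
end

intact = truncate_codepoints(text, 3)

puts broken.unpack("C*").inspect   # prints [208, 174, 208]
puts intact.unpack("U*").inspect   # prints [1070, 1083, 1080]
```

The trailing 208 in the first result is the lead byte of a two-byte sequence with nothing after it, which is what a browser ends up rendering as a replacement glyph.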

Let's move somewhat up the stack in my previous message, into a different domain - the one you care about. The one you foster and cherish. The domain of Data.

Paul Battley gave a good talk at the recent Eureko conference in Munich about Unicode in Ruby. Among his other slides he had "Doing mischief with Unicode". Unfortunately I couldn't attend because Eureko effectively was on my birthday, so I found other fish to fry on that day - but you can find Paul's presentation in its gory MPEG4 here:

http://www.futurometer.com/320x240x15fps/Battley.Unicode.mov.gz

I will merely expand his presentation into Rails - that's right, we will exploit Rails with Unicode. Let's say you are storing your data in Unicode (because if you don't you must spend the rest of your days in Hell writing Sanskrit in octets on a concrete plate with a dinner fork). You think your bases are covered and you did require 'jcode'. Except that 'jcode' won't help.

Let's have a look at this nice little snippet.

class User
 validates_presence_of :login
end

Looks bulletproof, doesn't it? If a user enters spaces into the form they are going to get String#strip'ped, and then the text in the field is going to be String#blank?, right? So entering all spaces into the Login field won't work, right?

Well, it will. The Unicode standard, as of now, comprises 26 (!) characters which can be considered "whitespace". 26, that is, when used inside a string - when it's at the boundaries it gets 27 (including the zero-width no-break space AKA BOM).

So let's try:

kinda_lovely_login = [
  0x0020,  # White_Space # Zs  SPACE
  0x00A0,  # White_Space # Zs  NO-BREAK SPACE
  0x202F,  # White_Space # Zs  NARROW NO-BREAK SPACE
].pack("U*")

And lo and behold...

User.new(:login=>kinda_lovely_login).save!

Nice, isn't it? If you wonder - yes, this is an exploit existing in YOUR Rails application RIGHT NOW (albeit a mild one). That one application that is sooo-web 2.0, with Ajax and stuff. If you like it, you better switch to 7-bit ASCII right away before selling it to anyone (not that you will be successful unless you only sell to the British and American customers, and as we all know, the Web ends there). And "just using UTF-8" won't help, because Unicode is hard.

You wonder WHY that happens? Well... String#strip is Unicode-unaware. As are String#empty? and (thusly) String#blank? But don't reach out for your fixtures just yet! Because I'm far from finished...
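A minimal sketch of what a Unicode-aware check could look like. Two assumptions: unicode_blank? is a hypothetical helper name, not an existing Rails method, and the sketch leans on the Unicode-aware [[:space:]] character class of Ruby 1.9 and later, so it will not behave this way on 1.8:

```ruby
# Sketch only: a blank check that knows about more than ASCII whitespace.
# unicode_blank? is a hypothetical helper; it assumes Ruby 1.9+ regexps,
# where [[:space:]] also matches Unicode space separators in UTF-8 strings.
def unicode_blank?(str)
  str.to_s =~ /\A[[:space:]]*\z/ ? true : false
end

kinda_lovely_login = [0x0020, 0x00A0, 0x202F].pack("U*")

# String#strip removes the leading ASCII space and leaves the other two alone.
puts kinda_lovely_login.strip.empty?     # prints false
puts unicode_blank?(kinda_lovely_login)  # prints true
```

The same three-character login that sails past strip and empty? is rejected once the check is done against a Unicode-aware character class.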

Let's move on:

class User
   validates_size_of :name,  :maximum=>5
end

Ok, this is our User. Now let's see if I can use this application:

my_name = [1070, 1083, 1080, 1082].pack("U*")

In case you wonder - this is my name in Russian, spelled like "Юлик". The one my mother gave to me.

User.new(:login=>'julik', :name=>my_name).save!

/usr/local/lib/ruby/gems/1.8/gems/activerecord-1.13.2/lib/active_record/validations.rb:711:in `save!': ActiveRecord::RecordInvalid (ActiveRecord::RecordInvalid)

Ahem, wait a minute. You said it was 5 right? And of course you show it to me in a nice little error message? But I gather that my name is as many as 4 letters, and it fits the boundaries quite nicely. Well, no. String#size is not Unicode-aware, as we know - so AR just sticks to that. And my name turns out to be quite a bit longer than what I thought it might be:

my_name.size
=> 8

Well, sure. Two bytes per character. David can stick some of his nice Danish diacritics in there as well, because they ought to be double-byte too. And yes, the fact that Ruby uses UTF-8 will nicely conceal this from you as long as you stay in your cozy "plain-text" land. If you like it THAT way you better stick the following into the form:

"The length of your name decomposed into bytes should be less than, or equal to 5".

I bet your users will love that.
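For comparison, a minimal sketch of a length check that counts characters instead. char_length and short_enough? are hypothetical helper names, not ActiveRecord API; the only real calls are String#unpack and Array#pack, which decode and encode UTF-8 via the "U*" directive:

```ruby
# Sketch only: measure a name in code points rather than bytes.
# char_length and short_enough? are hypothetical helpers, not ActiveRecord API.
def char_length(str)
  str.unpack("U*").length  # one array entry per Unicode code point
end

def short_enough?(str, maximum)
  char_length(str) <= maximum
end

my_name = [1070, 1083, 1080, 1082].pack("U*")

puts my_name.unpack("C*").length  # prints 8, the byte count
puts char_length(my_name)         # prints 4, the code point count
puts short_enough?(my_name, 5)    # prints true
```

Counting code points is still an approximation: a letter built from a base character plus combining marks takes more than one code point, so this is a floor for correctness rather than the whole story.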

Now just do a grep on Rails sources for string.size (and friends). Enjoy the mess.

This is not "localization of dates and times", gentlemen, this is serious BAD. And if you still think these things are not serious and Rails can stay plain text, if you still think this can be outsourced and YAGNI'ed away, if you think it doesn't "touch me because most of my customers are American anyways", if you think you can sell THIS to the pointy-haired bosses, or if you think Matz (and other Japanese) will take care of it for you -- I admire you. Keep countin' 'em bytes.

--
Julian 'Julik' Tarkhanov
me at julik.nl



_______________________________________________
Rails-core mailing list
Rails-core@lists.rubyonrails.org
http://lists.rubyonrails.org/mailman/listinfo/rails-core
