Re: [Rails-core] Investigating Unicode. Take 2, with nastities and allegations.

Kyle Maxwell Thu, 22 Dec 2005 14:13:41 -0800

On 12/21/05, Julian 'Julik' Tarkhanov <[EMAIL PROTECTED]> wrote:
> Well, I see that my last email hasn't generated any reaction from the
> Rails core team. It looks like all of them are the happy users of
> "plain text" (which, as we know by now, doesn't exist, but still).
>
> I apologize in advance for the sore bitterness of this message but I
> see that the Rails-core STILL, despite all of the efforts, sees these
> issues as something you can YAGNI away, something "optional",
> "additional" or "plugin-able".
>
> What I will try to prove in this message is that it's not
> "additional" - and more, it's got poisonous teeth and it bites
> painfully. You can forgive Matz, because he has to stay above the
> controversy and cater to the Japanese and Chinese users, and he
> dislikes Unicode (like most of the enlightened Japanese do). But you
> can't forgive _yourself_ because these are _your_ applications. As a
> developer, you are accountable.
>
> In my first email I was talking about the low-level mechanics of this
> stuff. They are interesting for a Ruby internals developer or to a
> deep-down Ruby hacker (like Jamis), but I haven't touched the
> consequences that Rails gets from this (because I thought I wouldn't
> need to draw out this knife if there was interest). Turns out we (as
> per David and others) are still in the cozy world of "Plain Text"
> though, so it's time I opened this can of worms.
>
> Let's skip for a second over all these nasty, disgusting "question mark
> in a rectangle" characters your users will see when you truncate
> their text improperly - this is, after all, the temple of Output, the
> browser domain - you sent it out, and then the browser has to cope
> with it according to Postel's law. And besides there are not many of
> them, right? Just some lousy 5.5 billion potential customers,
> right? Uhm, sorry, got a little off-topic here.
>
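
To make the truncation problem concrete, here is a minimal sketch,
assuming the text is UTF-8. The helper name truncate_code_points is
hypothetical, not part of Ruby or Rails. It cuts on code-point
boundaries instead of byte boundaries, so no character gets sliced in
half. Combining sequences can still be split, so treat it as an
illustration rather than a complete fix.

```ruby
# Hypothetical helper (not a Ruby or Rails API): truncate UTF-8 text
# by code points rather than by bytes.
def truncate_code_points(text, limit)
  # unpack("U*") decodes the UTF-8 string into an array of code points,
  # first(limit) keeps at most `limit` of them, pack("U*") re-encodes.
  text.unpack("U*").first(limit).pack("U*")
end

sample = [1070, 1083, 1080, 1082].pack("U*")  # four Cyrillic letters, eight bytes
short  = truncate_code_points(sample, 3)      # three whole letters, six bytes
```
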
> Let's move somewhat up the stack from my previous message, into a
> different domain - the one you care about. The one you foster and
> cherish. The domain of Data.
>
> Paul Battley gave a good talk at the recent EuRuKo conference in
> Munich about Unicode in Ruby. Among his other slides he had "Doing
> mischief with Unicode". Unfortunately I couldn't attend because
> EuRuKo effectively was on my birthday, so I found other fish to fry
> on that day - but you can find Paul's presentation in its gory MPEG4
> here:
>
> http://www.futurometer.com/320x240x15fps/Battley.Unicode.mov.gz
>
> I will merely expand his presentation into Rails - that's right, we
> will exploit Rails with Unicode. Let's say you are storing your data
> in Unicode (because if you don't you must spend the rest of your days
> in Hell writing Sanskrit in octets on a concrete plate with a dinner
> fork). You think your bases are covered and you did require 'jcode'.
> Except that 'jcode' won't help.
>
> Let's have a look at this nice little snippet.
>
> class User < ActiveRecord::Base
>   validates_presence_of :login
> end
>
> Looks bulletproof, doesn't it? If a user enters spaces into the form
> they are going to get String#strip'ped, and then the text in the
> field is going to be String#blank?, right? So entering all spaces
> into the Login field won't work, right?
>
> Well, it will. The Unicode standard, as of now, comprises 26 (!)
> characters which can be considered "whitespace". 26, that is, when
> used inside a string - at the boundaries there are 27 (including the
> zero-width no-break space, U+FEFF, AKA BOM).
>
> So let's try:
>
> kinda_lovely_login = [
>   0x0020,  # White_Space # Zs  SPACE
>   0x00A0,  # White_Space # Zs  NO-BREAK SPACE
>   0x202F,  # White_Space # Zs  NARROW NO-BREAK SPACE
> ].pack("U*")
>
> And lo and behold...
>
> User.new(:login=>kinda_lovely_login).save!
>
> Nice, isn't it? If you wonder - yes, this is an exploit existing in
> YOUR Rails application RIGHT NOW (albeit a mild one). That one
> application that is sooo-web 2.0, with Ajax and stuff. If you like
> it, you'd better switch to 7-bit ASCII right away before selling it to
> anyone (not that you will be successful unless you sell only to
> British and American customers, and as we all know, the Web ends
> there). And "just using UTF-8" won't help, because Unicode is hard.
>
> You wonder WHY that happens? Well... String#strip is Unicode-unaware.
> As are String#empty? and (thus) String#blank? But don't reach out
> for your fixtures just yet! Because I'm far from finished...
>
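
The gap described above can be sketched as follows. This is only an
illustration under stated assumptions: the helper unicode_blank? is a
hypothetical name, not the Rails implementation, and it assumes Ruby 1.9
or later, where the POSIX bracket class [[:space:]] matches Unicode
whitespace in a UTF-8 string.

```ruby
# Hypothetical helper (not the Rails implementation): treat a string as
# blank when it consists solely of Unicode whitespace.
def unicode_blank?(text)
  # [[:space:]] matches Unicode whitespace when the string is UTF-8
  # (assumption: Ruby 1.9 or later).
  !!(text =~ /\A[[:space:]]*\z/)
end

# The same three code points as the login above: SPACE, NO-BREAK SPACE,
# NARROW NO-BREAK SPACE.
login = [0x0020, 0x00A0, 0x202F].pack("U*")

ascii_blank   = login.strip.empty?    # false: strip only removes the ASCII space
unicode_blank = unicode_blank?(login) # true: every character is whitespace
```
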
> Let's move on:
>
> class User < ActiveRecord::Base
>   validates_size_of :name, :maximum => 5
> end
>
> Ok, this is our User. Now let's see if I can use this application:
>
> my_name = [1070, 1083, 1080, 1082].pack("U*")
>
> In case you wonder - this is my name in Russian, spelled like "Юлик".
> The one my mother gave to me.
>
> User.new(:login=>'julik', :name=>my_name).save!
>
> /usr/local/lib/ruby/gems/1.8/gems/activerecord-1.13.2/lib/
> active_record/validations.rb:711:in `save!':
> ActiveRecord::RecordInvalid (ActiveRecord::RecordInvalid)
>
> Ahem, wait a minute. You said it was 5 right? And of course you show
> it to me in a nice little error message? But I gather that my name is
> as many as 4 letters, and it fits the boundaries quite nicely. Well,
> no. String#size is not Unicode-aware, as we know - so AR just sticks
> to that. And my name turns out to be quite a bit longer than what I
> thought it might be:
>
> my_name.size
> => 8
>
> Well, sure, two bytes per character. David can stick some of his nice
> Danish diacritics in there as well, because they ought to be
> double-byte too. And yes, the fact that Ruby uses UTF-8 will nicely conceal
> this from you as long as you stay in your cozy "plain-text" land. If
> you like it THAT way you'd better stick the following into the form:
>
> "The length of your name decomposed into bytes should be less than,
> or equal to 5".
>
> I bet your users will love that.
>
> Now just do a grep on Rails sources for string.size (and friends).
> Enjoy the mess.
>
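
The byte-versus-character mismatch can be reproduced with a short
sketch, assuming UTF-8 input. The helper within_maximum? is a
hypothetical name, not how validates_size_of is implemented. It counts
code points with unpack("U*") and bytes with unpack("C*"), both of
which behave the same whether or not String#size is character-aware.

```ruby
# The same four Cyrillic letters as in the example above.
name = [1070, 1083, 1080, 1082].pack("U*")

byte_count       = name.unpack("C*").size  # 8: each letter takes two bytes in UTF-8
code_point_count = name.unpack("U*").size  # 4: the length a user would expect

# Hypothetical check (not the validates_size_of implementation):
# compare the number of code points, not bytes, against the limit.
def within_maximum?(text, maximum)
  text.unpack("U*").size <= maximum
end

fits = within_maximum?(name, 5)  # true: four characters fit a maximum of 5
```

Counting code points is still an approximation: a letter written with
combining marks occupies several code points.
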
> This is not "localization of dates and times", gentlemen, this is
> serious BAD. And if you still think these things are not serious and
> Rails can stay plain text, if you still think this can be outsourced
> and YAGNI'ed away, if you think it doesn't "touch me because most of
> my customers are American anyways", if you think you can sell THIS to
> the pointy-haired bosses, or if you think Matz (and other Japanese)
> will take care of it for you -- I admire you. Keep countin' 'em bytes.
>
> --
> Julian 'Julik' Tarkhanov
> me at julik.nl
>
>
>
> _______________________________________________
> Rails-core mailing list
> [email protected]
> http://lists.rubyonrails.org/mailman/listinfo/rails-core
>


Julian,

I think that everyone is with you about wanting great Unicode support
in Ruby.  However, all of the core team guys put in a massive effort
to get the 1.0 release out the door.  I imagine that they
need some recovery time.  Also, there's the holiday season, and many
people are spending time with friends and family.

Great Unicode support will happen sooner or later, and if you want
sooner, you should start working on a patch.  I'd love to contribute,
but I need to get through the holidays and a major product launch in
January first.

--
Kyle Maxwell
Chief Technologist
E Factor Media // FN Interactive
[EMAIL PROTECTED]
1-866-263-3261
