On Sun, Sep 28, 2008 at 5:01 AM, SpringFlowers AutumnMoon <[EMAIL PROTECTED]> wrote:
> it seems that there is no parameter for the function h() (html_escape())
> to indicate the character encoding being used?
>
> for PHP, its htmlspecialchars() function has a dozen encoding possible,
> such as UTF-8, Chinese Big5, Chinese GB, Russian, Japanese.
>
> i think though, h() will work for UTF-8, since h() will only touch the
> 4 special characters
>
> < > & "
>
> and replace them with &lt; etc and those 4 characters are all in the
> 0x00 to 0x7F range, and h() will leave the other bytes intact
> (unchanged). Now, since a character in UTF-8 can be 1 to 4 bytes, and
> that any ASCII will be represented as 1 byte, which is 0x00 to 0x7F
> itself, and that 0x80 to 0xFF and other unicode characters will be 2 to
> 4 bytes long, but with the 1st to 4th bytes all being in the 0x80 to
> 0xFF range (see UTF-8 http://en.wikipedia.org/wiki/Utf-8 ), so when h()
> replaces those 4 ASCII characters, it will successfully do so when h()
> sees those 4 characters as a 1-byte character, and then it will bypass
> all the 1st to 4th bytes characters because those characters are in the
> 0x80 to 0xFF range, and therefore can never be matched as one of those 4
> special characters, so the job of replacing those 4 characters will be
> done with no side effect whatsoever done to the non-ASCII characters.

Ruby 1.8 has a global idea of character encoding, which is configured
in the $KCODE global variable. Rails 1.2 and above by default set
$KCODE to a value that means everything is UTF-8: source code, strings,
regexps, etc. It also sets an HTTP header that tells the client the
(X)HTML goes out as UTF-8. Thus, the client sends form data back in
UTF-8 as well, and everything works transparently.

When you do I/O you are responsible for knowing the encoding of
incoming data, and the expected encoding of outgoing data. You use
iconv if needed to guarantee them. Any I/O operation has to be in
control of the involved character encodings.
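
As a rough illustration of taking control at an I/O boundary, here is a
minimal sketch. It assumes Ruby 1.9 or later, where String#encode is
built in; on Ruby 1.8 the same conversion would go through the iconv
library instead. The helper name to_utf8 and the sample bytes are made
up for this example.

```ruby
# Hypothetical helper: take bytes whose encoding we know from the
# outside (file format, HTTP header, ...) and bring them into UTF-8.
def to_utf8(raw, source_encoding)
  raw.dup.force_encoding(source_encoding).encode("UTF-8")
end

latin1 = "caf\xE9".b                  # "café" as ISO-8859-1 bytes
utf8   = to_utf8(latin1, "ISO-8859-1")

puts utf8.bytes.length                # 5: the e-acute is now two bytes
puts utf8.valid_encoding?             # true
```

The point is only that the caller has to state the source encoding;
nothing in the bytes themselves says which one it is.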

Some things in Ruby 1.8 do not play well with UTF-8. For example, you
cannot compute the length of a string in characters with String#length,
because that method counts bytes. But other things do work, like
pattern matching. For example, "." really matches a character, which
may not be a single byte in UTF-8, as you point out. So if you are
using regexps you are safe in that regard.

The helper #h is really an alias of the ERb method #html_escape (it is
not a Rails helper), and that method is implemented using regexps:

    def html_escape(s)
      s.to_s.gsub(/&/, "&amp;").gsub(/\"/, "&quot;").gsub(/>/, "&gt;").gsub(/</, "&lt;")
    end

Hence, it works correctly in UTF-8: the patterns match only ASCII
characters, and no byte of a multibyte UTF-8 sequence falls in the
ASCII range, so nothing else is touched.
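
To see the regexp-based escaping leave non-ASCII text alone, here is a
small sketch. The method h_sketch is a hypothetical name that reuses
the four gsub calls from ERb's html_escape as described in this
message; the sample strings are made up.

```ruby
# Hypothetical stand-in with the same four substitutions as html_escape.
def h_sketch(s)
  s.to_s.gsub(/&/, "&amp;").gsub(/\"/, "&quot;").gsub(/>/, "&gt;").gsub(/</, "&lt;")
end

text    = "<b>日本語 & café</b>"
escaped = h_sketch(text)
puts escaped   # &lt;b&gt;日本語 &amp; café&lt;/b&gt;

# Every byte of a multibyte UTF-8 character is in 0x80..0xFF, so none
# of them can be mistaken for one of the four ASCII characters.
multibyte = "日本語é"
puts multibyte.bytes.all? { |b| b >= 0x80 }   # true
```

Only the four ASCII characters are rewritten; the Japanese text and the
accented letter come through unchanged.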

