On Sun, Sep 28, 2008 at 5:01 AM, SpringFlowers AutumnMoon
<[EMAIL PROTECTED]> wrote:

> it seems that there is no parameter for the function h() (html_escape())
> to indicate the character encoding being used?
>
> for PHP, its htmlspecialchars() function has a dozen encoding possible,
> such as UTF-8, Chinese Big5, Chinese GB, Russia, Japanese.
>
> i think, though, h() will work for UTF-8, since h() will only touch the
> 4 special characters
>
>  <  >   &   "
>
> and replace them with &lt;  etc and those 4 characters are all in the
> 0x00 to 0x7F range, and h() will leave the other bytes intact
> (unchanged).  Now, since a character in UTF-8 can be 1 to 4 bytes, and
> that any ASCII will be represented as 1 byte, which is 0x00 to 0x7F
> itself, and that 0x80 to 0xFF and other unicode characters will be 2 to
> 4 bytes long, but with the 1st to 4th bytes all being in the 0x80 to
> 0xFF range (see UTF-8 http://en.wikipedia.org/wiki/Utf-8 ), so when h()
> replaces those 4 ASCII characters, it will successfully do so when h()
> sees those 4 characters as a 1-byte character, and then it will bypass
> all the 1st to 4th bytes characters because those characters are in the
> 0x80 to 0xFF range, and therefore can never be matched as one of those 4
> special characters, so the job of replacing those 4 characters will be
> done with no side effect whatsoever done to the non-ASCII characters.

Ruby 1.8 has a global idea of character encoding, which is configured
in the $KCODE global variable.

Rails 1.2 and above set $KCODE by default to a value that means
everything is UTF-8: source code, strings, regexps, etc. Rails also
sets an HTTP header that tells the client the (X)HTML is UTF-8. Thus,
the client sends form data back in UTF-8 as well, and everything works
transparently.

When you do I/O you are responsible for knowing the encoding of
incoming data and the encoding expected for outgoing data, and for
converting with iconv where needed. Every I/O operation has to be in
control of the character encodings involved.
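
As a minimal sketch of that conversion step: on Ruby 1.8 this is the
job of the iconv library, while on Ruby 1.9 and later String#encode
does the same thing. The example below uses String#encode so that it
runs on a current Ruby; the Latin-1 input bytes are made up for
illustration.

```ruby
# Hypothetical incoming data: "café" in ISO-8859-1, where é is the single byte 0xE9.
incoming = [0x63, 0x61, 0x66, 0xE9].pack("C*")

# We know the source encoding, so we label the bytes and transcode to UTF-8.
# (On Ruby 1.8 the counterpart would be Iconv.conv("UTF-8", "ISO-8859-1", incoming).)
utf8 = incoming.force_encoding("ISO-8859-1").encode("UTF-8")

# In UTF-8 the é becomes the two bytes 0xC3 0xA9.
utf8_bytes = utf8.unpack("C*")
puts utf8_bytes.map { |b| "%02X" % b }.join(" ")   # prints "63 61 66 C3 A9"
```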

Some stuff in Ruby 1.8 does not play well with UTF-8: for example, you
cannot compute the length of a string in characters with String#length,
because that method counts bytes. But other things do work, like
pattern matching. For example, "." really matches a character, which
may be more than one byte in UTF-8, as you point out.
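
To make the byte/character distinction concrete, here is a small
sketch. It sticks to calls that count the same way no matter how
String#length behaves (on Ruby 1.9 and later, String#length counts
characters rather than bytes); the sample string is only an
illustration.

```ruby
s = "caf\xC3\xA9"   # "café" in UTF-8: the é is the two bytes 0xC3 0xA9

byte_count   = s.unpack("C*").length   # one entry per byte
char_count   = s.unpack("U*").length   # one entry per UTF-8 code point
regexp_count = s.scan(/./mu).length    # "." in a UTF-8 regexp matches a whole character

puts "bytes: #{byte_count}, characters: #{char_count}, regexp matches: #{regexp_count}"
# prints "bytes: 5, characters: 4, regexp matches: 4"
```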

So, if you are using regexps you are safe in that regard. The helper
#h is really an ERb alias of the ERb method #html_escape (it is not a
Rails helper), and that method is implemented using regexps:


   def html_escape(s)
     s.to_s.gsub(/&/, "&amp;").gsub(/\"/, "&quot;").gsub(/>/, "&gt;").gsub(/</, "&lt;")
   end

Hence, it works correctly in UTF-8.
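
A quick way to check this is to escape a UTF-8 string and compare the
non-ASCII bytes before and after. This sketch calls
ERB::Util.html_escape from the standard library; the sample input is
made up, and newer ERB versions also escape the single quote, which
does not change the argument since that is an ASCII character too.

```ruby
require "erb"

input   = "<b>caf\xC3\xA9 & \"th\xC3\xA9\"</b>"   # contains two multi-byte é characters
escaped = ERB::Util.html_escape(input)

# Collect every byte outside the ASCII range (0x80 to 0xFF) from a string.
high_bytes = lambda { |str| str.unpack("C*").select { |b| b >= 0x80 } }

puts escaped
# true when the multi-byte sequences came through untouched
puts high_bytes.call(input) == high_bytes.call(escaped)
```

Only the ASCII characters are rewritten; the bytes that make up the two
é characters are identical in the input and the output.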
