Re: [Rails-core] Overview of Ruby 1.9 encoding problem tickets

Jeremy Kemper Mon, 19 Apr 2010 11:30:34 -0700

On Mon, Apr 19, 2010 at 6:58 AM, Czarek <[email protected]> wrote:
> SUMMARY:
> --------
>
> I tried to identify the general and root causes for these problems
> with 1.9, by taking into account non-utf encoding, current patches,
> comments and ideas. I used ticket #2188 as base for explanations.
>
> This is a long read. I wanted to include all the relevant information
> in one place. I also included information about related tickets in LH
> and their status. I decided that adding parts of this to LH would just
> add to the confusion.
>
> Two patches are included (one is from Andrew Grim) that should fix one
> issue (#2188) without breaking anything else. Two small steps for
> Rails, one giant step for proper encoding support. I hope.
>
> I welcome any feedback that would help get Rails closer to fully
> supporting Ruby 1.9 and vice-versa.
>
> SOLUTION:
> ---------
>
> The general idea is: allow only one "internal" encoding in Rails at
> any given time, based on the default Ruby encoding (or configurable).
>
> And treat any incoming external strings that cannot be converted to
> this "internal" encoding as errors in the gems in which they occur.
> And possibly report mismatches before they even "enter" Rails, by
> attempting to convert them into the "internal" encoding immediately.
>
> As a result of enforcing this, all Rails tests should work with any
> encoding that is a superset of the encodings used for input (db,
> Rack, ERB, Haml, ...) in a given environment.
>
> With an optimal setup (db encoding, Ruby encoding, Rack encoding
> settings, I18n translations, ...), no transcoding will occur during
> the rendering process, no matter which default Rails encoding is
> used (including ASCII_8BIT), and no force_encoding would be needed
> internally in Rails, except as workarounds for gems and libraries
> where this is difficult otherwise.
>
> The guideline for gem and plugin developers would be: do not create or
> return strings (other than for internal use) that are not compatible
> with the default encoding both ways.
>
> In some cases, it may be acceptable to drop or escape characters that
> cannot be transcoded (maybe Rack input, for example).


+1
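
To make the "one internal encoding" idea above concrete, here is a
minimal sketch in plain Ruby. `to_internal` is a hypothetical helper,
not an existing Rails or Ruby API. It assumes that BINARY strings can
only be relabelled and then validated, while strings in any other
encoding are transcoded, and that a failure should surface right at
the boundary.

```ruby
# Hypothetical helper (not a Rails API): bring an incoming string into
# the single "internal" encoding, or raise so the mismatch is reported
# at the boundary instead of deep inside rendering.
def to_internal(str, internal = Encoding.default_internal || Encoding.default_external)
  binary = Encoding::ASCII_8BIT
  converted =
    if str.encoding == internal
      str
    elsif str.encoding == binary || internal == binary
      # BINARY carries no charset information, so relabel the bytes
      # instead of transcoding them.
      str.dup.force_encoding(internal)
    else
      # Raises Encoding::UndefinedConversionError or
      # Encoding::InvalidByteSequenceError if conversion is impossible.
      str.encode(internal)
    end

  raise EncodingError, "not valid #{internal.name}" unless converted.valid_encoding?
  converted
end
```

A database adapter or Rack wrapper could call something like this once
per incoming string; whether unconvertible characters should instead be
dropped or escaped is a separate policy decision.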


> The idea is based on:
>
>  - Jeremy Kemper's strong attitude toward avoiding solutions
>    requiring UTF-8 as default or forcing it
>
>  - Yehuda's opinion about using UTF-8 as default in Ruby instead of
>    ASCII-8BIT
>
>  - James Edward Gray's solution for encoding issues in CSV
>
>  - the multitude of ways to set the encoding in Ruby
>
>  - giving everyone the liberty to use any encoding they want for any
>    task, without the need of porting and modifying existing code if
>    possible
>
>  - personal experience with many encoding pitfalls
>
>
> For those interested in Ruby encoding support, I very much recommend
> the extremely well written in-depth article by James Edward Gray II:
>
>    http://blog.grayproductions.net/articles/understanding_m17n
>
>
> Results of "Please do investigate":
> ----------------------------------
>
> The ticket:
>
>  #2188: (March 9th, 2009):  Encoding error in Ruby1.9 for templates
>
> Actual cause: ERB uses force_encoding("ASCII-8BIT"), for which
> "BINARY" is just an alias. This is actually ok, except for the way
> Ruby 1.9 handles concat with a non-BINARY string, e.g. UTF-8:
>
>  >> '日本'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
>  Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
>
> Although the following works (equivalent to how Ruby 1.8 works):
>
>  >> '日本'.force_encoding('BINARY').concat('語'.force_encoding('BINARY'))
>  => "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"
>
> The surprise is that it "sometimes works", when a string contains only
> 7-bit ASCII characters, giving the impression that a patch fixed the
> problem:
>
>  >> 'abc'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
>  => "abc語"
>
> (I used force_encoding here for consistency in different locale
> settings).
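
The "sometimes works" case quoted above can be checked up front with
Encoding.compatible?, which returns the encoding a concatenation would
produce, or nil when it would raise. A small sketch, where
`concat_safe?` is a hypothetical helper name:

```ruby
# Hypothetical helper: true when Ruby 1.9+ would accept a.concat(b).
def concat_safe?(a, b)
  !Encoding.compatible?(a, b).nil?
end

ascii_binary = 'abc'.dup.force_encoding('BINARY')
multi_binary = '日本'.dup.force_encoding('BINARY')
utf8         = '語'.dup.force_encoding('UTF-8')

concat_safe?(ascii_binary, utf8) # only 7-bit bytes on the left: allowed
concat_safe?(multi_binary, utf8) # high bytes on both sides: refused
```

So a test that only uses ASCII template text says nothing about
whether the encoding problem is actually fixed.
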
>
> Solutions that come to mind:
> ----------------------------
>
>  1. force_encoding should not be used, unless really necessary, and
>  this rule should be applied to ERB. Unfortunately, I have no idea
>  why ERB uses force_encoding, but I can come up with a few reasons,
>  the main one being: Rails uses ERB (a general lib) for a specific
>  purpose and requiring a non-ASCII-8BIT encoding is just as specific.
>  I would really like an opinion on this.

I don't know why ERB forces encoding to ASCII-8BIT in the absence of a
magic comment. See r21170. The ERB compiler should probably take a
default source encoding option that's used if the magic comment is
missing.
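
A rough sketch of that fallback, outside of ERB itself. Both method
names are hypothetical, and the regexp is a simplified stand-in for
ERB's own magic comment detection, so treat it as an assumption rather
than ERB's actual behaviour:

```ruby
# Hypothetical helper: use the encoding named in a leading ERB magic
# comment such as <%# -*- coding: EUC-JP -*- %>, else the given default.
def template_encoding(source, default = Encoding.default_external)
  # Match on a binary copy so the bytes need not be valid in any charset.
  first_line = source.b.lines.first.to_s
  if first_line =~ /\A<%#.*?coding\s*[:=]\s*([\w.-]+).*?%>/
    Encoding.find($1) # raises ArgumentError for an unknown name
  else
    default
  end
end

# Hypothetical helper: relabel the raw template bytes accordingly.
def read_template(source, default = Encoding.default_external)
  source.dup.force_encoding(template_encoding(source, default))
end
```

With this, a template without a magic comment ends up in the app-wide
default rather than in ASCII-8BIT.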

>  2. Don't use ERB. AFAIK, this is why Rails 3.0 works.

Using Erubis is a possibility as well.

>  3. Treat everything as binary, since the resulting file is sent to a
>  browser, which will detect the encoding anyway. This also doesn't
>  affect performance, but it ruins the whole idea of having encoding
>  support, possibly breaking test frameworks instead.

-1

>  4. Force UTF-8. This is the brute-force idea used in many patches
>  and workarounds, and this prevents commits from happening. People
>  should have a right to use non-utf8 ERB files and render in any
>  encoding e.g. EUC-JP.

-1

>  5. Try to be intelligent, and guess. This means handling
>  everything, except BINARY. The problem is how do we know what
>  encoding to use for template input? And what encoding do we use for
>  output?

We could set a single default encoding for the app, like we're doing in Rails 3.

> Solution 1 would be best, but force_encoding is already in the wild
> with Ruby 1.9, including ruby-head, so that leaves solution 5. Option
> 3 is a way to get Ruby 1.9 to behave more like 1.8, but will require
> all template input strings to be set to BINARY.
>
> Solution 5
> ----------
>
>  force_encoding has to be used at least once somewhere in
>  Rails - to fix what ERB "breaks", but on what basis should the
>  encoding be selected? For performance, there should be no
>  transcoding during rendering, unless absolutely necessary.
>
>  When we think about it, the output depends on what we want the
>  browser to receive, and that is why many people are pushing UTF-8:
>  the layout usually has UTF-8 anyway, and it would otherwise have to
>  be parsed to get the encoding from the content-type value.
>
>  The input used in rendering a template is a mixture of what web
>  designers provide, what translators use, what databases return and
>  what Rack emits, among other things.
>
>  The policy in Rails could be: "don't allow multiple encodings
>  during template rendering". I believe the effort required to do
>  otherwise is not justified.
>
>  This would force other gem developers to provide a way to set or
>  read the correct encoding they use or stick with the current
>  default. In this case (#2188), either ERB has to provide a way to
>  return the result in an encoding specified by Rails, or the
>  ERB handler should be adapted to provide this functionality.
>
>  The problem with this: ERB templates do not have an embedded
>  encoding. Which means we need a way to specify the encoding used in
>  the template.
>
>  Andrew Grim fixes this in his patch here:
>
>  https://rails.lighthouseapp.com/projects/8994/tickets/2188/a/359640/erb_encoding.diff
>
> I am only worried about the default case, when no encoding is set.
> "ASCII_8BIT", the result of ERB, is not acceptable, unless the
> "internal" encoding would also be BINARY. I would propose merging the
> following with the patch above:
>
>      def compile(template)
>        input = "<% __in_erb_template=true %>#{template.source}"
>        src = ::ERB.new(input, nil, erb_trim_mode, '@output_buffer').src
>
>        if RUBY_VERSION >= '1.9' and src.encoding != input.encoding
>          if src.encoding == Encoding::ASCII_8BIT
>            src = src.force_encoding(input.encoding) #ERB workaround
>          else
>            src = src.encode(input.encoding)
>          end
>        end
>
>        # Ruby 1.9 prepends an encoding to the source. However this is
>        # useless because you can only set an encoding on the first line
>        RUBY_VERSION >= '1.9' ? src.sub(/\A#coding:.*\n/, '') : src
>      end

The ERB compiler is supposed to preserve the input file's source
encoding unless it has a magic comment. Puzzled why this is necessary.
It should also be fixed in ERB itself, I think.
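
For reference, the workaround in the quoted patch boils down to the
following, shown here as a stand-alone sketch that can be tried without
ERB or Rails. `restore_source_encoding` is a hypothetical name, and the
input string in the usage line merely stands in for compiled ERB
output:

```ruby
# Hypothetical stand-alone version of the workaround in the quoted patch.
def restore_source_encoding(src, input_encoding)
  if src.encoding == Encoding::ASCII_8BIT
    # The bytes were only relabelled as BINARY, so relabel them back.
    src = src.dup.force_encoding(input_encoding)
  elsif src.encoding != input_encoding
    # A genuine charset mismatch: transcode.
    src = src.encode(input_encoding)
  end
  # Drop the leading "#coding:" line, as the quoted patch does.
  src.sub(/\A#coding:.*\n/, '')
end

compiled = "#coding:UTF-8\n@output_buffer << '語'".b
restore_source_encoding(compiled, Encoding::UTF_8)
```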

>  And here is an example test case, similar to many others already in
>  the tickets, which shows the issue:
>
>    <%= "日本" %><%= "語".force_encoding("UTF-8") %>
>
> A few things here to note (for both patches put together):
>
>  - the fallback encoding would be assumed to be the same as ruby
>    default, which can be set by the locale, RUBYOPT with -K option,
>    or using Encoding.default_*. I believe this is sufficient
>    flexibility.
>
>  - note that there are no assumptions regarding the charset and the
>    ASCII_8BIT case is handled with this in mind
>
>  - obviously, test cases would be executed with different Ruby
>    encoding defaults - testing one setup no longer guarantees
>    anything. Rails tests should work with almost any default
>    encoding, which means testing on at least 3 should be recommended
>    before a patch is committed: (BINARY + UTF-8 + EUC ?).
>
>  - similar conversion to the "internal" encoding would be required
>    for all strings from other engines, databases and Rack, regardless
>    of whether they are in UTF-8 or not. As for Rack and strings
>    submitted through forms, they should ultimately be also in the
>    "internal" encoding and not BINARY (unless "internal" *is*
>    BINARY), but getting this to work is a can of worms in itself
>    (AFAIK, this is true for native Japanese sites, where assuming
>    UTF-8 is almost never valid).
>
>  - there are a few other places where ERB is used, but I prefer to
>    leave that until this single case is solved. Fixing other
>    template issues should be done separately.
>
> I hope this is enough to be committed into 2-3-stable, IMHO. At least
> as a first step after many months of threads, discussions, issues,
> tickets, articles, without any fully acceptable patches or progress.
>
> Also, I believe the tickets in LH need some love - just to straighten
> out the issue and introduce more clarity. The best results would be to
> start closing the tickets with definite conclusions and guidelines, so
> that people start using Ruby 1.9 with Rails, so plugin developers in
> turn get enough time and feedback to get things right.
>
> IMPORTANT: I had no intention of offending anyone by the following
> digests - I just wanted to provide an overview of the lack of
> progress, the complexity of the issue and the willingness to help,
> despite months without progress.  I admit I have no idea what
> prevented the problem from being solved a long time ago.
>
>  Ticket #2188:
>  https://rails.lighthouseapp.com/projects/8994/tickets/2188-i18n-fails-with-multibyte-strings-in-ruby-19-similar-to-2038
>    1. Incorrect mention of I18n and #2038 as similar error
>    2. Correctly identified problem (Hector E. Gomez Morales)
>    3. Patch forcing UTF8 as workaround, #1988 reported as dup (Hector)
>    4. Unintentional hijacking with a MySQL problem (crazy_bug)
>    5. MySQL DB problem redirected to #2476 (Hector)
>    6. Unintentional hijacking with a HAML problem (Portfonica)
>    7. Jakub Kuźma identifies a wider set of problems
>    8. Jakub Kuźma identifies Rack problems
>    9. Adam S talks about setting default encoding in Rails
>    10. Jérôme points out the need for a default encoding for erb
>    files
>    11. Jeremy Kemper notes that the reports are not really helpful
>    12. Rocco Di Leo provides detailed test case, but formatting
>    problems make it unreadable
>    13. Adam S suggests solving the problem by converting ASCII ->
>    UTF8
>    14. hkstar mentions the lack of progress
>    15. Jeremy Kemper notes that the issue still hasn't been properly
>    investigated
>    16. Turns into a discussion about UTF-8 support in 1.9
>    17. Andrew Grim proposes alternative patch that honors ERB
>    template encoding
>    18. ahaller notes strange behaviour in ERB
>    19. Marcello Barnaba proposes general monkey patch for ActionView,
>    probably related to Rack issues
>    20. UVSoft proposes patch for HAML
>    21. Alberto describes the problem - just as Hector did
>    22. TICKET STATUS IS STILL OPEN WITH NO ACCEPTABLE PATCH
>
>    What I propose is combining the two patches above to close this
>    issue, and give references to non-related tickets which give a
>    similar error.

Ok, good. They'll need to be rebased against master, and I think
Andrew's patch breaks some tests since it changes the ERB line
numbers.


>  #Ticket 1988: Make utf8 partial rendering from within a content_for work in ruby1.9
>  https://rails.lighthouseapp.com/projects/8994/tickets/1988
>    1. Patch that works around the issue
>    2. Jeremy Kemper does not accept the patch because it is UTF-8-only
>    3. TICKET STATUS IS INCOMPLETE
>
>    What I propose is solving #2188 first and then investigating this
>    bug further - it could be a bad assumption about the encoding of
>    strings returned by tag helpers in a specific case.
>
>  #Ticket 2476: ASCII-8BIT encoding of query results in rails 2.3.2 and ruby 1.9.1
>  https://rails.lighthouseapp.com/projects/8994/tickets/2476
>    1. Hector describes database adapter problem with 1.9 encodings,
>    provides a mysql-ruby fork and other links
>    2. Patches and fixes for databases / adapters (James Healy, Jakub
>    Kuźma, Yugui)
>    3. Talk about assuming UTF-8 for databases
>    4. Loren Segal proposes hack instead of modifying mysql-ruby
>    5. Micheal Hasensein asks about issue 5 months later
>    6. UVSoft accidentally posts HAML workaround
>    7. TICKET STATUS IS NEW
>
>    My proposal - after fixing #2188, a short description of
>    adapters/databases and fixed versions could be presented - and
>    possibly have this issue closed, to prevent it being listed as a
>    pending UTF-8 issue. Work could be started on validation code for
>    the strings returned by database adapters and their compatibility
>    with the "internal" encoding.

+1


>    Open/new tickets related to Rack:
>
>    https://rails.lighthouseapp.com/projects/8994/tickets/3331-patch-block-invalid-chars-to-come-in-rails-app
>    https://rails.lighthouseapp.com/projects/8994/tickets/3392-rackinput-requires-ascii-8bit-encoded-stringio
>    https://rails.lighthouseapp.com/projects/8994/tickets/4336-ruby19-submitted-string-form-parameters-with-non-ascii-characters-cause-encoding-errors
>
>    My proposal: gather issues and investigate with the help of people
>    working with non-utf and non-ascii input - I believe Japan is such
>    a country, where UTF-8 assumptions about Rack input are wrong.

Rack is woefully lagging on encoding support. It needs an encoding
push of its own.

Ruby CGI has been updated to include just-enough support, e.g. for
giving an encoding for parsed query parameters.

> I would like to thank everyone who invested even the slightest bit of
> time in solving this issue.
>
> I hope the information here will help find a solution that will work
> without issues for years to come and that creating Rails applications
> will be an enjoyable experience for users, designers, developers,
> translators and all contributors, regardless of their environment and
> language preferences.

Indeed! Thanks for leading the charge, Cezary.

jeremy

-- 
You received this message because you are subscribed to the Google Groups "Ruby 
on Rails: Core" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/rubyonrails-core?hl=en.
