On Mon, Apr 19, 2010 at 6:58 AM, Czarek <[email protected]> wrote:
> SUMMARY:
> --------
>
> I tried to identify the general and root causes for these problems
> with 1.9, by taking into account non-UTF encodings, current patches,
> comments and ideas. I used ticket #2188 as the base for explanations.
>
> This is a long read. I wanted to include all the relevant information
> in one place. I also included information about related tickets in LH
> and their status. I decided that adding parts of this to LH would just
> add to the confusion.
>
> Two patches are included (one is from Andrew Grim) that should fix one
> issue (#2188) without breaking anything else. Two small steps for
> Rails, one giant step for proper encoding support. I hope.
>
> I welcome any feedback that would help get Rails closer to fully
> supporting Ruby 1.9 and vice-versa.
>
> SOLUTION:
> ---------
>
> The general idea is: allow only one "internal" encoding in Rails at
> any given time, based on the default Ruby encoding (or configurable).
>
> And treat any incoming external strings that cannot be converted to
> this "internal" encoding as errors in the gems in which they occur.
> And possibly report mismatches before they even "enter" Rails, by
> attempting to convert them into the "internal" encoding immediately.
>
> As a result of enforcing this, all Rails tests should work with any
> encoding that is a superset of the encodings used for input (db,
> Rack, ERB, Haml, ...) in a given environment.
>
> With an optimal setup (db encoding, Ruby encoding, Rack encoding
> settings, I18n translations, ...), no transcoding will occur during
> the rendering process, no matter what default Rails encoding is
> used (including ASCII_8BIT), and no force_encoding would be needed
> internally in Rails, except as workarounds for gems and libraries
> where this is difficult otherwise.
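A minimal sketch of the convert-at-the-boundary idea quoted above, for illustration only. `to_internal` and its `lossy:` flag are hypothetical names, not part of Rails, Rack, or either patch; the default of `Encoding.default_external` is an assumption about what "the default Ruby encoding" would mean in practice. It uses keyword arguments and `String#scrub`, so it needs a Ruby newer than 1.9.

```ruby
# Hypothetical boundary helper (illustrative name and signature).
# Converts an incoming string to the single "internal" encoding, and
# raises an EncodingError subclass when that is not possible.
def to_internal(str, internal = Encoding.default_external, lossy: false)
  if str.encoding == internal
    # Same label: nothing to transcode, but the bytes may still be invalid.
    return str if str.valid_encoding?
    raise EncodingError, "invalid byte sequence for #{internal}" unless lossy
    return str.scrub("?")
  end

  if lossy
    # Replace whatever cannot be represented instead of failing.
    str.encode(internal, invalid: :replace, undef: :replace, replace: "?")
  else
    # Strict mode: String#encode raises Encoding::UndefinedConversionError
    # or Encoding::InvalidByteSequenceError, both EncodingError subclasses.
    str.encode(internal)
  end
end
```

In strict mode a BINARY string containing non-ASCII bytes is rejected, which matches the "treat it as an error in the gem that produced it" policy; `lossy: true` corresponds to the drop-or-escape option the proposal mentions for cases such as Rack input.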
>
> The guideline for gem and plugin developers would be: do not create or
> return strings (other than for internal use) that are not compatible
> with the default encoding both ways.
>
> In some cases, it may be acceptable to drop or escape characters that
> cannot be transcoded (maybe Rack input, for example).

+1

> The idea is based on:
>
>  - Jeremy Kemper's strong attitude toward avoiding solutions
>    requiring UTF-8 as default or forcing it
>
>  - Yehuda's opinion about using UTF-8 as default in Ruby instead of
>    ASCII-8BIT
>
>  - James Edward Gray's solution for encoding issues in CSV
>
>  - the multitude of ways to set the encoding in Ruby
>
>  - giving everyone the liberty to use any encoding they want for any
>    task, without the need of porting and modifying existing code if
>    possible
>
>  - personal experience with many encoding pitfalls
>
>
> For those interested in Ruby encoding support, I very much recommend
> the extremely well written in-depth article by James Edward Gray II:
>
>   http://blog.grayproductions.net/articles/understanding_m17n
>
>
> Results of "Please do investigate":
> ----------------------------------
>
> The ticket:
>
> #2188: (March 9th, 2009): Encoding error in Ruby1.9 for templates
>
> Actual cause: ERB uses force_encoding("ASCII-8BIT") which is just an
> alias for "BINARY". This is actually ok, except for the way Ruby 1.9
> handles concat with a non-BINARY string, e.g. UTF-8:
>
>   >> '日本'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
>   Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
>
> Although the following works (equivalent to how Ruby 1.8 works):
>
>   >> '日本'.force_encoding('BINARY').concat('語'.force_encoding('BINARY'))
>   => "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"
>
> The surprise is that it "sometimes works", when a string contains only
> 7-bit ASCII characters, giving the impression that a patch fixed the
> problem:
>
>   >> 'abc'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
>   => "abc語"
>
> (I used force_encoding here for consistency in different locale
> settings).
>
> Solutions that come to mind:
> ----------------------------
>
> 1. force_encoding should not be used, unless really necessary, and
>    this rule should be applied to ERB.
>    Unfortunately, I have no idea why ERB uses force_encoding, but I
>    can come up with a few reasons, the main one being: Rails uses ERB
>    (a general lib) for a specific purpose and requiring a
>    non-ASCII-8BIT encoding is just as specific. I would really like
>    an opinion on this.

I don't know why ERB forces encoding to ASCII-8BIT in the absence of a
magic comment. See r21170. The ERB compiler should probably take a
default source encoding option that's used if the magic comment is
missing.

> 2. Don't use ERB. AFAIK, this is why Rails 3.0 works.

Using Erubis is a possibility as well.

> 3. Treat everything as binary, since the resulting file is sent to a
>    browser, which will detect the encoding anyway. This also doesn't
>    affect performance, but it ruins the whole idea of having encoding
>    support, possibly breaking test frameworks instead.

-1

> 4. Force UTF-8. This is the brute-force idea used in many patches
>    and workarounds, and this prevents commits from happening. People
>    should have a right to use non-utf8 ERB files and render in any
>    encoding e.g. EUC-JP.

-1

> 5. Try to be intelligent, and guess. This means handling
>    everything, except BINARY. The problem is how do we know what
>    encoding to use for template input? And what encoding do we use for
>    output?

We could set a single default encoding for the app, like we're doing in
Rails 3.

> Solution 1 would be best, but force_encoding is already in the wild
> with Ruby 1.9, including ruby-head. So that leaves solution 5. Option
> 3 is a way to get Ruby 1.9 to behave more like 1.8, but will require
> all template input strings to be set to BINARY.
>
> Solution 5
> ----------
>
> force_encoding has to be used at least once somewhere in
> Rails - to fix what ERB "breaks", but on what basis should the
> encoding be selected? For performance, there should be no
> transcoding during rendering, unless absolutely necessary.
>
> When we think about it, the output depends on what we want the
> browser to receive, and that is why many people are pushing UTF-8:
> the layout usually has UTF-8 anyway, and it would otherwise have to
> be parsed to get the encoding from the content-type value.
>
> The input used in rendering a template is a mixture of what web
> designers provide, the translators use, the databases return and
> Rack emits, among other things.
>
> The policy in Rails could be: "don't allow multiple encodings
> during template rendering". I believe the effort required to do
> otherwise would not be justified.
>
> This would force other gem developers to provide a way to set or
> read the correct encoding they use or stick with the current
> default. In this case (#2188), either ERB has to provide a way to
> return the result in an encoding specified by Rails, or the
> ERB handler should be adapted to provide this functionality.
>
> The problem with this: ERB templates do not have an embedded
> encoding. Which means we need a way to specify the encoding used in
> the template.
>
> Andrew Grim fixes this in his patch here:
>
> https://rails.lighthouseapp.com/projects/8994/tickets/2188/a/359640/erb_encoding.diff
>
> I am only worried about the default case, when no encoding is set.
> "ASCII_8BIT", the result of ERB, is not acceptable, unless the
> "internal" encoding would also be BINARY. I would propose merging the
> following with the patch above:
>
>   def compile(template)
>     input = "<% __in_erb_template=true %>#{template.source}"
>     src = ::ERB.new(input, nil, erb_trim_mode, '@output_buffer').src
>
>     if RUBY_VERSION >= '1.9' and src.encoding != input.encoding
>       if src.encoding == Encoding::ASCII_8BIT
>         src = src.force_encoding(input.encoding) #ERB workaround
>       else
>         src = src.encode(input.encoding)
>       end
>     end
>
>     # Ruby 1.9 prepends an encoding to the source. However this is
>     # useless because you can only set an encoding on the first line
>     RUBY_VERSION >= '1.9' ?
>       src.sub(/\A#coding:.*\n/, '') : src
>   end

The ERB compiler is supposed to preserve the input file's source
encoding unless it has a magic comment. Puzzled why this is necessary.
It should also be fixed in ERB itself, I think.

> And here is an example test case, similar to many others already in
> the tickets, which shows the issue:
>
>   <%= "日本" %><%= "語".force_encoding("UTF-8") %>
>
> A few things here to note (for both patches put together):
>
>  - the fallback encoding would be assumed to be the same as ruby
>    default, which can be set by the locale, RUBYOPT with -K option,
>    or using Encoding.default_*. I believe this is sufficient
>    flexibility.
>
>  - note that there are no assumptions regarding the charset and the
>    ASCII_8BIT case is handled with this in mind
>
>  - obviously, test cases would be executed with different Ruby
>    encoding defaults - testing one setup no longer guarantees
>    anything. Rails tests should work with almost any default
>    encoding, which means testing on at least 3 should be recommended
>    before a patch is committed: (BINARY + UTF-8 + EUC ?).
>
>  - similar conversion to the "internal" encoding would be required
>    for all strings from other engines, databases and Rack, regardless
>    of whether they are in UTF-8 or not. As for Rack and strings
>    submitted through forms, they should ultimately be also in the
>    "internal" encoding and not BINARY (unless "internal" *is*
>    BINARY), but getting this to work is a can of worms in itself
>    (AFAIK, this is true for native Japanese sites, where assuming
>    UTF-8 is almost never valid).
>
>  - there are a few other places where ERB is used, but I prefer to
>    leave that until this single case is solved. Fixing other
>    template issues should be done separately.
>
> I hope this is enough to be committed into 2-3-stable, IMHO. At least
> as a first step after many months of threads, discussions, issues,
> tickets, articles, without any fully acceptable patches or progress.
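For reference, the relabel-versus-transcode branch in the compile method proposed earlier in this message can be tried in isolation. This is a sketch only: `match_encoding` is a hypothetical name, and it uses `dup` so the caller's string is left untouched, which the quoted patch does not do.

```ruby
# Hypothetical stand-alone version of the branch in the proposed compile
# method: bring src to the target encoding, treating BINARY specially.
def match_encoding(src, target)
  return src if src.encoding == target

  if src.encoding == Encoding::ASCII_8BIT
    # BINARY says nothing about characters, so there is nothing to
    # transcode from: only the label changes, the bytes stay the same.
    src.dup.force_encoding(target)
  else
    # A real source encoding: convert the bytes to the target encoding.
    src.encode(target)
  end
end
```

The design point is that force_encoding is only safe when the bytes are already in the target encoding (the ERB case, where a correctly encoded template came back labelled ASCII-8BIT); for any other source encoding, relabelling would silently corrupt the text, so encode is used instead.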
>
> Also, I believe the tickets in LH need some love - just to straighten
> out the issue and introduce more clarity. The best result would be to
> start closing the tickets with definite conclusions and guidelines, so
> that people start using Ruby 1.9 with Rails, so plugin developers in
> turn get enough time and feedback to get things right.
>
> IMPORTANT: I had no intention of offending anyone by the following
> digests - I just wanted to provide an overview of the lack of
> progress, the complexity of the issue and the willingness to help,
> despite months without progress. I admit I have no idea what
> prevented the problem from being solved a long time ago.
>
> Ticket #2188:
> https://rails.lighthouseapp.com/projects/8994/tickets/2188-i18n-fails-with-multibyte-strings-in-ruby-19-similar-to-2038
>   1. Incorrect mention of I18n and #2038 as similar error
>   2. Correctly identified problem (Hector E. Gomez Morales)
>   3. Patch forcing UTF8 as workaround, #1988 reported as dup (Hector)
>   4. Unintentional hijacking with a MySQL problem (crazy_bug)
>   5. MySQL DB problem redirected to #2476 (Hector)
>   6. Unintentional hijacking with a HAML problem (Portfonica)
>   7. Jakub Kuźma identifies a wider set of problems
>   8. Jakub Kuźma identifies Rack problems
>   9. Adam S talks about setting default encoding in Rails
>  10. Jérôme points out the need for a default encoding for erb
>      files
>  11. Jeremy Kemper notes that the reports are not really helpful
>  12. Rocco Di Leo provides detailed test case, but formatting
>      problems make it unreadable
>  13. Adam S suggests solving the problem by converting ASCII ->
>      UTF8
>  14. hkstar mentions the lack of progress
>  15. Jeremy Kemper notes that the issue still hasn't been properly
>      investigated
>  16. Turns into a discussion about UTF-8 support in 1.9
>  17. Andrew Grim proposes alternative patch that honors ERB
>      template encoding
>  18. ahaller notes strange behaviour in ERB
>  19.
>      Marcello Barnaba proposes general monkey patch for ActionView,
>      probably related to Rack issues
>  20. UVSoft proposes patch for HAML
>  21. Alberto describes the problem - just as Hector did
>  22. TICKET STATUS IS STILL OPEN WITH NO ACCEPTABLE PATCH
>
> What I propose is combining the two patches above to close this
> issue, and giving references to unrelated tickets which give a
> similar error.

Ok, good. They'll need to be rebased against master, and I think
Andrew's patch breaks some tests since it changes the ERB line numbers.

> Ticket #1988: Make utf8 partial rendering from within a content_for
> work in ruby1.9
> https://rails.lighthouseapp.com/projects/8994/tickets/1988
>   1. Patch that works around the issue
>   2. Jeremy Kemper does not accept the patch due to being UTF-8-only
>   3. TICKET STATUS IS INCOMPLETE
>
> What I propose is solving #2188 first and then investigating this
> bug further - it could be a bad assumption about the encoding of
> strings returned by tag helpers in a specific case.
>
> Ticket #2476: ASCII-8BIT encoding of query results in rails 2.3.2 and
> ruby 1.9.1
> https://rails.lighthouseapp.com/projects/8994/tickets/2476
>   1. Hector describes database adaptor problem with 1.9 encodings,
>      provides a mysql-ruby fork and other links
>   2. Patches and fixes for databases / adaptors (James Healy, Jakub
>      Kuźma, Yugui)
>   3. Talk about assuming UTF-8 for databases
>   4. Loren Segal proposes hack instead of modifying mysql-ruby
>   5. Micheal Hasensein asks about issue 5 months later
>   6. UVSoft accidentally posts HAML workaround
>   7. TICKET STATUS IS NEW
>
> My proposal - after fixing #2188, a short description of
> adapters/databases and fixed versions could be presented - and
> possibly have this issue closed, to prevent it being listed as a
> pending UTF-8 issue. Work could be started on validation code for
> the strings returned by database adapters and their compatibility
> with the "internal" encoding.

+1

> Open/new tickets related to Rack:
>
> https://rails.lighthouseapp.com/projects/8994/tickets/3331-patch-block-invalid-chars-to-come-in-rails-app
>
> https://rails.lighthouseapp.com/projects/8994/tickets/3392-rackinput-requires-ascii-8bit-encoded-stringio
>
> https://rails.lighthouseapp.com/projects/8994/tickets/4336-ruby19-submitted-string-form-parameters-with-non-ascii-characters-cause-encoding-errors
>
> My proposal: gather issues and investigate with the help of people
> working with non-utf and non-ascii input - I believe Japan is such
> a country, where UTF-8 assumptions about Rack input are wrong.

Rack is woefully lagging on encoding support. It needs an encoding push
of its own. Ruby CGI has updated to include just-enough support, e.g.
for giving an encoding for parsed query parameters.

> I would like to thank everyone who invested even the slightest bit of
> time in solving this issue.
>
> I hope the information here will help find a solution that will work
> without issues for years to come and that creating Rails applications
> will be an enjoyable experience for users, designers, developers,
> translators and all contributors, regardless of their environment and
> language preferences.

Indeed! Thanks for leading the charge, Cezary.

jeremy
