On 14.11.2009 at 21:17, Matthias Neeracher wrote:


On Nov 14, 2009, at 15:44 , MacRuby wrote:

#339: YAML error with UTF-16 string
--------------------------- +------------------------------------------------
Reporter:  d...@…          |        Owner:  lsansone...@…
    Type:  defect         |       Status:  closed
Priority:  critical       |    Milestone:  MacRuby 0.5
Component:  MacRuby        |   Resolution:  fixed
Keywords:  YAML encoding  |
--------------------------- +------------------------------------------------

Comment(by jazz...@…):

{{{
$ macruby -e 'require "yaml"; puts "Rübe".to_yaml'
--- "R\xFCbe"
$ ruby1.9 -e 'require "yaml"; puts "Rübe".to_yaml'
--- "R\xC3\xBCbe"
}}}

Seems to work now! MacRuby escapes to UTF-16 and Ruby 1.9 escapes to
UTF-8.

Actually, it seems to me (though I'm willing to be corrected on this) that the ruby1.9 encoding is simply wrong: it translates the accented character into UTF-8, and then escapes the two UTF-8 bytes separately, as if each were a character of its own. What this ends up encoding is "RÃ¼be", which is not what you want.

I didn't find anything in the YAML docs that describes that behaviour; both methods seem to be correct.

They can't possibly BOTH be correct, as interpreting the output of one according to the theory of the other would give a different result. If you look at the section in the YAML spec, <http://www.yaml.org/spec/1.2/spec.html#id2776092>, you will see

        [57] "Escaped 8-bit Unicode character."

This is NOT a UTF-8 byte: the escape names a Unicode character by its code point.
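
To make the difference concrete, here is a minimal sketch that loads both escaped forms. It assumes a current Ruby whose bundled YAML library (Psych) treats \xNN as an 8-bit Unicode character, as production [57] describes; it says nothing about how MacRuby or syck behave. The variable names are only illustrative.

```ruby
require "yaml"

# Assumption: Ruby's bundled YAML library (Psych), not MacRuby's parser or syck.
escaped_code_point = %q(--- "R\xFCbe")      # the form MacRuby printed above
escaped_utf8_bytes = %q(--- "R\xC3\xBCbe")  # the form ruby1.9 printed above

as_code_point = YAML.load(escaped_code_point)
as_utf8_bytes = YAML.load(escaped_utf8_bytes)

# What you get when every UTF-8 byte of "Rübe" is read as its own code point.
misread = "R\u00FCbe".bytes.pack("U*")

puts as_code_point              # the original word
puts as_utf8_bytes              # the mangled word
puts as_utf8_bytes == misread   # both mangled forms are the same string
```

Under that assumption, the second document loads as the six-character string with U+00C3 and U+00BC in it, not as the original word.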

But ruby 1.8 fails to load the UTF-16 YAML. That is not astonishing because IMHO there is no way to guess which escaping mode is the correct one.

It's not astonishing because (a) 1.8 has very poor Unicode support anyway and (b) this would hardly be the only bug in syck.


OK, you are right!

Until I started generating YAML in MacRuby and importing it into ruby 1.8, I hadn't done anything with Unicode, so I am not very experienced yet.


I think escaping is not necessary here because the encoding of input and
output is the same. This can easily be tested by

{{{
$ macruby -e 'require "yaml"; puts YAML::load "--- Rübe"'
Rübe
}}}
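
The same check can be written as a round trip. This is only a sketch against the YAML library bundled with a current Ruby, not MacRuby; whether the dumped document contains an escape or the raw character is left to the library, and only the result of loading it back is compared.

```ruby
require "yaml"

original = "R\u00FCbe"   # "Rübe", written with an escape so the source file stays ASCII

# Dump and load again; the string should survive regardless of how it was escaped.
document = YAML.dump(original)
restored = YAML.load(document)

# Load a document that carries the character unescaped, as in the test above.
unescaped = YAML.load("--- R\u00FCbe")

puts document
puts restored == original
puts unescaped == original
```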

That's an interesting point. I think you're right that the YAML spec does not require escaping of printable characters >\u007F. However, non-printable characters DO have to be escaped, and for the printable ones, it could be argued that erring on the side of escaping helps readability if the OS does not have font coverage for some printable characters. In any case, the current implementation tries to be conservative in what it generates and liberal in what it accepts. I'm open to persuasion that we should avoid escaping characters, provided there is a low-cost test for printability of general Unicode characters (I have not yet checked whether one of the built-in CFCharacterSets can give that; the descriptions were inconclusive).


The YAML spec, Chapter 5.1 Character Sets says:

> "To ensure readability, YAML streams use only the printable subset of the Unicode character set"

> [1] c-printable ::= #x9 | #xA | #xD | [#x20-#x7E] /* 8 bit */
> | #x85 | [#xA0-#xD7FF] | [#xE000-#xFFFD] /* 16 bit */
> | [#x10000-#x10FFFF]                     /* 32 bit */

Only characters that are not "c-printable" MUST be escaped, and this is well defined. (For double-quoted strings, " and \ have to be escaped as well.)
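
Production [1] is just a handful of code point ranges, so the MUST part is cheap to test. A minimal sketch follows; the method names c_printable? and must_escape? are made up for illustration and are not part of MacRuby or any YAML library.

```ruby
# Hypothetical helper: true if the code point is in c-printable (production [1]).
def c_printable?(code_point)
  case code_point
  when 0x9, 0xA, 0xD, 0x20..0x7E        then true  # 8 bit
  when 0x85, 0xA0..0xD7FF, 0xE000..0xFFFD then true  # 16 bit
  when 0x10000..0x10FFFF                then true  # 32 bit
  else false
  end
end

# Hypothetical helper: true if the string contains a character that MUST be escaped.
# The extra double-quoted-string cases (" and \) are not handled here.
def must_escape?(string)
  string.each_codepoint.any? { |code_point| !c_printable?(code_point) }
end

puts must_escape?("R\u00FCbe")   # U+00FC lies in #xA0-#xD7FF
puts must_escape?("bell\u0007")  # U+0007 is outside every range
```

By this test "Rübe" needs no escaping at all, while a control character such as U+0007 does.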

> "...In addition, any allowed characters known to be non-printable SHOULD also be escaped.
> This isn’t mandatory since a full implementation would require extensive character property tables."

So it is a SHOULD and not a MUST because a full check would be too expensive. The YAML spec is a little bit confusing in how it uses "allowed characters" and "non-printable characters".

Bernd


