On 14.11.2009 at 21:17, Matthias Neeracher wrote:
On Nov 14, 2009, at 15:44, MacRuby wrote:
#339: YAML error with UTF-16 string
---------------------------+------------------------------------------------
Reporter: d...@… | Owner: lsansone...@…
Type: defect | Status: closed
Priority: critical | Milestone: MacRuby 0.5
Component: MacRuby | Resolution: fixed
Keywords: YAML encoding |
---------------------------+------------------------------------------------
Comment(by jazz...@…):
{{{
$ macruby -e 'require "yaml"; puts "Rübe".to_yaml'
--- "R\xFCbe"
$ ruby1.9 -e 'require "yaml"; puts "Rübe".to_yaml'
--- "R\xC3\xBCbe"
}}}
Seems to work now! MacRuby escapes to UTF-16 and Ruby 1.9 escapes to
UTF-8.
Actually, it seems to me (though I'm willing to be corrected on
this) that the ruby1.9 encoding is simply wrong: it translates the
accented character into UTF-8, and then escapes the two UTF-8
bytes separately. What this ends up encoding is "RÃ¼be" (U+00C3
followed by U+00BC), which is not what you want.
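To make the difference concrete, here is a minimal Ruby sketch. It assumes that each `\xNN` escape denotes one Unicode code point, and the helper `decode_x_escapes` is hypothetical (it is not part of syck or any other YAML library):

```ruby
# Hypothetical helper: decode YAML-style \xNN escapes, treating each
# escape as ONE Unicode code point (not as one byte of a UTF-8 sequence).
def decode_x_escapes(text)
  text.gsub(/\\x(\h\h)/) { [$1.hex].pack("U") }
end

# Single-quoted literals keep the backslashes, as in the YAML output above.
macruby_output = 'R\xFCbe'
ruby19_output  = 'R\xC3\xBCbe'

puts decode_x_escapes(macruby_output)  # Rübe
puts decode_x_escapes(ruby19_output)   # RÃ¼be (U+00C3 followed by U+00BC)
```

Under that reading, only the MacRuby output decodes back to the original string.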
I didn't find anything in YAML docs that describes that behaviour,
both methods seem to be correct.
They can't possibly be BOTH correct, as interpreting the output of
one according to the theory of the other would give a different
result. If you look at the section in the YAML spec,
<http://www.yaml.org/spec/1.2/spec.html#id2776092>, you will see
[57] "Escaped 8-bit Unicode character."
This is NOT a UTF-8 character.
But Ruby 1.8 fails to load the UTF-16 YAML. That is not astonishing
because IMHO there is no way to guess what the correct escaping
mode is.
It's not astonishing because (a) 1.8 has very poor Unicode support
anyway and (b) this would hardly be the only bug in syck.
OK, you are right!
When I started generating YAML in MacRuby and importing it into Ruby
1.8, I had not done anything with Unicode, so I am not very experienced
yet.
I think escaping is not necessary here because the encoding of input
and output is the same. This can easily be tested with:
{{{
$ macruby -e 'require "yaml"; puts YAML::load "--- Rübe"'
Rübe
}}}
That's an interesting point. I think you're right that the YAML spec
does not require escaping of printable characters >\u007F. However,
non-printable characters DO have to be escaped, and for the
printable ones, it could be argued that erring on the side of
escaping helps readability if the OS does not have font coverage for
some printable characters. In any case, the current implementation
tries to be conservative in what it generates and liberal in what it
accepts. I'm open to persuasion that we should avoid escaping
characters, provided there is a low-cost test for printability of
general Unicode characters (I have not yet checked whether one of
the built-in CFCharacterSets can give that; the descriptions were
inconclusive).
The YAML spec, Chapter 5.1 "Character Set", says:
> "To ensure readability, YAML streams use only the printable subset
> of the Unicode character set"
> [1] c-printable ::= #x9 | #xA | #xD | [#x20-#x7E]          /* 8 bit */
>                   | #x85 | [#xA0-#xD7FF] | [#xE000-#xFFFD] /* 16 bit */
>                   | [#x10000-#x10FFFF]                     /* 32 bit */
Only characters that are not "c-printable" MUST be escaped, and this
set is well defined. (Inside double-quoted strings, " and \ must also
be escaped as special characters.)
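The production above translates almost directly into a range check. The following Ruby sketch is only an illustration; the names `C_PRINTABLE`, `c_printable?` and `must_escape?` are made up here and do not come from any YAML library:

```ruby
# Code point ranges copied from production [1] c-printable.
C_PRINTABLE = [
  0x9..0x9, 0xA..0xA, 0xD..0xD, 0x20..0x7E,     # 8 bit
  0x85..0x85, 0xA0..0xD7FF, 0xE000..0xFFFD,     # 16 bit
  0x10000..0x10FFFF                             # 32 bit
].freeze

# True if the code point is in the printable subset allowed in a YAML stream.
def c_printable?(codepoint)
  C_PRINTABLE.any? { |range| range.cover?(codepoint) }
end

# True if a character MUST be escaped inside a double-quoted scalar:
# either it is not c-printable, or it is one of the two special characters.
def must_escape?(char)
  char == '"' || char == "\\" || !c_printable?(char.ord)
end

puts must_escape?("\u00FC")  # false: ü is c-printable
puts must_escape?("\u007F")  # true: DEL is outside every range
```

This is a cheap test because it needs no character property tables, only eight range comparisons.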
> "...In addition, any allowed characters known to be non-printable
> SHOULD also be escaped.
> This isn't mandatory since a full implementation would require
> extensive character property tables."
So it is a SHOULD and not a MUST because a full check is too expensive.
The YAML spec is a little confusing in how it uses the terms "allowed
characters" and "non-printable characters".
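For the SHOULD part, one possible approximation in Ruby is to lean on the Unicode property tables built into the regex engine. This is only a sketch: treating the general categories Cf (format), Zl (line separator) and Zp (paragraph separator) as "known to be non-printable" is my assumption, not something the YAML spec defines, and `should_escape?` is a hypothetical name:

```ruby
# Assumption: allowed (c-printable) characters in these Unicode general
# categories are invisible and therefore candidates for escaping.
NON_PRINTING = /[\p{Cf}\p{Zl}\p{Zp}]/

# True if the character is allowed in a YAML stream but SHOULD be escaped
# under the assumption above.
def should_escape?(char)
  !!(char =~ NON_PRINTING)
end

puts should_escape?("\u200B")  # true: ZERO WIDTH SPACE (category Cf)
puts should_escape?("\u00FC")  # false: ü is an ordinary letter
```

Whether MacRuby could do the same with one of the built-in CFCharacterSets is still open.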
Bernd
_______________________________________________
MacRuby-devel mailing list
MacRuby-devel@lists.macosforge.org
http://lists.macosforge.org/mailman/listinfo.cgi/macruby-devel