[Puppet - Bug #10291] UTF8 non-breaking space in a manifest breaks the parser

tickets Tue, 06 Dec 2011 13:18:07 -0800

Issue #10291 has been updated by Jeff McCune.


# Additional Information #

This is a more general encoding issue with Strings in Ruby 1.9 and later.  
We'll need to detect the encoding of each file we load and set the encoding 
of the resulting String object on the fly.  Because of the related paying 
customer support ticket (535), we specifically need to make this work with 
templates and the template() function.
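
One way the detection step could look is sketched below. This is only an 
illustration, not Puppet code: the names `tag_encoding`, 
`read_with_detected_encoding` and `CANDIDATE_ENCODINGS` are hypothetical, and 
the candidate list (UTF-8 first, ISO-8859-1 as a last resort) is an assumption.

```ruby
# Hypothetical helpers; none of these names exist in Puppet.
# Assumption: try UTF-8 first, then fall back to ISO-8859-1, which accepts
# any byte sequence and therefore always matches.
CANDIDATE_ENCODINGS = [Encoding::UTF_8, Encoding::ISO_8859_1].freeze

# Return a copy of the raw bytes tagged with the first candidate encoding
# under which the bytes form a valid string.
def tag_encoding(raw, candidates = CANDIDATE_ENCODINGS)
  candidates.each do |enc|
    candidate = raw.dup.force_encoding(enc)
    return candidate if candidate.valid_encoding?
  end
  raw.dup # nothing matched: leave the bytes tagged as they were
end

# Read the file as raw bytes (ASCII-8BIT) so Ruby does not guess an
# encoding from the locale, then tag the result.
def read_with_detected_encoding(path)
  tag_encoding(File.binread(path))
end
```

A template loader could then call `read_with_detected_encoding(path)` instead 
of `File.read(path)`. Note that `force_encoding` only relabels the bytes; it 
does not transcode them.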

A great description of the context and surrounding issues is located at: 
<http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings>

<blockquote>
I suspect early contact with the new m17n engine is going to come to Rubyists 
in the form of this error message:

invalid multibyte char (US-ASCII)

Ruby 1.8 didn't care what you stuck in a random String literal, but 1.9 is a 
touch pickier. I think you'll see that the change is for the better, but we do 
need to spend some time learning to play by Ruby's new rules.

That takes us to the first of Ruby's three default Encodings.

The Source Encoding

In Ruby's new grown up world of all encoded data, each and every String needs 
an Encoding. That means an Encoding must be selected for a String as soon as it 
is created. One way that a String can be created is for Ruby to execute some 
code with a String literal in it, like this:

str = "A new String"

That's a pretty simple String, but what if I use a literal like the following 
instead?

str = "Résumé"

What Encoding is that in? That fundamental question is probably the main reason 
we all struggle a bit with character encodings. You can't tell just from 
looking at that data what Encoding it is in. Now, if I showed you the bytes you 
may be able to make an educated guess, but the data just isn't wearing an 
Encoding name tag.

That's true of a frightening lot of data we deal with every day. A plain text 
file doesn't generally say what Encoding the data inside is in. When you think 
about that, it's a miracle we can successfully read a lot of things.

When we're talking about program code, the problem gets worse. I may want to 
write my code in UTF-8, but some Japanese programmer may want to write his code 
in Shift JIS. Ruby should support that and, in fact, 1.9 does. Let's complicate 
things a bit more though: imagine that I bundle up that UTF-8 code I wrote in a 
gem and the Japanese programmer later uses it to help with his Shift JIS code. 
How do we make that work seamlessly?

The Ruby 1.8 strategy of one global variable won't survive a test like this, so 
it was time to switch strategies. Ruby 1.9's answer to this problem is the 
source Encoding.

All Ruby source code now has some Encoding. When you create a String literal in 
your code, it is assigned the Encoding of your source. That simple rule solves 
all the problems I just described pretty nicely. As long as my source Encoding is 
UTF-8 and the Japanese programmer's source Encoding is Shift JIS, my literals 
will work as I expect and his will work as he expects. Obviously if we share 
any data, we will need to establish some rules about our shared formats using 
documentation or code that can adapt to different Encodings, but we should have 
been doing that all along anyway.

Thus the only question becomes, what's my source Encoding and how do I change 
it?
</blockquote>
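
To make the quoted rule concrete, here is a small sketch of how a literal 
picks up the source Encoding. The magic comment on the first line is how Ruby 
1.9 is told the source Encoding (its default was US-ASCII); Ruby 2.0 and later 
default to UTF-8. The variable name `str` simply follows the quoted article.

```ruby
# encoding: utf-8
# The magic comment above only takes effect on the first line of a file
# (or the second, after a shebang line).

str = "Résumé"

puts __ENCODING__   # the source Encoding of this file
puts str.encoding   # a literal is assigned the source Encoding
puts str.length     # number of characters
puts str.bytesize   # number of bytes, which differs for multibyte characters
```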
----------------------------------------
Bug #10291: UTF8 non-breaking space in a manifest breaks the parser
https://projects.puppetlabs.com/issues/10291

Author: Oliver Hookins
Status: Accepted
Priority: Normal
Assignee: Jeff McCune
Category: ruby19
Target version: 
Affected Puppet version: 2.6.7
Keywords: 
Branch: 


<code>
err: Could not parse for environment production: Could not match  Yum::Repo at /home/ohookins/svn/redacted/repo.pp:4
</code>

The actual code is unremarkable, but the problem is here:

<code>
00000020  20 7b 0a 20 c2 a0 59 75  6d 3a 3a 52 65 70 6f 20  | {. ..Yum::Repo |
00000030  7b 0a 20 c2 a0 c2 a0 c2  a0 6d 65 74 61 64 61 74  |{. ......metadat|
</code>

Somehow we've ended up with a UTF-8 "nbsp" in our manifest (the byte sequence 
0xc2 0xa0 in the dump above). Sure, I can just remove these characters, but it 
suggests to me that perhaps the Unicode support in the parser is incomplete, 
which is a larger problem for internationalisation.
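
As a workaround sketch for locating such characters, the snippet below 
reports the line and column of every non-breaking space in a piece of 
manifest text. `NBSP` and `find_nbsp` are hypothetical names, and the sample 
manifest is made up to resemble the hex dump; the text is assumed to be a 
valid UTF-8 String.

```ruby
# Hypothetical helper, not part of Puppet.
NBSP = "\u00A0" # U+00A0, encoded in UTF-8 as the bytes c2 a0

# Return [line, column] pairs (both 1-based) for each non-breaking space.
# Assumes `text` is tagged as UTF-8; on a binary string each_char would
# yield single bytes and nothing would match.
def find_nbsp(text)
  hits = []
  text.each_line.with_index(1) do |line, lineno|
    line.each_char.with_index(1) do |ch, col|
      hits << [lineno, col] if ch == NBSP
    end
  end
  hits
end

# Made-up sample: a normal space followed by a non-breaking space.
manifest = "class foo {\n \u00A0Yum::Repo {\n"
find_nbsp(manifest).each do |lineno, col|
  puts "non-breaking space at line #{lineno}, column #{col}"
end
```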




