On Thu, Jun 14, 2012 at 9:22 AM, jcbollinger <john.bollin...@stjude.org> wrote:

>
>
> On Wednesday, June 13, 2012 5:11:49 PM UTC-5, Chris Price wrote:
>>
>> [...]  Due to limitations in Puppet's representation of strings
>> (character encoding is not explicitly specified), it's not possible for us
>> to do anything too fancy when we encounter a byte sequence that is not
>> directly representable in UTF-8.
>>
>
> Is Puppet's representation of strings distinct from Ruby's
> representation?  In any case, it seems like a fundamental problem that
> Puppet is working with a bunch of strings whose encoding is uncertain.  Why
> can't that be tackled farther upstream with a mechanism for ensuring that
> Puppet uses a consistent and known encoding for strings?  Or even that it
> uses UTF-8 internally, so that no transcoding is needed when sending data
> to puppetdb?
>

Agreed, it can and should be tackled upstream! I believe there's already a
ticket for that, but I'll verify that assumption.

Your suspicions are correct: Puppet doesn't currently track the encoding of
any strings inside the language. Once a string is "inside" of Puppet, we no
longer know what its original encoding was. All we have are bytes. Nothing
is converted to an internal, "neutral" encoding, nor do we maintain
metadata about the original character set. A string in Puppet could contain
ASCII, Latin-1, UTF-8, Shift-JIS, binary data, etc., and we unfortunately
don't have any way to distinguish between them.

So until this issue is fixed in core Puppet, if we need to send those bytes
over-the-wire to a system that actually cares about the precise encoding of
what you're sending, our options are limited. What we do in the PuppetDB
terminus is apply a heuristic: we attempt to convert the string to UTF-8.
For things like ASCII (which in our research represents the lion's share of
Puppet code out there) this works fine and preserves all data. For things
like Latin-1, though, we can't transcode losslessly because we don't know
the source encoding, so we emit the warning and try to preserve as much of
the original data as we can. Once the root cause is fixed, though, PuppetDB
can take advantage of it with very minor changes to our terminus code.
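
To make that heuristic concrete, here is a rough sketch of the general
idea. It is not the actual terminus code: `utf8_or_scrubbed` is a
hypothetical name, and the sketch assumes Ruby 2.1+ for `String#scrub`,
which is not available on Ruby 1.8.

```ruby
# Hypothetical helper; NOT the actual PuppetDB terminus code.
# Assumes Ruby 2.1+ (String#scrub); Ruby 1.8 has no per-string encodings.
def utf8_or_scrubbed(bytes)
  # Reinterpret the raw bytes as UTF-8 without changing them.
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return [candidate, false] if candidate.valid_encoding?

  # Replace each invalid byte sequence with U+FFFD and flag the loss.
  [candidate.scrub("\uFFFD"), true]
end

# "caf" followed by 0xE9, the Latin-1 byte for e-acute
text, lossy = utf8_or_scrubbed("caf\xE9".b)
warn "string was not valid UTF-8; some bytes were replaced" if lossy
```

Pure ASCII passes through untouched, since ASCII is a subset of UTF-8;
anything else that isn't already valid UTF-8 loses the offending bytes.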


> Furthermore, what do you mean by "a byte sequence that is not directly
> representable in UTF-8"?  UTF-8 encodes characters as bytes, not bytes as
> bytes.  No byte sequence is inherently non-representable.  For example, you
> can encode any byte sequence in UTF-8 by assuming that it represents a
> sequence of Latin1-encoded characters, so that the bytes are also the
> characters' Unicode scalar values.  Do you perhaps mean "a byte sequence
> that isn't already valid UTF-8"?
>

Yes, that phrasing is more accurate. :)
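
For what it's worth, the Latin-1 trick described in the quoted paragraph
can be sketched in a few lines. This assumes Ruby 1.9+ string encodings,
and `latin1_bytes_to_utf8` is a hypothetical name used only for
illustration.

```ruby
# Hypothetical helper, assuming Ruby 1.9+ encoding support.
# Treat each byte as the Latin-1 character with the same code point,
# then transcode those characters to UTF-8.
def latin1_bytes_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

raw  = "\xE9\x00\xFF".b              # arbitrary bytes, not valid UTF-8
utf8 = latin1_bytes_to_utf8(raw)

utf8.valid_encoding?                                   # valid UTF-8
utf8.codepoints == raw.bytes                           # code points match byte values
utf8.encode(Encoding::ISO_8859_1).bytes == raw.bytes   # bytes are recoverable
```

The round trip preserves the bytes, not the meaning: if the input was
really Shift-JIS, say, the UTF-8 text will be the wrong characters until
someone reverses the mapping.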


>
> I understand that Ruby 1.8 has pretty dismal character encoding support,
> but there are ways to deal with it.  Surely you can do better than just an
> improved warning and a "don't do that".
>
> At least there is a potential for some user guidance.  For example, would
> the problem be adequately addressed if all manifests and data were encoded
> in UTF-8 and the agent were ensured to run in a UTF-8-based locale?
>


Correct on all counts, I think. I'll add that suggestion to the ticket.
Ultimately, this needs to be fixed in core; then downstream services
(PuppetDB, Foreman, Dashboard, etc.) can all benefit. But certainly, I
believe that improved guidelines would help. And the ticket that Chris
filed earlier against PuppetDB specifically will at least help users in the
interim figure out precisely which resources are giving us trouble.
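
In the meantime, anyone who wants to check their own manifests or data
files against the "everything in UTF-8" suggestion could use something
along these lines. It is only a sketch, assuming Ruby 1.9+;
`non_utf8_lines` and the file path are hypothetical.

```ruby
# Hypothetical helper, assuming Ruby 1.9+ encoding support.
# Returns the 1-based numbers of lines whose bytes are not valid UTF-8.
def non_utf8_lines(content)
  content.b.each_line.with_index(1)
         .reject { |line, _| line.dup.force_encoding(Encoding::UTF_8).valid_encoding? }
         .map { |_, number| number }
end

# Example usage (the path is an assumption, not a real default):
#   bad = non_utf8_lines(File.binread("manifests/site.pp"))
#   puts "non-UTF-8 bytes on lines: #{bad.join(', ')}" unless bad.empty?
```

Splitting on newlines at the byte level is safe here because the byte
0x0A never occurs inside a multi-byte UTF-8 sequence.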

Cheers,
Deepak

--
Deepak Giridharagopal / Puppet Labs / @grim_radical
