Re: [DBD::Pg 2/2] Commit UTF-8 design notes/discussion between DWC/GSM

Greg Sabino Mullane Sun, 17 Jul 2011 11:11:35 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


>>> I find this description confusing. What is the default value for that 
>>> setting? I mean, how can one know that?
> 
>> There is no default: it's computed on the fly at connection time, based 
>> on the server_encoding and the client_encoding.
>
> Yeah, that's what I meant. It's difficult to comprehend 
> how it calculates a value if you don't specify one.

Well, for most people it won't matter: DBD::Pg will simply do the 
right thing. Which for 99% of people will be to set client_encoding 
to UTF-8, which is really the only sensible option (excluding 
SQL_ASCII people, of course).

> There is also the PGCLIENTENCODING environment variable.

Ah, true dat.

>> Yes, or some wording along the lines of "this is an expert knob, and you 
>> really ought to leave it alone unless you really know what you are doing".

> Maybe. I'm not convinced, because if you don't set it yourself, the thing 
> it decides to do may or may not be what you expect, and it would be hard 
> to figure out why.

Well, it will set it to UTF-8, unless there is a really good reason not to. 
And the only exceptions are SQL_ASCII and if they went out of their way to 
set the client encoding themselves, in which case it would be rude of us 
to change it back on them. :)
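
For readers who want the rule above spelled out, here is a minimal sketch 
of the connection-time decision, written in Python purely as a modeling 
language. It is not DBD::Pg code: the function name and its three 
parameters are hypothetical, and "UTF8" is simply PostgreSQL's spelling 
of UTF-8.

```python
def effective_client_encoding(server_encoding, current_client_encoding,
                              user_set_client_encoding):
    """Return the client_encoding that should be in effect after connecting.

    Hypothetical helper: models the rule "set it to UTF-8 unless there is
    a really good reason not to".
    """
    # Exception 1: a SQL_ASCII server carries no encoding information,
    # so leave the client encoding alone.
    if server_encoding.upper() == "SQL_ASCII":
        return current_client_encoding
    # Exception 2: the user went out of their way to pick a client
    # encoding (for example via PGCLIENTENCODING); do not change it back.
    if user_set_client_encoding:
        return current_client_encoding
    # Everyone else gets UTF-8.
    return "UTF8"
```

With this model, a LATIN1 server with nothing set by the user ends up 
with a UTF8 client encoding, while a user-chosen LATIN9 is left as is.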

>> Better than? This is in addition to the above, to be clear. This is 
>> basically a shortcut for someone setting pg_unicode false and issuing 
>> a "SET client_encoding = 'foo'".

> Unless I set it to "utf8", in which case pg_unicode would be true and 
> client_encoding would be set to "UTF-8". Right?

Right. In most cases that will be a no-op, as those will already be set 
that way. A weak case could be argued that setting it to UTF-8 via the 
interface should turn pg_unicode *off*, to be consistent. But I think 
that's all the more reason for a separate knob, and one of the reasons 
I'm only lukewarm on the whole $h->{encoding} thing.

> I'm still on the fence about making 
> such a shortcut into a formal call. The advantage is that it removes 
> the case where someone sets client_encoding manually but forgets to 
> switch pg_unicode off.

> From the user's perspective, I think it makes much more sense. It says, 
> "Here is what I want the encoding to be," which is easier to understand 
> than "Should we or should we not convert the incoming data to Perl's 
> internal form." Most people won't know WTF that means.

Yeah, that's true. On the other hand, even the encoding setting is meant 
as sort of an expert knob.

>> We still need a flag to know if we are unicoding or not. We cannot tell just 
>> from a stored client_encoding.

> Why not? That's what pg_unicode was figuring out on its own if you didn't set 
> it.

Yes, but once we call $h->{encoding}, we need to track both the encoding and 
whether or not we are decoding. Which could be either way. Which raises 
a point: we need a way to get things back to "normal" after the user 
sets $h->{encoding} to something weird, and presumably they would do that 
by setting $h->{encoding} = 'UTF-8'. So perhaps that answers the above: we 
turn pg_unicode *on* in that case. But it still means that there is no way 
for someone to ask for a UTF-8 client_encoding but NOT have us decode 
things. Sigh.
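
To make the two pieces of state concrete, here is a small sketch of the 
behavior being debated, again in Python as a modeling language only. The 
class and method names are hypothetical stand-ins for $h->{encoding} and 
pg_unicode, and the rule "UTF-8 turns decoding on, anything else turns it 
off" is the proposal under discussion, not settled behavior.

```python
def is_utf8(name):
    """True for common spellings such as 'UTF-8', 'utf8' or 'UTF_8'."""
    return name.replace("-", "").replace("_", "").upper() == "UTF8"


class HandleState:
    """Hypothetical model of the per-handle state: an encoding plus a flag."""

    def __init__(self):
        self.client_encoding = "UTF8"
        self.pg_unicode = True  # are we decoding into Perl's internal form?

    def set_encoding(self, name):
        # Shortcut semantics: choosing an encoding also decides the flag.
        self.client_encoding = name
        self.pg_unicode = is_utf8(name)


h = HandleState()
h.set_encoding("LATIN1")   # something "weird": decoding is switched off
h.set_encoding("UTF-8")    # back to "normal": decoding is switched on again
# Note: set_encoding() alone can never yield UTF-8 with pg_unicode off,
# which is the gap lamented above.
```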

(some more of the same arguments trimmed from your reply)

>>> Maybe. utf8 ne UTF-8, quite.
> 
>> Right, but it is the best we can do.

> Well, no, it's not. We can encode it with Perl's API for encoding 
> strings. Internally it might do nothing, but we should use that 
> API if it's there.

I meant that the only thing we can do with the internal strings 
is flip the utf8 bit on or off: we have no other knobs for 
other encodings.

>> Agree with the first, but not with the second: once the user sets 
>> pg_encoding, we stop messing with their data, both incoming and outgoing, 
>> in the expectation that they have entered expert mode and want to handle 
>> things themselves. 

> I disagree. I think the value of pg_encoding should be respected and things 
> encoded and decoded appropriately (unless it's SQL_ASCII or pg_unicode is 
> off).

>> Or at the very least, we have separate flags for incoming and outgoing 
>> tweaking.

> Oy. Let's not go there yet.

How about now? :) The problem is that people have existing scripts that we 
don't want to fail, and that are trying to shove who-knows-what into the 
database, so we definitely want to clean up their mess as it comes in, but 
give them the option not to mess with it in case that is what they need. I 
think that should be a separate knob from the stuff coming back from the 
database. To put it another way, I'm happy linking the two together for most 
things but providing an expert knob, just in case they need it, that can 
de-couple them.
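
The linked-by-default, de-coupled-on-request idea could be sketched like 
this. It is a Python model only; every name in it is hypothetical and no 
such attributes exist in DBD::Pg as of this thread.

```python
from dataclasses import dataclass


@dataclass
class UnicodeFlags:
    """Hypothetical pair of knobs, one per direction."""

    fix_to_server: bool = True       # clean up data on its way into the database
    decode_from_server: bool = True  # decode data coming back from the database

    def set_unicode(self, enabled):
        # The everyday knob: moves both directions together.
        self.fix_to_server = enabled
        self.decode_from_server = enabled


flags = UnicodeFlags()
flags.set_unicode(True)      # normal use: the two stay linked
flags.fix_to_server = False  # expert use: de-couple just one direction
```

The everyday setter keeps existing scripts on the safe path, while the 
individual fields are there only for the person who needs the two 
directions to differ.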

I'm trying to make this as bulletproof as possible so that we break as few 
existing scripts as possible on the first release, and allow as much 
fine-tuning as needed from the get-go, since we cannot know what will really 
break or the strange combinations people will want until this is released in 
the wild.

>>> +DWC feels strongly that we should avoid setting the SvUTF8 flag on any
>>> +retrieved/created SV which does not require it;
> 
>> GSM feels just as strongly we should set it on everything.

> I agree.

Ball's in your court, David C. :)

- -- 
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201107171409
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAk4jJbMACgkQvJuQZxSWSsijtQCfWX1GbuKwZowqSwKFE/9jL9yD
Pv0AoLCpgCYJ6nIUpBkAwukZSmSMl80S
=IhzM
-----END PGP SIGNATURE-----
