Re: [HACKERS] plperlu problem with utf8

David E. Wheeler Thu, 16 Dec 2010 19:25:22 -0800

On Dec 16, 2010, at 6:39 PM, Alex Hunsaker wrote:

> You might argue this is a bug with URI::Escape as I *think* all URIs
> will be utf8 encoded.  Anyway, I think postgres is doing the right
> thing here.


No, URI::Escape is fine. The issue is that if you don't decode text to Perl's
internal form, Perl assumes that it's Latin-1.
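
To illustrate, here is a minimal sketch in plain Perl using the core Encode
module. The sample octets are just an example I picked ("é" in UTF-8), not
anything from the original report:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "é" (U+00E9) as it would arrive UTF-8 encoded: two octets.
my $octets = "\xc3\xa9";

# Left undecoded, Perl treats each octet as one Latin-1 character.
my $undecoded_length = length($octets);

# Decoded to Perl's internal form, it is a single character.
my $text           = decode('UTF-8', $octets);
my $decoded_length = length($text);

# Encoding the undecoded string treats the two octets as two Latin-1
# characters, so the data comes out double-encoded.
my $double_encoded = encode('UTF-8', $octets);

printf "undecoded: %d chars, decoded: %d char, double-encoded: %s\n",
    $undecoded_length, $decoded_length, unpack('H*', $double_encoded);
```

Anything that later escapes or re-encodes the undecoded string operates on
those two Latin-1 characters rather than on the one character you meant.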

> In playing around I did find what I think is a postgres bug.  Perl has
> 2 ways it can store things internally.  per perldoc perlunicode:
> 
> Using Unicode in XS
> ... What the "UTF8" flag means is that the sequence of octets in the
> representation of the scalar is the sequence of UTF-8 encoded code
> points of the characters of a string.  The "UTF8" flag being off means
> that each octet in this representation encodes a single character with
> code point 0..255 within the string.
> 
> Postgres always prints whatever the internal representation happens to
> be, ignoring the UTF8 flag and the server encoding.
> 
> # create or replace function chr(i int, i2 int) returns text as $$
> return chr($_[0]).chr($_[1]); $$ language plperlu;
> CREATE FUNCTION
> 
> # show server_encoding;
> server_encoding
> -----------------
> SQL_ASCII
> 
> # SELECT length(chr(128, 33));
> length
> --------
>      2
> 
> # SELECT length(chr(128, 333));
> length
> --------
>      4
> 
> Grr that should error out with "Invalid server encoding", or worst
> case should return a length of 3 (it utf8 encoded 128 into 2 bytes
> instead of leaving it as 1).  In this case the 333 causes perl to store
> it internally as utf8.

Well with SQL_ASCII anything goes, no?
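
For what it's worth, the length of 4 is easy to reproduce outside the
database. A minimal sketch in plain Perl that dumps the octets of a scalar's
internal representation, which is roughly what a plain SvPV hands back
(internal_octets is a name I made up for illustration):

```perl
use strict;
use warnings;

# Hex dump of the octets Perl currently stores for a string.
# With the UTF8 flag on, the stored octets are the UTF-8 encoding;
# with it off, each character is stored as a single octet.
sub internal_octets {
    my ($string) = @_;                                 # work on a copy
    utf8::encode($string) if utf8::is_utf8($string);
    return unpack('H*', $string);
}

my $narrow = chr(128) . chr(33);     # both code points fit in one octet
my $wide   = chr(128) . chr(333);    # 333 forces the UTF8 flag on

printf "narrow: flag=%d octets=%s\n",
    (utf8::is_utf8($narrow) ? 1 : 0), internal_octets($narrow);
printf "wide:   flag=%d octets=%s\n",
    (utf8::is_utf8($wide) ? 1 : 0), internal_octets($wide);
```

In the second string, chr(128) is stored as the two octets c2 80 once the
whole scalar is upgraded, which accounts for the fourth byte.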

> Now on a utf8 database:
> 
> # show server_encoding;
> server_encoding
> -----------------
> UTF8
> 
> # SELECT length(chr(128, 33));
> ERROR:  invalid byte sequence for encoding "UTF8": 0x80
> CONTEXT:  PL/Perl function "chr"
> 
> # SELECT length(chr(128, 333));
> CONTEXT:  PL/Perl function "chr"
> length
> --------
>      2
> 
> Same thing here, we just end up using the internal format.  In one
> case it works; in the other it does not.  The main point being, most of
> the time it *happens* to work.  But it's really just by chance.
> 
> I think what we should do is use SvPVutf8() when we are UTF8 instead
> of SvPV in sv2text_mbverified().  SvPV gives us a pointer to a string
> in Perl's current internal format (maybe unicode, maybe a utf8 byte
> sequence), while SvPVutf8 will always give us a utf8 encoded string
> (which may or may not be valid!).
> 
> Something like the attached.  Thoughts? I'm not very happy with the non
> utf8 case--  The elog(ERROR, "invalid byte sequence") is a total
> cop-out, yes.  But I did not see a good solution short of hand rolling
> our own version of sv_utf8_downgrade().  Is it worth it?
> <plperl_encoding.patch>
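
As I understand perlapi, SvPVutf8 converts the scalar to UTF-8 before
returning the pointer. From Perl space, utf8::encode on a copy yields the
same octets, so the effect of the proposed change on the chr(128, 33) case
can be sketched roughly like this (as_utf8_octets and is_valid_utf8 are
hypothetical helpers, not anything in PL/Perl):

```perl
use strict;
use warnings;

# Hypothetical helper: UTF-8 encode a copy of the string regardless of
# its UTF8 flag, approximating the octets SvPVutf8 would expose.
sub as_utf8_octets {
    my ($string) = @_;
    utf8::encode($string);
    return $string;
}

# Hypothetical helper: 1 if utf8::decode accepts the octets, else 0.
sub is_valid_utf8 {
    my ($octets) = @_;                   # decode a copy, not the original
    return utf8::decode($octets) ? 1 : 0;
}

my $narrow = chr(128) . chr(33);         # UTF8 flag off: octets 80 21

my $raw_ok      = is_valid_utf8($narrow);      # raw internal octets
my $upgraded    = as_utf8_octets($narrow);     # octets c2 80 21
my $upgraded_ok = is_valid_utf8($upgraded);

printf "raw valid: %d, upgraded valid: %d, upgraded octets: %s\n",
    $raw_ok, $upgraded_ok, unpack('H*', $upgraded);
```

The raw octets are what trips the "invalid byte sequence" error on the UTF8
database; the upgraded ones are well-formed.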

Maybe I'm misunderstanding, but it seems to me that:

* String arguments passed to PL/Perl functions should be decoded from the 
server encoding to Perl's internal representation before the function actually 
gets them.

* Values returned from PL/Perl functions that are in Perl's internal 
representation should be encoded into the server encoding before they're 
returned.

I didn't really follow all of the above; are you aiming for the same thing?
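
Roughly, what I have in mind could be sketched in plain Perl as a wrapper
around the function body. This is only an illustration of the idea, not how
PL/Perl is implemented: $server_encoding stands in for the database's
server_encoding setting, and with_encoding_boundary is a made-up name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: stand-in for the database's server_encoding setting.
my $server_encoding = 'UTF-8';

# Wrap a function so that it only ever sees character strings and its
# result is turned back into octets in the server encoding.
sub with_encoding_boundary {
    my ($code) = @_;
    return sub {
        my @args;
        for my $arg (@_) {
            # Decode on the way in; pass NULLs (undef) through untouched.
            push @args, defined $arg ? decode($server_encoding, $arg) : undef;
        }
        my $result = $code->(@args);
        # Encode on the way out.
        return defined $result ? encode($server_encoding, $result) : undef;
    };
}

# Example body: append U+014D (code point 333) to the argument.
my $append = with_encoding_boundary(sub { return $_[0] . chr(333) });

my $octets = $append->("\xc3\xa9");      # "é" as UTF-8 octets
printf "returned octets: %s\n", unpack('H*', $octets);
```

The body never has to care which internal representation Perl picked; only
the boundary knows about the server encoding.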

Best,

David


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers