This is a review of "plperl encoding issues"

https://commitfest.postgresql.org/action/patch_view?id=452

Purpose:
========
Your database uses one encoding and passes data to perl in that same encoding, 
which perl is not prepared for (it assumes UTF-8).  This patch makes sure data 
is converted to UTF-8 before it is passed to plperl, then converts the response 
from UTF-8 back to the database encoding for storage.

My test:

ptest2=# create database ptest2 encoding 'EUC_JP' template template0;

I created a simple perl function that reverses the string.  I don't know Japanese so I 
found a tattoo website that had sayings in Japanese... I picked: "I am awesome".

create or replace function preverse(x text) returns text as $$
        my $tmp = reverse($_[0]);
        return $tmp;
$$ LANGUAGE plperl;


Before the patch:

ptest2=# select preverse('私はよだれを垂らす');

      preverse
--------------------
 垢蕕眇鬚譴世茲呂篁
(1 row)

It is also possible to generate invalid characters.  This function pulls off 
the last character in the string, assuming it's UTF-8:

create or replace function plastchar(x text) returns text as $$
        my $tmp = substr($_[0], -1);
        return $tmp;
$$ LANGUAGE plperl;

ptest2=# select plastchar('私はよだれを垂らす');

ERROR:  invalid byte sequence for encoding "EUC_JP": 0xb9
CONTEXT:  PL/Perl function "plastchar"

Because the string was not UTF-8, perl got confused and returned an invalid 
character.
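To see where the stray 0xb9 comes from, here is a small standalone sketch in 
Python (not plperl, and not part of the patch).  It assumes the pre-patch 
behaviour amounts to treating each byte of the EUC_JP string as one character, 
which matches the error above but is my reading, not something taken from the 
patch.

```python
# Standalone illustration; assumes byte-wise handling models the pre-patch behaviour.
text = "私はよだれを垂らす"

raw = text.encode("euc_jp")      # the bytes as stored in an EUC_JP database

# substr($_[0], -1) applied to bytes: only the trailing byte survives.
last_byte = raw[-1:]
print(last_byte.hex())           # b9

# A lone trailing byte is not a valid EUC_JP sequence.
try:
    last_byte.decode("euc_jp")
    valid = True
except UnicodeDecodeError:
    valid = False
print(valid)                     # False

# The same operation on characters gives the expected result.
last_char = raw.decode("euc_jp")[-1]
print(last_char)                 # す
```

The last character is two bytes in EUC_JP (0xa4 0xb9), so cutting off one 
byte leaves half a character, which PG rightly refuses to store.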

After the patch:
The exact same plperl functions work fine:

ptest2=# select preverse('私はよだれを垂らす');

      preverse
--------------------
 すら垂をれだよは私
(1 row)

ptest2=# select plastchar('私はよだれを垂らす');

 plastchar
-----------
 す
(1 row)




Performance:
============
This is a bug fix, not a performance patch.  However, as noted by the author, many 
encodings will be very UTF-8'ish and the overhead will be very small.  For 
those encodings that do need converting, you'd have to do the same conversion 
inside your perl function anyway before you could use the data.  The processing 
has just moved from inside your perl func to inside PG.
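As a rough picture of the work being moved, this is the decode / operate / 
re-encode dance a function author would otherwise do by hand.  It is a Python 
stand-in, not the patch's code (the patch does this in C/XS), and the helper 
name reverse_in_db_encoding is hypothetical.

```python
def reverse_in_db_encoding(raw: bytes, db_encoding: str = "euc_jp") -> bytes:
    """Hypothetical helper: decode, reverse by character, re-encode."""
    chars = raw.decode(db_encoding)            # database encoding -> characters
    reversed_chars = chars[::-1]               # character-wise operation
    return reversed_chars.encode(db_encoding)  # characters -> database encoding


stored = "私はよだれを垂らす".encode("euc_jp")
result = reverse_in_db_encoding(stored)
print(result.decode("euc_jp"))  # すら垂をれだよは私
```

The cost is one conversion on the way in and one on the way out, which is the 
same cost whether it is paid inside the function or inside PG.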




The Patch:
==========
Applies cleanly to git head as of January 15 2011.  PG built with 
--enable-cassert and --enable-debug seems to run fine with no errors.

I don't think regression tests cover plperl, so it is understandable that there 
are no tests in the patch.

There are no manual updates in the patch either, and I think there should be.  I 
think it should be made clear that data (varchar, text, etc., but not bytea) will 
be passed to perl as UTF-8, regardless of database encoding.  Also that "use utf8;" 
is always loaded and in use.



Code Review:
============
I am not qualified.  Looking through the patch, I'm reminded of the old saying: "Any 
sufficiently advanced perl XS code is indistinguishable from magic"  :-)


Other Remarks:
==============
- Yes I know... it was a joke.
- I sure hope this posts to the news group ok
- My terminal (konsole) had a hard time displaying Japanese, so I used psql's 
\i and \o to read/write files that kwrite showed/encoded correctly via EUC_JP


Summary:
========
Looks good.  Looks needed.  Needs manual updates.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
