Hi,

I'm seeing some unexpected behavior with PyGreSQL running under Python 2.x
when reading utf8-encoded data from the db.

Basically, PyGreSQL/Python2.x always seems to return it as a str() rather
than a unicode().

I can work around this at a high level by calling:

    pgdb.set_typecast('varchar',
                      lambda x: None if x is None else x.decode('utf8'))


...but shouldn't I get a unicode() without having to do that?  I looked
through the documentation and the mailing list archives and didn't see
anything about this.
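
For reference, the workaround can also be written as a named, None-safe function, which is easier to reuse and test than the lambda. This is only a sketch: decode_utf8 is a hypothetical helper name, and the registration call is left as a comment (copied from the one-liner above) so the snippet runs without a database connection.

```python
def decode_utf8(value):
    """Decode UTF-8 bytes to text; pass None and already-decoded text through."""
    if value is None:
        return None
    if isinstance(value, bytes):  # on Python 2, bytes is an alias for str
        return value.decode('utf-8')
    return value

# Registration, same call as the lambda version (needs PyGreSQL, so commented out):
# pgdb.set_typecast('varchar', decode_utf8)
```

The isinstance check means the function stays harmless if the driver ever starts returning unicode objects on its own.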

Details:

pgmodule.c makes use of IS_PY3 in a manner that ensures that no special
encoding is done for strings under Python 2.x.  For instance, in
sourceFetch:

#if IS_PY3
                if (PQfformat(self->result, j) == 0) /* textual format */
                {
                    str = get_decoded_string(s, size, encoding);
                    if (!str) /* cannot decode */
                        str = PyBytes_FromStringAndSize(s, size);
                }
                else
#endif
                str = PyBytes_FromStringAndSize(s, size);


This means that under Python 2.x, raw bytes from the db column are handed
through to the Python layer.
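
To make the effect of that #if block easier to see, here is a rough Python model of the branch. It is not the real implementation: fetch_value and its parameters are hypothetical names, and raw.decode() merely stands in for get_decoded_string.

```python
def fetch_value(raw, encoding, is_text, is_py3):
    """Rough model of the quoted C branch; names here are illustrative only."""
    if is_py3 and is_text:
        try:
            return raw.decode(encoding)  # stand-in for get_decoded_string
        except UnicodeDecodeError:
            return raw  # cannot decode: fall back to the raw bytes
    # Python 2 build, or binary format: the bytes are passed through untouched
    return raw

raw = b'I \xe2\x9d\xa4 Huckabees'
as_py3 = fetch_value(raw, 'utf-8', is_text=True, is_py3=True)   # decoded text
as_py2 = fetch_value(raw, 'utf-8', is_text=True, is_py3=False)  # raw bytes
```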

And in the Python layer, the way Typecasts() is configured results in a
no-op for these values, leaving raw UTF-8 bytes in a str().

Here's an actual example:

    postgres=> select unicode_column from unicode_table where column_id='key';
     unicode_column
    ---------------
     I ❤ Huckabees
    (1 row)

Here's proof that the stored value contains the Unicode character U+2764
(code point 10084):

    postgres=> select array_agg(t) from (select
        ascii(regexp_split_to_table(unicode_column, '')) AS t from unicode_table
        where column_id='key') x;
                        array_agg
    --------------------------------------------------
     {73,32,10084,32,72,117,99,107,97,98,101,101,115}
    (1 row)
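
The same code points can be cross-checked from Python without touching the database. A small sketch, with the expected string typed in by hand rather than fetched:

```python
text = u'I \u2764 Huckabees'

# One code point per character; should match the array_agg output above
code_points = [ord(ch) for ch in text]

# The heart (U+2764) becomes three bytes in UTF-8: e2 9d a4
utf8_bytes = text.encode('utf-8')
```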

Now from Python:

    >>> sys.version
    '2.7.8 (default, Mar 31 2018, 02:47:11) \n[GCC 4.1.2 20070626 (Red Hat 4.1.2-14)]'
    >>> print db.query("select unicode_column from unicode_table where column_id='key'").getresult()
    [('I \xe2\x9d\xa4 Huckabees',)]

...whereas I was expecting:

    u'I \u2764 Huckabees'
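
Until the driver does this itself, the rows can be decoded after the fact. A minimal sketch, assuming every str in the result really is UTF-8 text; decode_rows is a hypothetical helper, not part of PyGreSQL, and the input is the getresult() output shown above pasted in by hand:

```python
def decode_rows(rows, encoding='utf-8'):
    """Return a copy of rows with every bytes value decoded to text."""
    return [
        tuple(v.decode(encoding) if isinstance(v, bytes) else v for v in row)
        for row in rows
    ]

rows = [(b'I \xe2\x9d\xa4 Huckabees',)]  # as returned under Python 2.x
decoded = decode_rows(rows)
```

Note that this would also try to decode genuinely binary columns (bytea, for example), so it only suits queries that return text.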



Thanks!

Murray
_______________________________________________
PyGreSQL mailing list
[email protected]
https://mail.vex.net/mailman/listinfo.cgi/pygresql
