John Bodley created AIRFLOW-179:
-----------------------------------

             Summary: DbApiHook string serialization fails when string contains non-ASCII characters
                 Key: AIRFLOW-179
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-179
             Project: Apache Airflow
          Issue Type: Bug
          Components: hooks
            Reporter: John Bodley
            Assignee: John Bodley


The DbApiHook.insert_rows(...) method tries to serialize all values to strings 
using the ASCII codec. This is problematic if a cell contains non-ASCII 
characters, e.g.

>>> from airflow.hooks import DbApiHook
>>> DbApiHook._serialize_cell('Nguyễn Tấn Dũng')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/airflow/hooks/dbapi_hook.py", line 196, in _serialize_cell
    return "'" + str(cell).replace("'", "''") + "'"
  File "/usr/local/lib/python2.7/dist-packages/future/types/newstr.py", line 102, in __new__
    return super(newstr, cls).__new__(cls, value)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 4: ordinal not in range(128)


Rather than manually serializing values to an ASCII string, the hook should 
serialize each value using the character set of the corresponding target 
database, leveraging the connection to convert the object to a SQL string 
literal.
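
One way to let the connection handle the conversion is to pass the values as 
bound parameters instead of building literals by hand. The sketch below is only 
an illustration under assumptions: it uses the standard-library sqlite3 driver 
and Python 3, and the insert_rows function and people table are hypothetical 
stand-ins, not the actual DbApiHook code. Note that it swaps literal rendering 
for parameter binding, which is a related but different technique.

```python
import sqlite3


def insert_rows(conn, table, rows):
    """Hypothetical helper: insert rows using driver-side parameter binding."""
    for row in rows:
        # One "?" placeholder per cell; the driver encodes and escapes values.
        placeholders = ", ".join("?" for _ in row)
        # The table name is assumed to come from trusted code, not user input.
        sql = "INSERT INTO {} VALUES ({})".format(table, placeholders)
        conn.execute(sql, row)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
insert_rows(conn, "people", [("Nguyễn Tấn Dũng",), ("O'Brien",)])
names = [r[0] for r in conn.execute("SELECT name FROM people ORDER BY rowid")]
```

Placeholder syntax differs between drivers (sqlite3 uses "?", others use "%s"), 
so a real hook would need to account for the driver's paramstyle.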

Note that an exception should still be raised if the target encoding is not 
compatible with the source encoding.
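
A minimal sketch of that behaviour, assuming Python 3 and a hypothetical 
serialize_cell helper that takes the target charset as a parameter (neither the 
name nor the signature comes from Airflow). Encoding with the "strict" error 
handler raises UnicodeEncodeError when the target charset cannot represent the 
value.

```python
def serialize_cell(cell, charset="utf-8", source_encoding="utf-8"):
    """Hypothetical helper: render a cell as a SQL string literal in `charset`."""
    if cell is None:
        return b"NULL"
    if isinstance(cell, bytes):
        # Assumption: byte strings arrive encoded as `source_encoding`.
        text = cell.decode(source_encoding)
    else:
        text = str(cell)
    # Double any single quotes, as the original _serialize_cell does.
    escaped = text.replace("'", "''")
    # "strict" raises UnicodeEncodeError if `charset` cannot encode the text.
    return ("'" + escaped + "'").encode(charset, "strict")


utf8_literal = serialize_cell("Nguyễn Tấn Dũng")
quoted_literal = serialize_cell("O'Brien")
```

Calling serialize_cell("Nguyễn Tấn Dũng", charset="latin-1") raises 
UnicodeEncodeError, because Latin-1 has no code point for "ễ".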



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
