[ 
https://issues.apache.org/jira/browse/AIRFLOW-179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304466#comment-15304466
 ] 

ASF subversion and git services commented on AIRFLOW-179:
---------------------------------------------------------

Commit 87b4b8fa19cb660317198d74f6d51fdde0a7e067 in incubator-airflow's branch 
refs/heads/master from [~john.bod...@gmail.com]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=87b4b8f ]

[AIRFLOW-179] DbApiHook string serialization fails when string contains 
non-ASCII characters

Dear Airflow Maintainers,

Please accept this PR that addresses the following issues:
- https://issues.apache.org/jira/browse/AIRFLOW-179

In addition to correctly serializing non-ASCII characters, the literal 
transformation also corrects an issue with escaping single quotes (').

Note that I intended to add another unit test to `test_hive_to_mysql` in 
`tests/core.py`; however, on inspection, the indentation of the various 
methods seemed wrong: methods are nested, and it is not apparent which class 
they belong to. Additionally, a number of the test cases do not appear to be 
related to the corresponding class.

For testing purposes I simply ran a pipeline which previously failed with the 
following exception,

    [2016-05-26 22:03:39,256] {models.py:1286} ERROR - 'ascii' codec can't decode byte 0xc3 in position 230: ordinal not in range(128)
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1245, in run
        result = task_copy.execute(context=context)
      File "/usr/local/lib/python2.7/dist-packages/airflow/operators/hive_to_mysql.py", line 88, in execute
        mysql.insert_rows(table=self.mysql_table, rows=results)
      File "/usr/local/lib/python2.7/dist-packages/airflow/hooks/dbapi_hook.py", line 176, in insert_rows
        l.append(self._serialize_cell(cell))
      File "/usr/local/lib/python2.7/dist-packages/airflow/hooks/dbapi_hook.py", line 196, in _serialize_cell
        return "'" + str(cell).replace("'", "''") + "'"
      File "/usr/local/lib/python2.7/dist-packages/future/types/newstr.py", line 102, in __new__
        return super(newstr, cls).__new__(cls, value)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 230: ordinal not in range(128)

and verified that, with the fix in place, the task succeeded and the 
resulting output was correct. Note that, from reading the code base, it 
currently seems that only `MySqlHook` objects call the `insert_rows` method.
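
The same class of failure can be reproduced in isolation. The following is a 
minimal sketch, written for Python 3 rather than the Python 2.7 environment 
shown in the traceback, so the implicit ASCII decode performed by `str(cell)` 
is made explicit with `bytes.decode`. The sample value is the name used in the 
issue description quoted below; that the cell arrives as UTF-8 encoded bytes 
is an assumption made for illustration.

```python
# Sketch only: assumes the cell value arrives as UTF-8 encoded bytes.
raw_cell = "Nguyễn Tấn Dũng".encode("utf-8")

try:
    # Decoding with the ASCII codec fails on the first non-ASCII byte.
    raw_cell.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the codec that matches the data succeeds.
decoded = raw_cell.decode("utf-8")

if error is not None:
    # Report which byte the ASCII codec rejected and where it was found.
    print("byte 0x%02x at position %d" % (raw_cell[error.start], error.start))
```

The reported byte and position depend only on the input string, which is why 
the two tracebacks in this message differ in those details but not in the 
exception type.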

Author: John Bodley <john.bod...@airbnb.com>

Closes #1550 from johnbodley/dbapi_hook_serialization.


> DbApiHook string serialization fails when string contains non-ASCII characters
> ------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-179
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-179
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: hooks
>            Reporter: John Bodley
>            Assignee: John Bodley
>             Fix For: Airflow 1.8
>
>
> The DbApiHook.insert_rows(...) method tries to serialize all values to 
> strings using the ASCII codec; this is problematic if the cell contains 
> non-ASCII characters, e.g.
>     >>> from airflow.hooks import DbApiHook
>     >>> DbApiHook._serialize_cell('Nguyễn Tấn Dũng')
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>       File "/usr/local/lib/python2.7/dist-packages/airflow/hooks/dbapi_hook.py", line 196, in _serialize_cell
>         return "'" + str(cell).replace("'", "''") + "'"
>       File "/usr/local/lib/python2.7/dist-packages/future/types/newstr.py", line 102, in __new__
>         return super(newstr, cls).__new__(cls, value)
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 4: ordinal not in range(128)
> Rather than manually serializing and escaping values to an ASCII string, 
> one should serialize the value using the character set of the corresponding 
> target database, leveraging the connection to convert the object to the SQL 
> string literal.
> Additionally, the escaping logic for single quotes (') within the 
> _serialize_cell method seems wrong, i.e. 
>     str(cell).replace("'", "''")
> would escape the string "you're" to be "'you''re'" as opposed to "'you\'re'".
> Note an exception should still be thrown if the target encoding is not 
> compatible with the source encoding.
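
The approach described in the issue can be sketched without a database 
connection. In the minimal Python 3 sketch below, `to_sql_literal` is a 
hypothetical helper, not part of Airflow or of any database driver. It assumes 
a caller-supplied target charset and backslash-style escaping as used by MySQL 
in its default SQL mode. Real code would more likely defer to the literal 
conversion offered by the connection itself, as the issue suggests.

```python
def to_sql_literal(cell, charset="utf-8"):
    """Hypothetical helper: render a cell as a quoted SQL string literal."""
    if cell is None:
        return "NULL"
    # Bytes are decoded with the target charset instead of implicit ASCII.
    if isinstance(cell, bytes):
        text = cell.decode(charset)
    else:
        text = str(cell)
    # Raises UnicodeEncodeError if the target charset cannot represent the
    # text, so incompatible encodings still surface as an exception.
    text.encode(charset)
    # Escape backslashes first so the quote escapes are not doubled up.
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


print(to_sql_literal("you're"))
print(to_sql_literal("Nguyễn Tấn Dũng"))
```

Passing `charset="ascii"` with the second value raises `UnicodeEncodeError`, 
which matches the requirement that an incompatible target encoding should 
still fail loudly rather than silently corrupt the data.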



