[ 
https://issues.apache.org/jira/browse/IMPALA-10434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269139#comment-17269139
 ] 

ASF subversion and git services commented on IMPALA-10434:
----------------------------------------------------------

Commit 4c6cf4b2efb37a80b9069f5da69931e585bfea7e in impala's branch 
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=4c6cf4b ]

IMPALA-10434: Fix impala-shell's unicode regressions on Python2

To make impala-shell compatible with Python3, we explicitly distinguish
bytes from text in Python2 by decoding the bytes of all inputs.

Regression 1: multiple queries in one line with unicode chars will break

In precmd() of impala-shell, if there are multiple queries present in
one input line, we split it into individual queries (by
sqlparse.split()) and append them back to the 'cmdqueue'. They will be
passed to precmd() again. In our Python2 implementation, precmd()
expects them to be str type, and will decode them into unicode type.
However, sqlparse.split() returns unicode, which is already decoded
text. Calling decode() on a unicode value makes Python2 implicitly
encode it to str first. That implicit encoding uses 'ascii', so it may
raise UnicodeEncodeError on non-ASCII characters.
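
As a rough illustration of Regression 1 (not the code from this commit),
the sketch below reproduces the ASCII encode that Python2 performs
implicitly, and shows one possible guard that decodes only byte strings.
The helper name to_text() is hypothetical and does not come from
impala-shell.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Decode byte strings; pass already-decoded text through unchanged.

    Hypothetical helper for illustration. `bytes` is an alias of `str`
    on Python2, so the same check works on both major versions.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


query = u'select "你好"'

# On Python2, query.decode('utf-8') first encodes `query` with 'ascii'.
# The explicit call below reproduces that hidden step and its failure.
try:
    query.encode("ascii")
    implicit_encode_failed = False
except UnicodeEncodeError:
    implicit_encode_failed = True

# The guard accepts both raw input bytes and text that was already split.
from_bytes = to_text(query.encode("utf-8"))
from_text = to_text(query)
```

With a guard of this kind, queries re-queued after splitting could pass
through the decoding step a second time without being re-encoded.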

Regression 2: multi-line query with unicode chars will break when
command history is enabled

In _check_for_command_completion(), when calling
readline.replace_history_item() in Python2, we encode completed_cmd
into bytes. However, we shouldn't replace completed_cmd with the encoded
bytes, since the return type is expected to be unicode.
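
A minimal sketch of the pattern described for Regression 2, assuming the
fix amounts to keeping the encoded history entry in a separate variable.
The function finish_command() and its replace_history_item parameter are
hypothetical stand-ins, not the actual impala-shell code; the callback
plays the role of readline.replace_history_item(pos, line).

```python
# -*- coding: utf-8 -*-
import sys


def finish_command(completed_cmd, replace_history_item, pos=0):
    """Update one history entry and return the command as text.

    Hypothetical illustration: the encoded form is stored in its own
    variable, so the returned value is never overwritten with bytes.
    """
    history_line = completed_cmd
    if sys.version_info[0] == 2:
        # Assumption from the commit message: on Python2 the history
        # entry is passed as encoded bytes.
        history_line = completed_cmd.encode("utf-8")
    replace_history_item(pos, history_line)
    return completed_cmd


# Record the calls instead of touching a real readline history.
recorded = []
result = finish_command(u'select\n"你好"',
                        lambda pos, line: recorded.append((pos, line)))
```

Here the caller always receives the original text, whichever form was
written to the history.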

Tests:
 - Add tests for these two regressions in Python2.

Change-Id: Icc4a8d31311a5c59e5fc0e65fe09f770df41bea4
Reviewed-on: http://gerrit.cloudera.org:8080/16960
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> impala-shell crash in parsing multiline queries that contain UTF-8 characters
> -----------------------------------------------------------------------------
>
>                 Key: IMPALA-10434
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10434
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Clients
>    Affects Versions: Impala 4.0
>            Reporter: Fang-Yu Rao
>            Assignee: Quanlong Huang
>            Priority: Critical
>
> I'm at master branch (commit a6a244099502329d9193b316ea26d5fd6451b6bd) and 
> hit this error:
> {code:java}
> [localhost:21050] default> select "你好";
> Query: select "你好"
> Query submitted at: 2020-12-30 11:00:40 (Coordinator: 
> http://quanlong-OptiPlex-BJ:25000)
> Query progress can be monitored at: 
> http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=554d2348a28884c6:30835a4800000000
> +--------+
> | '你好' |
> +--------+
> | 你好   |
> +--------+
> Fetched 1 row(s) in 0.12s
> [localhost:21050] default> select
>                          > "你好";
> Traceback (most recent call last):
>   File "/home/quanlong/workspace/Impala/shell/impala_shell.py", line 2062, in 
> <module>
>     impala_shell_main()
>   File "/home/quanlong/workspace/Impala/shell/impala_shell.py", line 2027, in 
> impala_shell_main
>     shell.cmdloop(intro)
>   File 
> "/home/quanlong/workspace/Impala/toolchain/toolchain-packages-gcc7.5.0/python-2.7.16/lib/python2.7/cmd.py",
>  line 141, in cmdloop
>     line = self.precmd(line)
>   File "/home/quanlong/workspace/Impala/shell/impala_shell.py", line 631, in 
> precmd
>     args = self.sanitise_input(args.decode('utf-8'))  # python2
>   File "/home/quanlong/workspace/Impala/shell/impala_shell.py", line 435, in 
> sanitise_input
>     tokens = args.strip().split(' ')
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 8: 
> ordinal not in range(128) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
