Round 2 of diagnosis. The Chinese characters, e.g. 语句, come through fine when I run the query interactively in impala-shell, but not when the same query goes through impala-shell -q in a bash script. I tried bash idioms like:
    stty iutf8
    export LC_CTYPE=C
    export LANG=C
    export LC_CTYPE=zh_CN.utf8
    export LANG=zh_CN.utf8

to no avail. This is different from IMPALA-532, where the problem is due to specifying a non-existent locale.
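For reference, here is a minimal sketch of the failing pattern. The table, columns, and variable substitution are from the real query; the rest (including the value passed to --var) is a placeholder:

    #!/bin/bash
    # Locale settings tried in various combinations; none changed the outcome.
    stty iutf8
    export LC_CTYPE=zh_CN.utf8
    export LANG=zh_CN.utf8

    # The same query succeeds in an interactive impala-shell session, but
    # here it fails with "'ascii' codec can't encode character ... ordinal
    # not in range(128)" once a problematic character shows up in the results.
    impala-shell --var=component=impala \
        -q 'select int_col, string_col from report where string_col like "%${var:component}%" limit 250'

    # Untested hunch: impala-shell is a Python program, so Python's stdio
    # encoding (e.g. PYTHONIOENCODING=utf8) may matter more than the shell
    # locale when stdout is not a terminal.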
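To make the open question from my first message (quoted below) concrete: one untested idea for skipping rows with unrecognizable characters is to filter to pure-ASCII values on the server side. This assumes the RLIKE operator and RE2's [[:ascii:]] class behave as I expect in this Impala version:

    impala-shell --var=component=impala -q '
        select int_col, string_col
        from report
        where string_col like "%${var:component}%"
          -- untested: keep only rows whose string_col is entirely ASCII
          and string_col rlike "^[[:ascii:]]*$"
        limit 250'

The obvious drawback is that this also skips the legitimate Chinese strings, so it is more of a bisection tool for isolating the offending rows than a real fix.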
Thanks,
John

> On May 15, 2017, at 11:34 PM, John Russell <[email protected]> wrote:
>
> I'm running some impala-shell queries against Parquet files with user-entered
> strings that are causing character encoding problems. I get Chinese
> characters coming through just fine in results. There must be some more
> exotic or non-UTF8 characters somewhere in the input. The errors look like
> the following (citing different positions, sometimes echoing a u'' codepoint,
> always mentioning range(128)):
>
> Unknown Exception : 'ascii' codec can't encode characters in position
> 875-876: ordinal not in range(128)
> Could not execute command: select int_col, string_col from report where
> string_col like "%${var:component}%" limit 250
>
> Unknown Exception : 'ascii' codec can't encode character u'\u4e0e' in
> position 3698: ordinal not in range(128)
> Could not execute command: select int_col, string_col from report where
> string_col like "%${var:component}%" limit 250
>
> Is there a WHERE technique or string regularizer function I could use to skip
> over strings containing unrecognizable characters? SET MAX_ERRORS=0 and/or
> ABORT_ON_ERROR=0 in advance of the queries didn't help. If I reduce the
> LIMIT to something very low, the queries tend to work -- they seem to fail on
> the first instance encountered of any problematic character. The
> impala-shell commands are being issued from a bash script. ${var:component}
> is a Hadoop-related name like 'impala' or 'kafka'.
>
> Thanks,
> John