I'm running some impala-shell queries against Parquet files containing
user-entered strings, and I'm hitting character encoding problems. Chinese
characters come through just fine in the results, so there must be some more
exotic or non-UTF-8 characters somewhere in the input. The errors look like the
following (citing different positions, sometimes echoing a u'' codepoint,
always mentioning range(128)):
Unknown Exception : 'ascii' codec can't encode characters in position 875-876:
ordinal not in range(128)
Could not execute command: select int_col, string_col from report where
string_col like "%${var:component}%" limit 250
Unknown Exception : 'ascii' codec can't encode character u'\u4e0e' in position
3698: ordinal not in range(128)
Could not execute command: select int_col, string_col from report where
string_col like "%${var:component}%" limit 250
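For what it's worth, the message looks like the standard Python
UnicodeEncodeError you get when a non-ASCII codepoint is forced through the
'ascii' codec (impala-shell is a Python client). A minimal reproduction of the
same error class, using the u'\u4e0e' codepoint from the second error above
(the surrounding text is just illustrative):

```python
# Force a non-ASCII codepoint through the 'ascii' codec, which is
# what the shell's error message says is happening to the result rows.
text = "report text containing \u4e0e"  # \u4e0e is the codepoint from the error
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # Prints a message of the same shape as the shell errors:
    # 'ascii' codec can't encode character ... ordinal not in range(128)
    print(exc)
```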
Is there a WHERE-clause technique or a string-regularizing function I could use
to skip strings containing unrecognizable characters? Setting MAX_ERRORS=0
and/or ABORT_ON_ERROR=0 before the queries didn't help. If I reduce the LIMIT
to something very low, the queries tend to work -- they seem to fail on the
first problematic character they encounter. The impala-shell commands are
issued from a bash script, and ${var:component} is a Hadoop-related name like
'impala' or 'kafka'.
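In case the invocation pattern matters, it is roughly the following sketch
(names are illustrative; `component` stands in for the ${var:component}
substitution, and the actual impala-shell call is shown commented out):

```shell
#!/usr/bin/env bash
# Illustrative sketch only: 'component' stands in for the ${var:component}
# substitution; the real script passes values like 'impala' or 'kafka'.
component="impala"
query="select int_col, string_col from report where string_col like \"%${component}%\" limit 250"
printf '%s\n' "$query"
# The real script then runs something along these lines:
# impala-shell -q "$query"
```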
Thanks,
John