I'm running some impala-shell queries against Parquet files containing
user-entered strings, and I'm hitting character encoding problems. Chinese
characters come through just fine in the results, so there must be some more
exotic or non-UTF-8 characters somewhere in the input. The errors look like the
following (citing different positions, sometimes echoing a u'' codepoint,
always mentioning range(128)):
Unknown Exception : 'ascii' codec can't encode characters in position 875-876:
ordinal not in range(128)
Could not execute command: select int_col, string_col from report where
string_col like "%${var:component}%" limit 250
Unknown Exception : 'ascii' codec can't encode character u'\u4e0e' in position
3698: ordinal not in range(128)
Could not execute command: select int_col, string_col from report where
string_col like "%${var:component}%" limit 250
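For what it's worth, the message looks like the standard Python
UnicodeEncodeError you get when a non-ASCII codepoint is forced through the
'ascii' codec (impala-shell is a Python client). A minimal reproduction of the
same error class, using the u'\u4e0e' codepoint from the second error above
(the surrounding text is just illustrative):

```python
# Force a non-ASCII codepoint through the 'ascii' codec, which is
# what the shell's error message says is happening to the result rows.
text = "report text containing \u4e0e"  # \u4e0e is the codepoint from the error
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # Prints a message of the same shape as the shell errors:
    # 'ascii' codec can't encode character ... ordinal not in range(128)
    print(exc)
```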
Is there a WHERE-clause technique or a string-regularizing function I could use
to skip strings containing unrecognizable characters? Setting MAX_ERRORS=0
and/or ABORT_ON_ERROR=0 before the queries didn't help. If I reduce the LIMIT
to something very low, the queries tend to work -- they seem to fail on the
first problematic character they encounter. The impala-shell commands are
issued from a bash script, and ${var:component} is a Hadoop-related name like
'impala' or 'kafka'.
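In case the invocation pattern matters, it is roughly the following sketch
(names are illustrative; `component` stands in for the ${var:component}
substitution, and the actual impala-shell call is shown commented out):

```shell
#!/usr/bin/env bash
# Illustrative sketch only: 'component' stands in for the ${var:component}
# substitution; the real script passes values like 'impala' or 'kafka'.
component="impala"
query="select int_col, string_col from report where string_col like \"%${component}%\" limit 250"
printf '%s\n' "$query"
# The real script then runs something along these lines:
# impala-shell -q "$query"
```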
Thanks,
John