Round 2 of diagnosis. The Chinese characters, e.g. 语句, come through fine when I run the query interactively in impala-shell, but not when the same query goes through impala-shell -q in a bash script. I tried bash idioms like:
    stty iutf8
    export LC_CTYPE=C
    export LANG=C
    export LC_CTYPE=zh_CN.utf8
    export LANG=zh_CN.utf8

to no avail. This is different from IMPALA-532, where the problem is due to specifying a non-existent locale.
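For reference, here is a minimal sketch of the failing pattern. The table, columns, and variable substitution are from the real query; the rest (including the value passed to --var) is a placeholder:

    #!/bin/bash
    # Locale settings tried in various combinations; none changed the outcome.
    stty iutf8
    export LC_CTYPE=zh_CN.utf8
    export LANG=zh_CN.utf8

    # The same query succeeds in an interactive impala-shell session, but
    # here it fails with "'ascii' codec can't encode character ... ordinal
    # not in range(128)" once a problematic character shows up in the results.
    impala-shell --var=component=impala \
        -q 'select int_col, string_col from report where string_col like "%${var:component}%" limit 250'

    # Untested hunch: impala-shell is a Python program, so Python's stdio
    # encoding (e.g. PYTHONIOENCODING=utf8) may matter more than the shell
    # locale when stdout is not a terminal.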
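To make the open question from my first message (quoted below) concrete: one untested idea for skipping rows with unrecognizable characters is to filter to pure-ASCII values on the server side. This assumes the RLIKE operator and RE2's [[:ascii:]] class behave as I expect in this Impala version:

    impala-shell --var=component=impala -q '
        select int_col, string_col
        from report
        where string_col like "%${var:component}%"
          -- untested: keep only rows whose string_col is entirely ASCII
          and string_col rlike "^[[:ascii:]]*$"
        limit 250'

The obvious drawback is that this also skips the legitimate Chinese strings, so it is more of a bisection tool for isolating the offending rows than a real fix.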
Thanks,
John

> On May 15, 2017, at 11:34 PM, John Russell <[email protected]> wrote:
>
> I'm running some impala-shell queries against Parquet files with user-entered
> strings that are causing character encoding problems. I get Chinese
> characters coming through just fine in results. There must be some more
> exotic or non-UTF8 characters somewhere in the input. The errors look like
> the following (citing different positions, sometimes echoing a u'' codepoint,
> always mentioning range(128)):
>
> Unknown Exception : 'ascii' codec can't encode characters in position
> 875-876: ordinal not in range(128)
> Could not execute command: select int_col, string_col from report where
> string_col like "%${var:component}%" limit 250
>
> Unknown Exception : 'ascii' codec can't encode character u'\u4e0e' in
> position 3698: ordinal not in range(128)
> Could not execute command: select int_col, string_col from report where
> string_col like "%${var:component}%" limit 250
>
> Is there a WHERE technique or string regularizer function I could use to skip
> over strings containing unrecognizable characters? SET MAX_ERRORS=0 and/or
> ABORT_ON_ERROR=0 in advance of the queries didn't help. If I reduce the
> LIMIT to something very low, the queries tend to work -- they seem to fail on
> the first instance encountered of any problematic character. The
> impala-shell commands are being issued from a bash script. ${var:component}
> is a Hadoop-related name like 'impala' or 'kafka'.
>
> Thanks,
> John