Re:Re: Re?? IMPALA-2428 Support multiple-character string as the field delimiter

Yuanhao Luo Mon, 01 Aug 2016 02:17:25 -0700

Hello, Jim Apple:
For normal create table statement "create table 
text_comma_backslash_newline(col1 string, col2 string, col3 int, col4 int) row 
format delimited fields terminated by ',' escaped by '\\' lines terminated by 
'\n';" and file:
one,two,3,4
one\,one,two,3,4
one\\,two,3,4
one\\\,one,two,3,4
one\\\\,two,3,4
After loading this file into table text_comma_backslash_newline, the result of 
query "select * from text_comma_backslash_newline" is:


+----------+------+------+------+


| col1         | col2   | col3   | col4  |


+----------+------+------+------+


| one          | two   | 3       | 4        |   


| one,one   | two   | 3       | 4        |    


| one\         | two   | 3       | 4        |   


| one\,one  | two   | 3       | 4        |   


| one\\        | two   | 3      | 4        |   


+----------+------+------+------+


and this is what we expected.



For case  escape character is the same value as field delimiter , given create 
table statement "create table text_at_at_newline(col1 string, col2 string, col3 
int, col4 int) row format delimited fields terminated by '@' escaped by '@' 
lines terminated by '\n';" and file:
one@two@3@4

one@,one@two@3@4

one@\@two@3@4

one@\@,one@two@3@4

one@\@\@two@3@4

after loading this file into table text_at_at_newline, the result of query 
"select * from text_at_at_newline" is:+------+------+------+------+

| col1  | col2  | col3  | col4 |

+------+------+------+------+

| one   | two   | 3      | 4      |   

| one   | ,one  | NULL | 3    |   

| one   | \        | NULL | 3    |   

| one   | \        | NULL | NULL |

| one   | \        | NULL | NULL |

+------+------+------+------+

but this is not what we expected. For result of second row "| one   | ,one  | 
NULL | 3    | ", original line is "one@,one@two@3@4".Taking '@' as escaped 
character and field delimiter, we expect the first '@' is treated as escaped 
character, so the value of first column would be 'one,one'. However, the text 
parser treated the first '@' as field delimiter, so the value of first column 
is 'one' and the value of second column comes to ',one'.


For case escape character is the same value as tuple delimiter,  given create 
table statement "create table text_comma_backslash_backslash(col1 string, col2 
string, col3 int, col4 int) row format delimited fields terminated by ',' 
escaped by '\\' lines terminated  by '\\';" and file:

one,two,3,4\one\,one,two,3,4\one\\,two,3,4\one\\\,one,two,3,4\one\\\\,two,3,4


The backslash of color red(backslash after digit '4') represents for tuple 
delimiter. After loading this file into table text_comma_backslash_backslash, 
the result of query "select * from text_comma_backslash_backslash" 
is:+------+------+------+------+

| col1  | col2  | col3  | col4 |

+------+------+------+------+

| one  | two    | 3    | 4    |

| one  | NULL | NULL | NULL |

|        | one      | NULL | 3    |

|        | NULL | NULL | NULL |

|        | two      | 3    | NULL |

|        | NULL | NULL | NULL |

|        | NULL | NULL | NULL |

|        | one      | NULL | 3    |

|        | NULL | NULL | NULL |

|        | NULL | NULL | NULL |

|        | NULL | NULL | NULL |

|        | two      | 3    | 4    |

+------+------+------+------+

Again, this is not what we expected. For result of second row "| one  | NULL | 
NULL | NULL |", the original value is "one\,one,two,3,4". We expect to treat 
first backslash as escaped character, however, the text parser take it as the 
tuple delimiter, so the value of col2, col3 and col4 turn to be 'NULL'.

In conclusion,in the current  branch cdh5-trunk(commit id: 50a7ba059), it seems 
that it cat't handle these two corner cases properly. I'm wondering whether 
have you ever test these two cases.


To make multi-byte field delimiter works properly, I add these five 
restrictions in CreateTableStmt.java::analyzeRowFormat():
1. Delimiters can't be an empty string 
2. Tuple delimiter can't be the first byte of field delimiter 
3. Escape character can't be the first byte of field delimiter 
4. Escape character and tuple delimiter can't the be same value 
5. Terminators can't contains '\0'.


After adding restriciton 5, when the terminator contains '\0', it fails in a 
more orderly way(throws an AnalysisException with error message) than the 
current code does(throws an ImpalaRuntimeException with many stack trace).


What's more, in this patch, we can use only standard ASCII characters(with 
decimal value from 0 to 127) in ascii or octal format to set filed terminator, 
but not extended ASCII characters(with decimal value from 128 to 255) or 
standard ASCII characters in unicode, decimal or hexadecimal format. For 
example, to make standard ASCII characters "#@#" as field delimiter, we can use 
fields terminated by '#\100\043', but not '\u0023', '35', '\x23' respectively. 
I can't find a solution to unescape decimal and hexadecimal string. And it 
seems that there's a bug for SqlParser.parse() to parse unicode string and 
octol value of extended ASCII characters.  Yes, we should discuss this bug in 
another discussion. And for now, I suggest that we should use only standard 
ASCII characters as field delimiter.




------------------ ???????? ------------------
??????: Jim Apple  <[email protected]>;
????????: 2016-07-27 04:19:23
??????:Yuanhao Luo <[email protected]>
????:dev@impala <[email protected]>
????: Re: Re?? IMPALA-2428 Support multiple-character string as the field 
delimiter



On Mon, Jul 25, 2016 at 7:30 AM, Yuanhao Luo <[email protected]> 
wrote:


Hello, Jim Apple:
I can't find any tests for case escape character is the same value as field 
delimiter and case escape character is the same value as tuple delimiter from 
testdata/workloads/functional-query/queries/QueryTest/delimited-text.test.
I ran some tests on branch cdh5-trunk(commit id: 50a7ba059), and logs below 
show that even though we have already add warning "WARNINGS: Field delimiter 
and escape character have same value. Escape character will be ignored" and 
"WARNINGS: Line delimiter and escape character have same value: . Escape 
character will be ignored" for these two corner cases, but codes don't work as 
expected.


Can you describe what behavior you expected?



It's a little difficult for me to fix these corner cases, so in my next patch, 
I'm going to enhance restriction as below:

1. Delimiters can't be an empty string.
2. Tuple delimiter can't be the first byte of field delimiter.
3. Escape char can't be the first byte of field delimiter.
4. Escape char and tuple delimiter can't be the same.
5. Delimiters can't contain '\0'.Whenever you change your planned design, we 
need to re-evaluate it against the current code to see how it is different, 
especially any changes that break currently working code.
 

What's more, in my tests, I found that sql-parser.cup can't parse unicode and 
octol of extended ASCII character(with decimal value from 128 to 255) 
correctly. For example, if we want to set "#@#" as fields terminator, we can 
use fields terminated by '\u0023\100\043' , which refers to ASCII #@# 
respectively. The parse result is right. However, when I want to set double 
thorn(extended ASCII character with decimal value 254) as field terminator, for 
example fields terminated by '\u00fe\376', it turns out to '\u00A4376' when I 
run 'describe extended table'. I have report this issue in IMPALA-3777 already.


I don't see a reason to bring that bug into this discussion. Do you?

Re:Re: Re?? IMPALA-2428 Support multiple-character string as the field delimiter

Reply via email to