Hello, Jim Apple: For normal create table statement "create table text_comma_backslash_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',' escaped by '\\' lines terminated by '\n';" and file: one,two,3,4 one\,one,two,3,4 one\\,two,3,4 one\\\,one,two,3,4 one\\\\,two,3,4 After loading this file into table text_comma_backslash_newline, the result of query "select * from text_comma_backslash_newline" is:
+----------+------+------+------+ | col1 | col2 | col3 | col4 | +----------+------+------+------+ | one | two | 3 | 4 | | one,one | two | 3 | 4 | | one\ | two | 3 | 4 | | one\,one | two | 3 | 4 | | one\\ | two | 3 | 4 | +----------+------+------+------+ and this is what we expected. For case escape character is the same value as field delimiter , given create table statement "create table text_at_at_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by '@' escaped by '@' lines terminated by '\n';" and file: one@two@3@4 one@,one@two@3@4 one@\@two@3@4 one@\@,one@two@3@4 one@\@\@two@3@4 after loading this file into table text_at_at_newline, the result of query "select * from text_at_at_newline" is:+------+------+------+------+ | col1 | col2 | col3 | col4 | +------+------+------+------+ | one | two | 3 | 4 | | one | ,one | NULL | 3 | | one | \ | NULL | 3 | | one | \ | NULL | NULL | | one | \ | NULL | NULL | +------+------+------+------+ but this is not what we expected. For result of second row "| one | ,one | NULL | 3 | ", original line is "one@,one@two@3@4".Taking '@' as escaped character and field delimiter, we expect the first '@' is treated as escaped character, so the value of first column would be 'one,one'. However, the text parser treated the first '@' as field delimiter, so the value of first column is 'one' and the value of second column comes to ',one'. For case escape character is the same value as tuple delimiter, given create table statement "create table text_comma_backslash_backslash(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',' escaped by '\\' lines terminated by '\\';" and file: one,two,3,4\one\,one,two,3,4\one\\,two,3,4\one\\\,one,two,3,4\one\\\\,two,3,4 The backslash of color red(backslash after digit '4') represents for tuple delimiter. After loading this file into table text_comma_backslash_backslash, the result of query "select * from text_comma_backslash_backslash" is:+------+------+------+------+ | col1 | col2 | col3 | col4 | +------+------+------+------+ | one | two | 3 | 4 | | one | NULL | NULL | NULL | | | one | NULL | 3 | | | NULL | NULL | NULL | | | two | 3 | NULL | | | NULL | NULL | NULL | | | NULL | NULL | NULL | | | one | NULL | 3 | | | NULL | NULL | NULL | | | NULL | NULL | NULL | | | NULL | NULL | NULL | | | two | 3 | 4 | +------+------+------+------+ Again, this is not what we expected. For result of second row "| one | NULL | NULL | NULL |", the original value is "one\,one,two,3,4". We expect to treat first backslash as escaped character, however, the text parser take it as the tuple delimiter, so the value of col2, col3 and col4 turn to be 'NULL'. In conclusion,in the current branch cdh5-trunk(commit id: 50a7ba059), it seems that it cat't handle these two corner cases properly. I'm wondering whether have you ever test these two cases. To make multi-byte field delimiter works properly, I add these five restrictions in CreateTableStmt.java::analyzeRowFormat(): 1. Delimiters can't be an empty string 2. Tuple delimiter can't be the first byte of field delimiter 3. Escape character can't be the first byte of field delimiter 4. Escape character and tuple delimiter can't the be same value 5. Terminators can't contains '\0'. After adding restriciton 5, when the terminator contains '\0', it fails in a more orderly way(throws an AnalysisException with error message) than the current code does(throws an ImpalaRuntimeException with many stack trace). What's more, in this patch, we can use only standard ASCII characters(with decimal value from 0 to 127) in ascii or octal format to set filed terminator, but not extended ASCII characters(with decimal value from 128 to 255) or standard ASCII characters in unicode, decimal or hexadecimal format. For example, to make standard ASCII characters "#@#" as field delimiter, we can use fields terminated by '#\100\043', but not '\u0023', '35', '\x23' respectively. I can't find a solution to unescape decimal and hexadecimal string. And it seems that there's a bug for SqlParser.parse() to parse unicode string and octol value of extended ASCII characters. Yes, we should discuss this bug in another discussion. And for now, I suggest that we should use only standard ASCII characters as field delimiter. ------------------ ???????? ------------------ ??????: Jim Apple <[email protected]>; ????????: 2016-07-27 04:19:23 ??????:Yuanhao Luo <[email protected]> ????:dev@impala <[email protected]> ????: Re: Re?? IMPALA-2428 Support multiple-character string as the field delimiter On Mon, Jul 25, 2016 at 7:30 AM, Yuanhao Luo <[email protected]> wrote: Hello, Jim Apple: I can't find any tests for case escape character is the same value as field delimiter and case escape character is the same value as tuple delimiter from testdata/workloads/functional-query/queries/QueryTest/delimited-text.test. I ran some tests on branch cdh5-trunk(commit id: 50a7ba059), and logs below show that even though we have already add warning "WARNINGS: Field delimiter and escape character have same value. Escape character will be ignored" and "WARNINGS: Line delimiter and escape character have same value: . Escape character will be ignored" for these two corner cases, but codes don't work as expected. Can you describe what behavior you expected? It's a little difficult for me to fix these corner cases, so in my next patch, I'm going to enhance restriction as below: 1. Delimiters can't be an empty string. 2. Tuple delimiter can't be the first byte of field delimiter. 3. Escape char can't be the first byte of field delimiter. 4. Escape char and tuple delimiter can't be the same. 5. Delimiters can't contain '\0'.Whenever you change your planned design, we need to re-evaluate it against the current code to see how it is different, especially any changes that break currently working code. What's more, in my tests, I found that sql-parser.cup can't parse unicode and octol of extended ASCII character(with decimal value from 128 to 255) correctly. For example, if we want to set "#@#" as fields terminator, we can use fields terminated by '\u0023\100\043' , which refers to ASCII #@# respectively. The parse result is right. However, when I want to set double thorn(extended ASCII character with decimal value 254) as field terminator, for example fields terminated by '\u00fe\376', it turns out to '\u00A4376' when I run 'describe extended table'. I have report this issue in IMPALA-3777 already. I don't see a reason to bring that bug into this discussion. Do you?
