Yuanhao Luo has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field 
delimiter
......................................................................


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/3314/4//COMMIT_MSG
Commit Message:

Line 15: As a result, we would get poor performance if the multi-byte field
> What happens today if tuple delimiters and field delimiters are the same ch
After we get the first column, the following filed delimiter would be parsed as 
tuple delimiter, so this record ends too early and rest columns would be set to 
NULL. Test log in  https://issues.cloudera.org/browse/IMPALA-2428 has confirmed 
that.


Line 17: better string matching algorithm such as KMP.
> Today, are terminators allowed to contain '\0'?
No, if there were '\0' in terminators, it would failed to persist metadata to 
postgres. Just see the log in https://issues.cloudera.org/browse/IMPALA-2428. 
Besides, I was wondering whether have you test the statement "create table 
tb1(id int) row format delimited escaped by '\0';" or "create table tb1(id int) 
row format delimited lines terminated by '\0';" before, both of them failed in 
my tests with the exception arise by  postgres:"ERROR: invalid byte sequence 
for encoding "UTF8": 0x00". Even if I use "--encoding=LATIN1" to init postgres 
db, the same error occurs. So in my commit, I add this restriction to 
terminators of text file.


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.cc
File be/src/exec/delimited-text-parser.cc:

Line 46:   if (field_delim.size() == 1) {
> I see you have taken that out of the parser now. Why is that?
It didn't take effects in my tests. I'm going to move all row format constrains 
checking to CreateTableStmt::analyzeRowFormat(), what do you think?


http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java:

Line 268:     String fieldDelim = 
StringEscapeUtils.unescapeJava(rowFormat_.getFieldDelimiter());
> Why isn't it already used elsewhere?
At the beginning I want to replace parseDelim() with this function, but 
problems occurs.
With commons-lang3, result of StringEscapeUtils.unescapeJava("\\043\\100\\043") 
is "#@#",result of StringEscapeUtils.unescapeJava("\\") is "";
but in commons-lang, the results are "043100043",'\',respectively. What's more, 
with this function I can't convert decimal string into byte(s). For example, 
with Byte.parseByte("35") we get 35, which is a decimal representation of '#', 
but with StringEscapeUtils.unescapeJava("35"), we just get string "35". 
Originally, we can use escaped by '35' or '\u0023' or '\043' or '#' to make '#' 
as escape char, but with this function we can only use the later three but not 
first '35'. It seems that to support multi-byte delimiter, we can't use decimal 
representation. For example, to use '###' as field delimiter, we can use fields 
terminated by '\u0023\043#',but not decimal 35.


http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
File fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java:

Line 171:     if (escapeChar == (byte)fieldDelim.charAt(0) ||
> I do not see that fix in this code.
I will move all constrains checking to CreateTableStmt:analyzeRowFormat(), so 
if any constrain breaks, we can throw an exception to warn user, but not 
logging here.


http://gerrit.cloudera.org:8080/#/c/3314/4/tests/query_test/test_delimited_text.py
File tests/query_test/test_delimited_text.py:

Line 56:       fields terminated by '\002'
> Please name this more appropriately - it only tests failure modes right now
what about "test_break_row_format_constrain()"?


-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Yuanhao Luo <[email protected]>
Gerrit-HasComments: Yes

Reply via email to