Yuanhao Luo has posted comments on this change. Change subject: IMPALA-2428: Support multiple-character string as the field delimiter ......................................................................
Patch Set 1: (6 comments) http://gerrit.cloudera.org:8080/#/c/3314/4//COMMIT_MSG Commit Message: Line 15: As a result, we would get poor performance if the multi-byte field > What happens today if tuple delimiters and field delimiters are the same ch After we get the first column, the following filed delimiter would be parsed as tuple delimiter, so this record ends too early and rest columns would be set to NULL. Test log in https://issues.cloudera.org/browse/IMPALA-2428 has confirmed that. Line 17: better string matching algorithm such as KMP. > Today, are terminators allowed to contain '\0'? No, if there were '\0' in terminators, it would failed to persist metadata to postgres. Just see the log in https://issues.cloudera.org/browse/IMPALA-2428. Besides, I was wondering whether have you test the statement "create table tb1(id int) row format delimited escaped by '\0';" or "create table tb1(id int) row format delimited lines terminated by '\0';" before, both of them failed in my tests with the exception arise by postgres:"ERROR: invalid byte sequence for encoding "UTF8": 0x00". Even if I use "--encoding=LATIN1" to init postgres db, the same error occurs. So in my commit, I add this restriction to terminators of text file. http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.cc File be/src/exec/delimited-text-parser.cc: Line 46: if (field_delim.size() == 1) { > I see you have taken that out of the parser now. Why is that? It didn't take effects in my tests. I'm going to move all row format constrains checking to CreateTableStmt::analyzeRowFormat(), what do you think? http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java File fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java: Line 268: String fieldDelim = StringEscapeUtils.unescapeJava(rowFormat_.getFieldDelimiter()); > Why isn't it already used elsewhere? At the beginning I want to replace parseDelim() with this function, but problems occurs. With commons-lang3, result of StringEscapeUtils.unescapeJava("\\043\\100\\043") is "#@#",result of StringEscapeUtils.unescapeJava("\\") is ""; but in commons-lang, the results are "043100043",'\',respectively. What's more, with this function I can't convert decimal string into byte(s). For example, with Byte.parseByte("35") we get 35, which is a decimal representation of '#', but with StringEscapeUtils.unescapeJava("35"), we just get string "35". Originally, we can use escaped by '35' or '\u0023' or '\043' or '#' to make '#' as escape char, but with this function we can only use the later three but not first '35'. It seems that to support multi-byte delimiter, we can't use decimal representation. For example, to use '###' as field delimiter, we can use fields terminated by '\u0023\043#',but not decimal 35. http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java File fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java: Line 171: if (escapeChar == (byte)fieldDelim.charAt(0) || > I do not see that fix in this code. I will move all constrains checking to CreateTableStmt:analyzeRowFormat(), so if any constrain breaks, we can throw an exception to warn user, but not logging here. http://gerrit.cloudera.org:8080/#/c/3314/4/tests/query_test/test_delimited_text.py File tests/query_test/test_delimited_text.py: Line 56: fields terminated by '\002' > Please name this more appropriately - it only tests failure modes right now what about "test_break_row_format_constrain()"? -- To view, visit http://gerrit.cloudera.org:8080/3314 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Yuanhao Luo <[email protected]> Gerrit-Reviewer: Jim Apple <[email protected]> Gerrit-Reviewer: Yuanhao Luo <[email protected]> Gerrit-HasComments: Yes
