Yuanhao Luo has posted comments on this change. Change subject: IMPALA-2428: Support multiple-character string as the field delimiter ......................................................................
Patch Set 1: (11 comments) http://gerrit.cloudera.org:8080/#/c/3314/1//COMMIT_MSG Commit Message: PS1, Line 11: allowed to have only one byte. > "allowed to have only one byte" and "only allowed to have one byte" mean sl Done http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser-test.cc File be/src/exec/delimited-text-parser-test.cc: > Can you also add some end-to-end tests, please? Done PS1, Line 82: filed > "field" Done http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.cc File be/src/exec/delimited-text-parser.cc: Line 46: if (field_delim.size() == 1) { > How should DelimitedTextParse behave if field_delim.size() == 0? size of field_delim could not be 0. We have checked that in sql-parser.y and HdfsStorageDescriptor.java. Line 145: || **byte_buffer_ptr == collection_item_delim_) { > This condition should probably go first, since it is cheaper. Done http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.h File be/src/exec/delimited-text-parser.h: Line 46: char tuple_delim, const string& field_delim_ = string("\0", 1), > Are there any restrictions about field_delim_? For instance, is the string I have tested field_delim_ as "\0\0", that's ok. For now, tuple_delim could not be the first byte of field_delim. Refer to https://gerrit.cloudera.org/#/c/3314/1/be/src/exec/delimited-text-parser.cc@139 , if the tuple_delim is as the same as first byte of field_delim, then if field_delim occurs, it would be parsed as a tuple delim and the left string would be parsed as a new tuple, but this is not we want. http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/hdfs-sequence-table-writer.cc File be/src/exec/hdfs-sequence-table-writer.cc: Line 202: || str_val->ptr[i] == escape_char_) { > This condition should probably go first, since it is cheaper Done http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/hdfs-text-table-writer.cc File be/src/exec/hdfs-text-table-writer.cc: Line 210: || str_val->ptr[i] == escape_char_)) { > This condition should probably go first, since it is easier to check. Done http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java File fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java: Line 268: String fieldDelim = StringEscapeUtils.unescapeJava(rowFormat_.getFieldDelimiter()); > Can you explain why it makes sense to use of unescapeJava here? I notice it Just as parseDelim(), it turns escaped string into un-escaped string.For example, after calling String e = StringEscapeUtils.unescapeJava("\\043\\100\\043"); the value of e is "#@#"; http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java File fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java: PS1, Line 44: /h > missing space between / and h Done Line 171: if (escapeChar == (byte)fieldDelim.charAt(0) || > Should you check before here that fieldDelim is not null and has length > 0 Done -- To view, visit http://gerrit.cloudera.org:8080/3314 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Yuanhao Luo <[email protected]> Gerrit-Reviewer: Jim Apple <[email protected]> Gerrit-Reviewer: Yuanhao Luo <[email protected]> Gerrit-HasComments: Yes
