Jim Apple has posted comments on this change. Change subject: IMPALA-2428: Support multiple-character string as the field delimiter ......................................................................
Patch Set 1: (12 comments) http://gerrit.cloudera.org:8080/#/c/3314/1//COMMIT_MSG Commit Message: PS1, Line 11: allowed to have only one byte. "allowed to have only one byte" and "only allowed to have one byte" mean slightly different things. I assume you mean the latter? http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser-test.cc File be/src/exec/delimited-text-parser-test.cc: Can you also add some end-to-end tests, please? PS1, Line 82: filed "field" http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.cc File be/src/exec/delimited-text-parser.cc: Line 46: if (field_delim.size() == 1) { How should DelimitedTextParse behave if field_delim.size() == 0? Line 145: || **byte_buffer_ptr == collection_item_delim_) { This condition should probably go first, since it is cheaper. http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.h File be/src/exec/delimited-text-parser.h: Line 46: char tuple_delim, const string& field_delim_ = string("\0", 1), Are there any restrictions about field_delim_? For instance, is the string "\0\0" allowed? Can tuple_delim be the first character of field_delim_? What about the second? Line 153: /// SSE(xmm) register containing the tuple search character(s). This and the other comments about SSE should be clarified to explain how they behave when field_delim_ is not a single character. http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/hdfs-sequence-table-writer.cc File be/src/exec/hdfs-sequence-table-writer.cc: Line 202: || str_val->ptr[i] == escape_char_) { This condition should probably go first, since it is cheaper http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/hdfs-text-table-writer.cc File be/src/exec/hdfs-text-table-writer.cc: Line 210: || str_val->ptr[i] == escape_char_)) { This condition should probably go first, since it is easier to check. http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java File fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java: Line 268: String fieldDelim = StringEscapeUtils.unescapeJava(rowFormat_.getFieldDelimiter()); Can you explain why it makes sense to use of unescapeJava here? I notice it isn't used elsewhere in the analyzer. http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java File fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java: PS1, Line 44: /h missing space between / and h Line 171: if (escapeChar == (byte)fieldDelim.charAt(0) || Should you check before here that fieldDelim is not null and has length > 0? -- To view, visit http://gerrit.cloudera.org:8080/3314 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Yuanhao Luo <[email protected]> Gerrit-Reviewer: Jim Apple <[email protected]> Gerrit-HasComments: Yes
