[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Yuanhao Luo (Code Review) Tue, 21 Jun 2016 04:01:41 -0700

Yuanhao Luo has posted comments on this change.

Change subject: IMPALA-2428: Support multiple-character string as the field 
delimiter
......................................................................



Patch Set 1:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/3314/1//COMMIT_MSG
Commit Message:

PS1, Line 11: allowed to have only one byte.
> "allowed to have only one byte" and "only allowed to have one byte" mean sl
Done


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser-test.cc
File be/src/exec/delimited-text-parser-test.cc:

> Can you also add some end-to-end tests, please?
Done


PS1, Line 82: filed
> "field"
Done


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.cc
File be/src/exec/delimited-text-parser.cc:

Line 46:   if (field_delim.size() == 1) {
> How should DelimitedTextParse behave if field_delim.size() == 0?
size of field_delim could not be 0. We have checked that in sql-parser.y and 
HdfsStorageDescriptor.java.


Line 145:           || **byte_buffer_ptr == collection_item_delim_) {
> This condition should probably go first, since it is cheaper.
Done


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/delimited-text-parser.h
File be/src/exec/delimited-text-parser.h:

Line 46:       char tuple_delim, const string& field_delim_ = string("\0", 1),
> Are there any restrictions about field_delim_? For instance, is the string 
I have tested field_delim_ as "\0\0", that's ok. For now, tuple_delim could not 
be the first byte of field_delim. Refer to 
https://gerrit.cloudera.org/#/c/3314/1/be/src/exec/delimited-text-parser.cc@139 
, if the tuple_delim is as the same as first byte of field_delim, then if 
field_delim occurs, it would be parsed as a tuple delim and the left string 
would be parsed as a new tuple, but this is not we want.


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/hdfs-sequence-table-writer.cc
File be/src/exec/hdfs-sequence-table-writer.cc:

Line 202:         || str_val->ptr[i] == escape_char_) {
> This condition should probably go first, since it is cheaper
Done


http://gerrit.cloudera.org:8080/#/c/3314/1/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

Line 210:           || str_val->ptr[i] == escape_char_)) {
> This condition should probably go first, since it is easier to check.
Done


http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
File fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java:

Line 268:     String fieldDelim = 
StringEscapeUtils.unescapeJava(rowFormat_.getFieldDelimiter());
> Can you explain why it makes sense to use of unescapeJava here? I notice it
Just as parseDelim(), it turns escaped string into un-escaped string.For 
example, after calling String e = 
StringEscapeUtils.unescapeJava("\\043\\100\\043"); the value of e is "#@#";


http://gerrit.cloudera.org:8080/#/c/3314/1/fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
File fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java:

PS1, Line 44: /h
> missing space between / and h
Done


Line 171:     if (escapeChar == (byte)fieldDelim.charAt(0) ||
> Should you check before here that fieldDelim is not null and has length > 0
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Yuanhao Luo <[email protected]>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-2428: Support multiple-character string as the field delimiter

Reply via email to