Yuanhao Luo has uploaded a new patch set (#4).

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................

IMPALA-2428: Support multiple-character string as the field delimiter

This commit adds support for a multi-byte string as the field delimiter.
Meanwhile, the other separators (e.g. escape char, line delimiter and
key-map delimiter) are still only allowed to be one byte.

There are some constraints on terminators:
1. The field delimiter cannot be empty.
2. The tuple delimiter cannot be the first byte of the field delimiter.
3. The escape character cannot be the first byte of the field delimiter.
4. Terminators cannot contain '\0'.

TODO: Since SSE4_2 doesn't support multi-byte matching, this commit
supports multi-byte field delimiters via direct string matching. As a
result, performance will be poor if the multi-byte field delimiter is
relatively long. A better string matching algorithm such as KMP might
give better performance.

Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
---
M be/src/exec/delimited-text-parser-test.cc
M be/src/exec/delimited-text-parser.cc
M be/src/exec/delimited-text-parser.h
M be/src/exec/delimited-text-parser.inline.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
M fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
A testdata/data/text-commacomma-backslash-newline.txt
A testdata/data/text-dollarhash-hash-pipe.txt
A testdata/data/text-hashathash-ecirc-newline.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/delimited-latin-text.test
M testdata/workloads/functional-query/queries/QueryTest/delimited-text.test
M tests/query_test/test_delimited_text.py
21 files changed, 405 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/14/3314/4
--
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Yuanhao Luo <[email protected]>
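
Illustrative note on the commit message: the terminator constraints and the "direct string matching" approach it describes can be sketched roughly as below. This is not code from the patch. The function names ValidateTerminators and MatchesFieldDelim, the has_escape flag, and the choice to apply the '\0' rule only to the field and tuple delimiters are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper (not from the patch): checks the four constraints
// listed in the commit message. has_escape is an assumed flag meaning
// "an escape character is configured".
bool ValidateTerminators(const std::string& field_delim, char tuple_delim,
                         char escape_char, bool has_escape) {
  // 1. The field delimiter cannot be empty.
  if (field_delim.empty()) return false;
  // 2. The tuple delimiter cannot be the first byte of the field delimiter.
  if (tuple_delim == field_delim[0]) return false;
  // 3. The escape character cannot be the first byte of the field delimiter.
  if (has_escape && escape_char == field_delim[0]) return false;
  // 4. Terminators cannot contain '\0' (assumed here to cover the field
  //    and tuple delimiters).
  if (field_delim.find('\0') != std::string::npos) return false;
  if (tuple_delim == '\0') return false;
  return true;
}

// Hypothetical helper (not from the patch): direct byte-by-byte comparison.
// Returns true if the whole field delimiter starts at buf[pos].
bool MatchesFieldDelim(const char* buf, std::size_t len, std::size_t pos,
                       const std::string& delim) {
  if (delim.empty() || pos > len || len - pos < delim.size()) return false;
  return std::memcmp(buf + pos, delim.data(), delim.size()) == 0;
}
```

Calling MatchesFieldDelim at every candidate position costs time proportional to the delimiter length in the worst case, which is the slowdown for long delimiters that the TODO mentions.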
