Yuanhao Luo has uploaded a new patch set (#4).

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................

IMPALA-2428: Support multiple-character string as the field delimiter

This commit adds support for a multi-byte string as the field delimiter.
Meanwhile, the other separators (e.g. escape char, line delimiter, and
key-map delimiter) are still limited to a single byte.

There are some constraints on terminators:
1. The field delimiter cannot be empty.
2. The tuple delimiter cannot be the first byte of the field delimiter.
3. The escape character cannot be the first byte of the field delimiter.
4. Terminators cannot contain '\0'.
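
The four constraints above could be checked roughly as in the sketch below.
ValidateTerminators is a hypothetical helper written for illustration, not
code from this change. It assumes "terminators" in rule 4 means the field
and tuple delimiters; whether the escape character is also covered is not
stated here.

```cpp
#include <string>

// Hypothetical helper (not from the patch): returns an empty string when the
// terminators satisfy the four constraints, otherwise a short reason.
std::string ValidateTerminators(const std::string& field_delim,
                                char tuple_delim, char escape_char) {
  // 1. The field delimiter cannot be empty.
  if (field_delim.empty()) return "field delimiter is empty";
  // 4. Terminators cannot contain '\0' (assumption: field and tuple
  //    delimiters only).
  if (field_delim.find('\0') != std::string::npos || tuple_delim == '\0') {
    return "terminator contains '\\0'";
  }
  // 2. The tuple delimiter cannot be the first byte of the field delimiter.
  if (field_delim[0] == tuple_delim) {
    return "tuple delimiter is the first byte of the field delimiter";
  }
  // 3. The escape character cannot be the first byte of the field delimiter.
  if (field_delim[0] == escape_char) {
    return "escape character is the first byte of the field delimiter";
  }
  return "";
}
```

For example, a field delimiter of ",," with '\n' as the tuple delimiter and
'\\' as the escape character passes, while "#@#" with '#' as the tuple
delimiter is rejected by rule 2.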

TODO: On the assumption that SSE4_2 doesn't support multi-byte matching,
this commit supports multi-byte field delimiters via direct string
matching. As a result, performance may be poor if the multi-byte field
delimiter is relatively long. A better string matching algorithm such as
KMP might perform better.
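
To make the trade-off concrete, here is a minimal sketch of both approaches
on std::string. The function names are hypothetical and this is not the
parser code from the patch; escape handling and tuple delimiters are
ignored. Direct matching re-compares the delimiter at every offset, while
KMP uses a precomputed failure table so it never re-reads input bytes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Direct matching: try the delimiter at each offset, O(n * m) worst case.
size_t FindDelimNaive(const std::string& data, const std::string& delim,
                      size_t start) {
  if (delim.empty()) return std::string::npos;
  for (size_t i = start; i + delim.size() <= data.size(); ++i) {
    if (data.compare(i, delim.size(), delim) == 0) return i;
  }
  return std::string::npos;
}

// KMP failure table: fail[i] is the length of the longest proper prefix of
// delim[0..i] that is also a suffix of it.
std::vector<size_t> BuildFailureTable(const std::string& delim) {
  std::vector<size_t> fail(delim.size(), 0);
  size_t k = 0;
  for (size_t i = 1; i < delim.size(); ++i) {
    while (k > 0 && delim[i] != delim[k]) k = fail[k - 1];
    if (delim[i] == delim[k]) ++k;
    fail[i] = k;
  }
  return fail;
}

// KMP search: O(n + m); each input byte is examined once.
size_t FindDelimKmp(const std::string& data, const std::string& delim,
                    const std::vector<size_t>& fail, size_t start) {
  if (delim.empty()) return std::string::npos;
  size_t k = 0;  // number of delimiter bytes matched so far
  for (size_t i = start; i < data.size(); ++i) {
    while (k > 0 && data[i] != delim[k]) k = fail[k - 1];
    if (data[i] == delim[k]) ++k;
    if (k == delim.size()) return i + 1 - k;  // start offset of the match
  }
  return std::string::npos;
}
```

The failure table depends only on the delimiter, so it could be built once
per table rather than once per scan. For short delimiters the direct loop
is likely just as fast in practice; any gain from KMP would need measuring.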

Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
---
M be/src/exec/delimited-text-parser-test.cc
M be/src/exec/delimited-text-parser.cc
M be/src/exec/delimited-text-parser.h
M be/src/exec/delimited-text-parser.inline.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
M fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
A testdata/data/text-commacomma-backslash-newline.txt
A testdata/data/text-dollarhash-hash-pipe.txt
A testdata/data/text-hashathash-ecirc-newline.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/delimited-latin-text.test
M testdata/workloads/functional-query/queries/QueryTest/delimited-text.test
M tests/query_test/test_delimited_text.py
21 files changed, 405 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/14/3314/4
-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Yuanhao Luo <[email protected]>
