Yuanhao Luo has uploaded a new patch set (#7).

Change subject: IMPALA-2428: Support multiple-character string as the field 
delimiter
......................................................................

IMPALA-2428: Support multiple-character string as the field delimiter

This commit adds support for a multi-byte string as the field delimiter.
Other separators (e.g. escape char, line delimiter and key-map
delimiter) are still limited to one byte.

There are some constraints on terminators for text files:
1. Delimiters can't be an empty string
2. Tuple delimiter can't be the first byte of field delimiter
3. Escape character can't be the first byte of field delimiter
4. Escape character and tuple delimiter can't be the same value
5. Terminators can't contain '\0'
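
The five constraints above can be sketched as a small check. This is
an illustration only, not the code in CreateTableStmt.java: the function
name and its parameters are hypothetical, and it assumes the escape
character is optional while the field and tuple delimiters are required.

```python
from typing import List


def check_terminators(field_delim: bytes, tuple_delim: bytes,
                      escape_char: bytes = b"") -> List[str]:
    """Return the violated constraints; an empty list means valid.

    Hypothetical helper mirroring the list in the commit message.
    """
    errors = []
    # 1. Delimiters can't be an empty string.
    if not field_delim or not tuple_delim:
        errors.append("delimiters can't be an empty string")
    # Only the field delimiter may be longer than one byte.
    if len(tuple_delim) > 1 or len(escape_char) > 1:
        errors.append("tuple delimiter and escape char must be one byte")
    # 2. Tuple delimiter can't be the first byte of the field delimiter.
    if tuple_delim and tuple_delim == field_delim[:1]:
        errors.append("tuple delimiter is the first byte of field delimiter")
    # 3. Escape character can't be the first byte of the field delimiter.
    if escape_char and escape_char == field_delim[:1]:
        errors.append("escape char is the first byte of field delimiter")
    # 4. Escape character and tuple delimiter can't be the same value.
    if escape_char and escape_char == tuple_delim:
        errors.append("escape char equals tuple delimiter")
    # 5. Terminators can't contain '\0'.
    if any(b"\0" in t for t in (field_delim, tuple_delim, escape_char)):
        errors.append("terminators can't contain '\\0'")
    return errors
```

For instance, check_terminators(b"#@#", b"\n", b"\\") reports no
violations, while a field delimiter of b"\n," with a tuple delimiter of
b"\n" violates constraint 2.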

Warning: The field terminator can only be set with standard ASCII
characters (decimal values 0 to 127), written either literally or as
octal escapes. Extended ASCII characters (decimal values 128 to 255) are
not supported, and neither are standard ASCII characters written in
unicode, decimal or hexadecimal format. For example, to use the standard
ASCII string "#@#" as the field delimiter, you can write fields
terminated by '#\100\043', but you cannot write '#' as '\u0023', '35' or
'\x23'. I didn't find a way to unescape decimal and hexadecimal strings,
and there's a bug in SqlParser.parse() when parsing unicode strings and
octal values of extended ASCII characters. I have opened an issue at
https://issues.cloudera.org/browse/IMPALA-3777. Once that is fixed, we
can also use unicode strings, and octal strings for extended ASCII
characters.
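
To make the octal form concrete, here is a minimal sketch of how a
literal such as '#\100\043' resolves to "#@#". It is not the unescaping
code Impala uses; unescape_octal is a hypothetical helper that only
handles three-digit octal escapes and rejects values above 127, matching
the restriction described above.

```python
import re

# Matches a backslash followed by exactly three octal digits.
_OCTAL_ESCAPE = re.compile(rb"\\([0-7]{3})")


def unescape_octal(literal: bytes) -> bytes:
    """Replace \\ooo escapes with the byte they denote (hypothetical helper)."""
    def replace(match):
        value = int(match.group(1).decode("ascii"), 8)
        if value > 127:
            # Extended ASCII (128 to 255) is not supported for now.
            raise ValueError("only standard ASCII (0 to 127) is supported")
        return bytes([value])

    return _OCTAL_ESCAPE.sub(replace, literal)
```

Here \100 is '@' (decimal 64) and \043 is '#' (decimal 35), so the
literal '#\100\043' yields the three-byte delimiter "#@#".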

Other one-byte terminators are still allowed to use a decimal value.

TODO: Because it is hard to use SSE4_2 for multi-byte matching, this
commit supports multi-byte field delimiters via direct string matching.
As a result, performance will be poor if the multi-byte field delimiter
is relatively long. Maybe we can add a constraint on the length of the
field terminator.
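
The cost of direct string matching can be seen in a small sketch. This
is not the parser in delimited-text-parser.cc: split_fields is a
hypothetical function, it ignores escape characters, and it assumes a
one-byte tuple delimiter. At every position that is not a tuple
delimiter it compares up to len(field_delim) bytes, so the worst case
grows with the delimiter length.

```python
def split_fields(data: bytes, field_delim: bytes, tuple_delim: bytes = b"\n"):
    """Split data into rows of fields by direct byte comparison (sketch)."""
    rows, row = [], []
    start = i = 0
    n = len(field_delim)
    while i < len(data):
        if data[i:i + 1] == tuple_delim:
            # End of tuple: close the current field and the current row.
            row.append(data[start:i])
            rows.append(row)
            row = []
            i += 1
            start = i
        elif data[i:i + n] == field_delim:
            # Direct comparison against the whole multi-byte delimiter.
            row.append(data[start:i])
            i += n
            start = i
        else:
            i += 1
    # Flush a final row that is not terminated by the tuple delimiter.
    if start < len(data) or row:
        row.append(data[start:])
        rows.append(row)
    return rows
```

With field_delim set to b"#@#", the input b"a#@#b\n" splits into one row
with the fields b"a" and b"b"; a partial match such as "#@" followed by
another byte is left inside the field.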

Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
---
M be/src/exec/delimited-text-parser-test.cc
M be/src/exec/delimited-text-parser.cc
M be/src/exec/delimited-text-parser.h
M be/src/exec/delimited-text-parser.inline.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
M fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
A testdata/data/text-commacomma-backslash-newline.txt
A testdata/data/text-dollarhash-hash-pipe.txt
A testdata/data/text-hashathash-ecirc-newline.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/delimited-latin-text.test
M testdata/workloads/functional-query/queries/QueryTest/delimited-text.test
M tests/query_test/test_delimited_text.py
21 files changed, 460 insertions(+), 91 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/14/3314/7
-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Yuanhao Luo <[email protected]>
