Yuanhao Luo has uploaded a new patch set (#7).

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................

IMPALA-2428: Support multiple-character string as the field delimiter

This commit adds support for a multi-byte string as the field delimiter. Other separators (e.g. the escape character, line delimiter and key-map delimiter) are still limited to a single byte.

There are some constraints on terminators for text files:
1. Delimiters can't be an empty string
2. Tuple delimiter can't be the first byte of the field delimiter
3. Escape character can't be the first byte of the field delimiter
4. Escape character and tuple delimiter can't be the same value
5. Terminators can't contain '\0'

Warning: To set the field terminator you can use only standard ASCII characters (decimal values 0 to 127), written in ASCII or octal format. Extended ASCII characters (decimal values 128 to 255) are not supported, and neither are standard ASCII characters written in unicode, decimal or hexadecimal format. For example, to use the standard ASCII string "#@#" as the field delimiter, you can write fields terminated by '#\100\043', but you cannot write '#' as '\u0023', '35' or '\x23' (its unicode, decimal and hexadecimal forms respectively).

I didn't find a way to unescape decimal and hexadecimal strings. There is also a bug in SqlParser.parse() when parsing unicode strings and octal values of extended ASCII characters. I have opened an issue at https://issues.cloudera.org/browse/IMPALA-3777. Once that is fixed, we can also use unicode strings and octal strings for extended ASCII characters. Other one-byte terminators are still allowed to use decimal values.

TODO: Because it is hard to use SSE4_2 for multi-byte matching, this commit supports multi-byte field delimiters via direct string matching. As a result, performance will be poor if the multi-byte field delimiter is relatively long. Maybe we can add a constraint on the length of the field terminator.

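For reviewers, the five terminator constraints listed in the commit message can be sketched as a small standalone check. This is only an illustration and not the code in this patch: the function name check_terminators, its parameters, and the convention that an empty escape_char means "no escape character" are all assumptions made for the example.

```python
from typing import List


def check_terminators(field_delim: bytes,
                      tuple_delim: bytes,
                      escape_char: bytes = b"") -> List[str]:
    """Return the list of violated constraints (empty if the set is valid).

    Hypothetical helper that mirrors the constraints in the commit message.
    Assumption: an empty escape_char means no escape character is set.
    """
    errors = []

    # Only the field delimiter may be longer than one byte.
    if len(tuple_delim) > 1 or len(escape_char) > 1:
        errors.append("tuple delimiter and escape character must be one byte")

    # 1. Delimiters can't be an empty string.
    if not field_delim or not tuple_delim:
        errors.append("delimiters can't be an empty string")

    # 2. Tuple delimiter can't be the first byte of the field delimiter.
    if tuple_delim and field_delim[:1] == tuple_delim:
        errors.append("tuple delimiter is the first byte of the field delimiter")

    # 3. Escape character can't be the first byte of the field delimiter.
    if escape_char and field_delim[:1] == escape_char:
        errors.append("escape character is the first byte of the field delimiter")

    # 4. Escape character and tuple delimiter can't be the same value.
    if escape_char and escape_char == tuple_delim:
        errors.append("escape character equals the tuple delimiter")

    # 5. Terminators can't contain '\0'.
    if any(b"\0" in t for t in (field_delim, tuple_delim, escape_char)):
        errors.append("terminators can't contain '\\0'")

    return errors
```

With this sketch, check_terminators(b"#@#", b"\n", b"\\") returns an empty list, while a field delimiter such as b"\n," combined with a b"\n" tuple delimiter is reported as violating constraint 2.
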
Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
---
M be/src/exec/delimited-text-parser-test.cc
M be/src/exec/delimited-text-parser.cc
M be/src/exec/delimited-text-parser.h
M be/src/exec/delimited-text-parser.inline.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
M fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
A testdata/data/text-commacomma-backslash-newline.txt
A testdata/data/text-dollarhash-hash-pipe.txt
A testdata/data/text-hashathash-ecirc-newline.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/delimited-latin-text.test
M testdata/workloads/functional-query/queries/QueryTest/delimited-text.test
M tests/query_test/test_delimited_text.py
21 files changed, 460 insertions(+), 91 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/14/3314/7
--
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Yuanhao Luo <[email protected]>
