Hello, everyone: I have pushed a commit to Gerrit which adds support for taking a multiple-character string as the field delimiter. Tests show that it works as expected, but there are still some constraints on delimiters. I will illustrate them below and give some test logs.
Terminator constraints

- The field terminator can't be an empty string.
- The line terminator can't be the first byte of the field terminator.
- The escape character can't be the first byte of the field terminator.
- Terminators can't contain '\0' for text files.

All of these constraints are enforced in CreateTableStmt.java:analyzeRowFormat(), and they apply only to text files.
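A minimal sketch of the four checks, modeled on what analyzeRowFormat() does in the patch (simplified: the real method works on the parsed ROW FORMAT clause, and AnalysisException here stands for Impala's frontend com.cloudera.impala.common.AnalysisException):

  static void analyzeRowFormat(String fieldDelim, char lineDelim, char escapeChar)
      throws AnalysisException {
    // Constraint 1: an empty field delimiter can never match anything.
    if (fieldDelim.isEmpty()) {
      throw new AnalysisException("Field delimiter can't be an empty string");
    }
    // Constraints 2 and 3: the scanner dispatches on single bytes, so neither
    // the line terminator nor the escape character may collide with the first
    // byte of the field delimiter.
    if (fieldDelim.charAt(0) == lineDelim) {
      throw new AnalysisException("Line delimiter can't be the first byte of "
          + "field delimiter, lineDelim: " + lineDelim + ", fieldDelim: " + fieldDelim);
    }
    if (fieldDelim.charAt(0) == escapeChar) {
      throw new AnalysisException("Escape character can't be the first byte of "
          + "field delimiter, escapeChar: " + escapeChar + ", fieldDelim: " + fieldDelim);
    }
    // Constraint 4: the metastore can't persist '\0' (see the Postgres logs below).
    if (fieldDelim.indexOf('\0') != -1 || lineDelim == '\0' || escapeChar == '\0') {
      throw new AnalysisException("Terminators can't contain \\0");
    }
  }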
Logs:

Field terminator is an empty string

[nobida147:21000] > create table tb1(id int) row format delimited fields terminated by "";
Query: create table tb1(id int) row format delimited fields terminated by ""
ERROR: AnalysisException: Field delimiter can't be an empty string
[nobida147:21000] > create table tb1(id int) row format delimited fields terminated by '';
Query: create table tb1(id int) row format delimited fields terminated by ''
ERROR: AnalysisException: Field delimiter can't be an empty string

Line terminator is the first byte of the field delimiter

If the tuple delimiter (line terminator), e.g. '#', is the same as the first byte of the field delimiter, e.g. "#@#", then according to the code at
https://gerrit.cloudera.org/#/c/3314/2/be/src/exec/delimited-text-parser.cc@143 ,
given the data "1#@#CLOUDERA#@#1#" and the table schema (id int, name string, age int), the parsed result would be:

id    name  age
1     NULL  NULL   (reached the first '#', meaning the tuple ends, so name and age are NULL)
NULL  NULL  NULL   ('@' can't be turned into an int, so id is NULL; name and age as above)
NULL  NULL  NULL   ('CLOUDERA' can't be turned into an int, so id is NULL; name and age as above)
NULL  NULL  NULL   ('@' can't be turned into an int, so id is NULL; name and age as above)
1     NULL  NULL

As shown above, the result means nothing, so in this commit the tuple delimiter can't be the same as the first byte of the field delimiter.
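A self-contained sketch of what goes wrong (this is not Impala's actual scanner, which works on byte buffers; it only mimics the behavior that the line delimiter is matched before the multi-byte field delimiter ever gets a chance):

  import java.util.ArrayList;
  import java.util.List;

  public class DelimiterClash {
    public static void main(String[] args) {
      String data = "1#@#CLOUDERA#@#1#";  // field delimiter "#@#", line delimiter '#'
      char lineDelim = '#';
      List<String> tuples = new ArrayList<>();
      StringBuilder cur = new StringBuilder();
      for (char c : data.toCharArray()) {
        if (c == lineDelim) {   // the line delimiter wins before "#@#" can match
          tuples.add(cur.toString());
          cur.setLength(0);
        } else {
          cur.append(c);
        }
      }
      // Prints [1, @, CLOUDERA, @, 1]: five bogus tuples instead of one row,
      // matching the five rows in the table above.
      System.out.println(tuples);
    }
  }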
In addition, according to the code at
https://gerrit.cloudera.org/#/c/3314/2/fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java@177 ,
if we find that the tuple delimiter is the first byte of the field delimiter, we replace the tuple delimiter with DEFAULT_LINE_DELIM ('\n'). The tests below show that this makes sense.
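A minimal sketch of that fallback (the constant name follows the patch's HdfsStorageDescriptor; the char-based helper is a simplified stand-in for the byte-level logic there):

  static final char DEFAULT_LINE_DELIM = '\n';

  // If the line delimiter collides with the field delimiter's first byte,
  // fall back to '\n' so the scanner can still separate tuples.
  static char resolveLineDelim(char lineDelim, String fieldDelim) {
    if (!fieldDelim.isEmpty() && fieldDelim.charAt(0) == lineDelim) {
      return DEFAULT_LINE_DELIM;
    }
    return lineDelim;
  }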
[root@nobida147 workspace]# cat tuple_in_field_oneline.dat
1#@#CLOUDERA#@#1#2#@#IMPALA#@#2#3#@##@#3##@#id null#@#4#5#@#age null#@#
[root@nobida147 workspace]# cat tuple_in_field_multiline.dat
1#@#CLOUDERA#@#1
2#@#IMPALA#@#2
3#@##@#3
#@#id null#@#4
5#@#age null#@#
[nobida147:21000] > load data inpath 'hdfs://localhost:20500/user/root/tuple_in_field_oneline.dat' into table tuple_in_field;
Query: load data inpath 'hdfs://localhost:20500/user/root/tuple_in_field_oneline.dat' into table tuple_in_field
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Fetched 1 row(s) in 5.49s
[nobida147:21000] > load data inpath 'hdfs://localhost:20500/user/root/tuple_in_field_multiline.dat' into table tuple_in_field;
Query: load data inpath 'hdfs://localhost:20500/user/root/tuple_in_field_multiline.dat' into table tuple_in_field
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 2 |
+----------------------------------------------------------+
Fetched 1 row(s) in 0.23s
[nobida147:21000] > select * from tuple_in_field;
Query: select * from tuple_in_field
+------+----------+------+
| id   | name     | age  |
+------+----------+------+
| 1    | CLOUDERA | NULL |
| 1    | CLOUDERA | 1    |
| 2    | IMPALA   | 2    |
| 3    |          | 3    |
| NULL | id null  | 4    |
| 5    | age null | NULL |
+------+----------+------+
WARNINGS: Error converting column: 2 TO INT (Data is: 1#2)
file: hdfs://localhost:20500/test-warehouse/db3.db/tuple_in_field/tuple_in_field_oneline.dat
record: 1#@#CLOUDERA#@#1#2#@#IMPALA#@#2#3#@##@#3##@#id null#@#4#5#@#age null#@#
Fetched 6 row(s) in 0.74s

For "1#@#CLOUDERA#@#1#2#@#IMPALA#@#2#3#@##@#3##@#id null#@#4#5#@#age null#@#", the '#' tuple delimiter has been replaced with '\n', so when we come to "1#2#@#", the first '#' isn't parsed as a tuple delimiter; therefore, when we try to turn "1#2" into an int column, the warning above occurs.

After adding the constraint, an exception is thrown to warn the user:

[nobida147:21000] > create table tb1(id int) row format delimited fields terminated by '#\043' lines terminated by '#';
Query: create table tb1(id int) row format delimited fields terminated by '#\043' lines terminated by '#'
ERROR: AnalysisException: Line delimiter can't be the first byte of field delimiter, lineDelim: #, fieldDelim: ##

Escape character is the first byte of the field delimiter

If the escape character is the first byte of the field delimiter, then when we read that byte we can't know whether it is the escape character or the beginning of the field delimiter.
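To see the ambiguity concretely, here is a self-contained sketch (hypothetical, not Impala code) in which the same bytes decode to different fields depending on which interpretation the scanner tries first, with escape '#' and field delimiter "#@":

  import java.util.Arrays;
  import java.util.List;

  public class EscapeClash {
    public static void main(String[] args) {
      String data = "a#@b";
      // Reading 1: '#' escapes the next byte, so '@' is literal: one field "a@b".
      System.out.println(decodeEscapeFirst(data));          // [a@b]
      // Reading 2: "#@" is the field delimiter: two fields "a" and "b".
      System.out.println(Arrays.asList(data.split("#@")));  // [a, b]
    }

    static List<String> decodeEscapeFirst(String s) {
      StringBuilder out = new StringBuilder();
      for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) == '#' && i + 1 < s.length()) {
          out.append(s.charAt(++i));  // escape: take the next byte literally
        } else {
          out.append(s.charAt(i));
        }
      }
      return Arrays.asList(out.toString());
    }
  }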
After adding the constraint, an exception is thrown to warn the user:

[nobida147:21000] > create table tb1(id int) row format delimited fields terminated by '#\043' escaped by '#';
Query: create table tb1(id int) row format delimited fields terminated by '#\043' escaped by '#'
ERROR: AnalysisException: Escape character can't be the first byte of field delimiter, escapeChar: #, fieldDelim: ##

Terminators including '\0'

I have tried to create a table with '\0' as the field delimiter, or with '\0' as the escape character, but both failed. The log shows "ERROR: invalid byte sequence for encoding "UTF8": 0x00". Even if I use "--encoding=LATIN1" to init the postgres db, the same error occurs (Postgres does not allow the NUL byte 0x00 in text values regardless of the database encoding, which would explain why LATIN1 doesn't help). I was wondering whether you have tested these corner cases before?

[nobida147:21000] > create table single_null(id int, name string, age int) row format delimited fields terminated by "\u0000";
Query: create table single_null(id int, name string, age int) row format delimited fields terminated by "\u0000"
ERROR: ImpalaRuntimeException: Error making 'createTable' RPC to Hive Metastore:
CAUSED BY: MetaException: javax.jdo.JDODataStoreException: Put request failed : INSERT INTO "SERDE_PARAMS" ("PARAM_VALUE","SERDE_ID","PARAM_KEY") VALUES (?,?,?)
	at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451)
	at org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:732)
	at org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752)
	at org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:902)
	. . .
NestedThrowablesStackTrace:
org.datanucleus.store.rdbms.exceptions.MappedDatastoreException: INSERT INTO "SERDE_PARAMS" ("PARAM_VALUE","SERDE_ID","PARAM_KEY") VALUES (?,?,?)
	. . .
*-------------Attention here---------------------*
*Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00*
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1835)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:500)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:334)
	at com.jolbox.bonecp.PreparedStatementHandle.executeUpdate(PreparedStatementHandle.java:205)
	at org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeUpdate(ParamLoggingPreparedStatement.java:399)
	at org.datanucleus.store.rdbms.SQLController.executeStatementUpdate(SQLController.java:439)
	at org.datanucleus.store.rdbms.scostore.JoinMapStore.internalPut(JoinMapStore.java:1069)
	... 68 more
[nobida147:21000] > create table single_null(id int, name string, age int) row format delimited lines terminated by '\0';
Query: create table single_null(id int, name string, age int) row format delimited lines terminated by '\0'
ERROR: ImpalaRuntimeException: Error making 'createTable' RPC to Hive Metastore:
CAUSED BY: MetaException: javax.jdo.JDODataStoreException: Put request failed : INSERT INTO "SERDE_PARAMS" ("PARAM_VALUE","SERDE_ID","PARAM_KEY") VALUES (?,?,?)
	. . .
Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00
	. . .
	... 68 more
[nobida147:21000] > create table single_null(id int, name string, age int) row format delimited escaped by '\0';
Query: create table single_null(id int, name string, age int) row format delimited escaped by '\0'
ERROR: ImpalaRuntimeException: Error making 'createTable' RPC to Hive Metastore:
CAUSED BY: MetaException: javax.jdo.JDODataStoreException: Put request failed : INSERT INTO "SERDE_PARAMS" ("PARAM_VALUE","SERDE_ID","PARAM_KEY") VALUES (?,?,?)
	. . .
Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00
	. . .
	... 68 more

It seems that the error is caused by postgres. For now, this commit doesn't support delimiters (field, line, or escape character) that include '\0'.
After adding the constraint, an exception is thrown to warn the user:

[nobida147:21000] > create table tb1(id int) row format delimited fields terminated by '\0';
Query: create table tb1(id int) row format delimited fields terminated by '\0'
ERROR: AnalysisException: Terminators can't contains \0
[nobida147:21000] > create table tb1(id int) row format delimited fields terminated by '#\043' lines terminated by '\0';
Query: create table tb1(id int) row format delimited fields terminated by '#\043' lines terminated by '\0'
ERROR: AnalysisException: Terminators can't contains \0

Multi-byte field delimiters are also supported for other file formats, but barely tested:

[nobida147:21000] > create table ccbn_par(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',,' escaped by '\\' lines terminated by '\n' stored as parquet;
Query: create table ccbn_par(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',,' escaped by '\\' lines terminated by '\n' stored as parquet
Fetched 0 row(s) in 0.14s
[nobida147:21000] > insert into ccbn_par select * from ccbn;
Query: insert into ccbn_par select * from ccbn
Inserted 3 row(s) in 5.14s
[nobida147:21000] > select * from ccbn_par;
Query: select * from ccbn_par
+--------------+--------------+------+------+
| col1         | col2         | col3 | col4 |
+--------------+--------------+------+------+
| abc , abc    | xyz \ xyz    | 1    | 2    |
| abc ,,, abc  | xyz \\\ xyz  | 3    | 4    |
| abc \,\, abc | xyz ,\,\ xyz | 5    | 6    |
+--------------+--------------+------+------+
Fetched 3 row(s) in 0.13s
[nobida147:21000] > select * from ccbn;
Query: select * from ccbn
+--------------+--------------+------+------+
| col1         | col2         | col3 | col4 |
+--------------+--------------+------+------+
| abc , abc    | xyz \ xyz    | 1    | 2    |
| abc ,,, abc  | xyz \\\ xyz  | 3    | 4    |
| abc \,\, abc | xyz ,\,\ xyz | 5    | 6    |
+--------------+--------------+------+------+
Fetched 3 row(s) in 0.13s
[nobida147:21000] > create table dhhp_par like dhhp stored as parquet;
Query: create table dhhp_par like dhhp stored as parquet
Fetched 0 row(s) in 0.13s
[nobida147:21000] > insert into dhhp_par select * from dhhp;
Query: insert into dhhp_par select * from dhhp
Inserted 3 row(s) in 0.34s
[nobida147:21000] > select * from dhhp;
Query: select * from dhhp
+--------------+--------------+------+------+
| col1         | col2         | col3 | col4 |
+--------------+--------------+------+------+
| abc $ abc    | xyz # xyz    | 1    | 2    |
| abc $$$ abc  | xyz ### xyz  | 3    | 4    |
| abc #$#$ abc | xyz $#$# xyz | 5    | 6    |
+--------------+--------------+------+------+
Fetched 3 row(s) in 0.13s
[nobida147:21000] > select * from dhhp_par;
Query: select * from dhhp_par
+--------------+--------------+------+------+
| col1         | col2         | col3 | col4 |
+--------------+--------------+------+------+
| abc $ abc    | xyz # xyz    | 1    | 2    |
| abc $$$ abc  | xyz ### xyz  | 3    | 4    |
| abc #$#$ abc | xyz $#$# xyz | 5    | 6    |
+--------------+--------------+------+------+
Fetched 3 row(s) in 0.13s
[nobida147:21000] > create table parquet_commacomma_backslash_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',,' escaped by '\\' lines terminated by '\n' stored as parquet;
Query: create table parquet_commacomma_backslash_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',,' escaped by '\\' lines terminated by '\n' stored as parquet
Fetched 0 row(s) in 0.13s
[nobida147:21000] > insert into parquet_commacomma_backslash_newline select * from text_commacomma_backslash_newline;
Query: insert into parquet_commacomma_backslash_newline select * from text_commacomma_backslash_newline
Inserted 5 row(s) in 3.40s
[nobida147:21000] > select * from text_commacomma_backslash_newline;
Query: select * from text_commacomma_backslash_newline
+----------+------+------+------+
| col1     | col2 | col3 | col4 |
+----------+------+------+------+
| one      | two  | 3    | 4    |
| one,one  | two  | 3    | 4    |
| one\     | two  | 3    | 4    |
| one\,one | two  | 3    | 4    |
| one\\    | two  | 3    | 4    |
+----------+------+------+------+
Fetched 5 row(s) in 0.13s
[nobida147:21000] > select * from parquet_commacomma_backslash_newline;
Query: select * from parquet_commacomma_backslash_newline
+----------+------+------+------+
| col1     | col2 | col3 | col4 |
+----------+------+------+------+
| one      | two  | 3    | 4    |
| one,one  | two  | 3    | 4    |
| one\     | two  | 3    | 4    |
| one\,one | two  | 3    | 4    |
| one\\    | two  | 3    | 4    |
+----------+------+------+------+
Fetched 5 row(s) in 0.13s
[nobida147:21000] > create table parquet_hashathash_ecirc_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by '#@#' escaped by '-22' lines terminated by '\n' stored as parquet;
Query: create table parquet_hashathash_ecirc_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by '#@#' escaped by '-22' lines terminated by '\n' stored as parquet
Fetched 0 row(s) in 0.12s
[nobida147:21000] > insert into parquet_hashathash_ecirc_newline select * from text_hashathash_ecirc_newline;
Query: insert into parquet_hashathash_ecirc_newline select * from text_hashathash_ecirc_newline
Inserted 5 row(s) in 5.07s
[nobida147:21000] > select * from parquet_hashathash_ecirc_newline;
Query: select * from parquet_hashathash_ecirc_newline
+---------------+------+------+------+
| col1          | col2 | col3 | col4 |
+---------------+------+------+------+
| one           | two  | 3    | 4    |
| one#@#one     | two  | 3    | 4    |
| one?1?7       | two  | 3    | 4    |
| one?1?7#@#one | two  | 3    | 4    |
| one?1?7?1?7   | two  | 3    | 4    |
+---------------+------+------+------+
Fetched 5 row(s) in 0.13s
[nobida147:21000] > select * from text_hashathash_ecirc_newline;
Query: select * from text_hashathash_ecirc_newline
+---------------+------+------+------+
| col1          | col2 | col3 | col4 |
+---------------+------+------+------+
| one           | two  | 3    | 4    |
| one#@#one     | two  | 3    | 4    |
| one?1?7       | two  | 3    | 4    |
| one?1?7#@#one | two  | 3    | 4    |
| one?1?7?1?7   | two  | 3    | 4    |
+---------------+------+------+------+
Fetched 5 row(s) in 0.13s

Field terminator setting

For now, you can use octal escapes or plain ASCII characters to set the field delimiter. For example, if you want to set "##" as the field delimiter, you can use: fields terminated by '\043#'. You can't use unicode, hexadecimal, or decimal notation ('\u0023', '\x23', and '35', respectively). I can't find a solution to un-escape hexadecimal and decimal, and there's a bug in the front-end SqlParser for unicode; I have opened an issue at https://issues.cloudera.org/browse/IMPALA-3777. After fixing this, we can also use unicode to set the field terminator.
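For reference, a hypothetical helper (not the patch's code; the actual un-escaping happens in the SQL scanner) showing how an octal escape such as '\043' maps back to its byte value, '#' (ASCII 35):

  // Replace every backslash followed by exactly three octal digits with the
  // byte it denotes; all other characters pass through unchanged.
  static String unescapeOctal(String s) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '\\' && i + 3 < s.length()
          && isOctal(s.charAt(i + 1)) && isOctal(s.charAt(i + 2)) && isOctal(s.charAt(i + 3))) {
        out.append((char) Integer.parseInt(s.substring(i + 1, i + 4), 8));
        i += 3;
      } else {
        out.append(c);
      }
    }
    return out.toString();  // unescapeOctal("\\043#") returns "##"
  }

  static boolean isOctal(char c) { return c >= '0' && c <= '7'; }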
------------------ Original Message ------------------
From: "Jim Apple" <[email protected]>
Date: Fri, Jul 22, 2016 6:22 PM
To: "dev" <[email protected]>
Cc: <[email protected]>
Subject: Re: IMPALA-2428 Support multiple-character string as the field delimiter

+cc: [email protected], in case they are not on the list.

On Wed, Jul 13, 2016 at 6:19 PM, Jim Apple <[email protected]> wrote:
> Can you please put your design here, rather than just telling us where
> you pasted it?
>
> On Tue, Jul 12, 2016 at 4:33 AM, Yuanhao Luo
> <[email protected]> wrote:
>> Hello, everyone. To fix IMPALA-2428, I have pushed a commit to Gerrit.
>> I have illustrated my design in detail in IMPALA-2428 and there are also
>> some test logs. Please read them carefully.
>>
>> The key point is that there are four constraints on a multi-byte field
>> terminator, as below:
>>
>> The field terminator can't be an empty string.
>>
>> The line terminator can't be the first byte of the field terminator.
>>
>> The escape character can't be the first byte of the field terminator.
>>
>> Terminators can't contain '\0' for text files.
>>
>> As suggested by Jim Apple, I'm starting this discussion on the mailing list
>> to do a design review.
>> Are there any problems in my design?
