Hello, everyone: I have pushed a commit to Gerrit which adds support for taking a multiple-character string as the field delimiter. Tests show that it works as expected, but there are still some constraints on delimiters. I will illustrate them below and give some test logs.
Terminator constraints

- The field terminator can't be an empty string.
- The line terminator can't be the first byte of the field terminator.
- The escape character can't be the first byte of the field terminator.
- Terminators can't contain '\0' for text files.

All of these constraints are enforced in CreateTableStmt.java:analyzeRowFormat(), and they apply only to text files.
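A minimal sketch of the four checks, modeled on what analyzeRowFormat() does in the patch (simplified: the real method works on the parsed ROW FORMAT clause, and AnalysisException here stands for Impala's frontend com.cloudera.impala.common.AnalysisException):

  static void analyzeRowFormat(String fieldDelim, char lineDelim, char escapeChar)
      throws AnalysisException {
    // Constraint 1: an empty field delimiter can never match anything.
    if (fieldDelim.isEmpty()) {
      throw new AnalysisException("Field delimiter can't be an empty string");
    }
    // Constraints 2 and 3: the scanner dispatches on single bytes, so neither
    // the line terminator nor the escape character may collide with the first
    // byte of the field delimiter.
    if (fieldDelim.charAt(0) == lineDelim) {
      throw new AnalysisException("Line delimiter can't be the first byte of "
          + "field delimiter, lineDelim: " + lineDelim + ", fieldDelim: " + fieldDelim);
    }
    if (fieldDelim.charAt(0) == escapeChar) {
      throw new AnalysisException("Escape character can't be the first byte of "
          + "field delimiter, escapeChar: " + escapeChar + ", fieldDelim: " + fieldDelim);
    }
    // Constraint 4: the metastore can't persist '\0' (see the Postgres logs below).
    if (fieldDelim.indexOf('\0') != -1 || lineDelim == '\0' || escapeChar == '\0') {
      throw new AnalysisException("Terminators can't contain \\0");
    }
  }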
Logs:

Field terminator is an empty string

[nobida147:21000] > create table tb1(id int) row format delimited fields terminated by "";
Query: create table tb1(id int) row format delimited fields terminated by ""
ERROR: AnalysisException: Field delimiter can't be an empty string
[nobida147:21000] > create table tb1(id int) row format delimited fields terminated by '';
Query: create table tb1(id int) row format delimited fields terminated by ''
ERROR: AnalysisException: Field delimiter can't be an empty string

Line terminator is the first byte of the field delimiter

If the tuple delimiter (line terminator), e.g. '#', is the same as the first byte of the field delimiter, e.g. "#@#", then according to the code at
https://gerrit.cloudera.org/#/c/3314/2/be/src/exec/delimited-text-parser.cc@143 ,
given the data "1#@#CLOUDERA#@#1#" and the table schema (id int, name string, age int), the parsed result would be:

id    name  age
1     NULL  NULL   (reached the first '#', meaning the tuple ends, so name and age are NULL)
NULL  NULL  NULL   ('@' can't be turned into an int, so id is NULL; name and age as above)
NULL  NULL  NULL   ('CLOUDERA' can't be turned into an int, so id is NULL; name and age as above)
NULL  NULL  NULL   ('@' can't be turned into an int, so id is NULL; name and age as above)
1     NULL  NULL

As shown above, the result means nothing, so in this commit the tuple delimiter can't be the same as the first byte of the field delimiter.
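A self-contained sketch of what goes wrong (this is not Impala's actual scanner, which works on byte buffers; it only mimics the behavior that the line delimiter is matched before the multi-byte field delimiter ever gets a chance):

  import java.util.ArrayList;
  import java.util.List;

  public class DelimiterClash {
    public static void main(String[] args) {
      String data = "1#@#CLOUDERA#@#1#";  // field delimiter "#@#", line delimiter '#'
      char lineDelim = '#';
      List<String> tuples = new ArrayList<>();
      StringBuilder cur = new StringBuilder();
      for (char c : data.toCharArray()) {
        if (c == lineDelim) {   // the line delimiter wins before "#@#" can match
          tuples.add(cur.toString());
          cur.setLength(0);
        } else {
          cur.append(c);
        }
      }
      // Prints [1, @, CLOUDERA, @, 1]: five bogus tuples instead of one row,
      // matching the five rows in the table above.
      System.out.println(tuples);
    }
  }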
In addition, according to the code at
https://gerrit.cloudera.org/#/c/3314/2/fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java@177 ,
if we find that the tuple delimiter is the first byte of the field delimiter, we replace the tuple delimiter with DEFAULT_LINE_DELIM ('\n'). The tests below show that this makes sense.
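A minimal sketch of that fallback (the constant name follows the patch's HdfsStorageDescriptor; the char-based helper is a simplified stand-in for the byte-level logic there):

  static final char DEFAULT_LINE_DELIM = '\n';

  // If the line delimiter collides with the field delimiter's first byte,
  // fall back to '\n' so the scanner can still separate tuples.
  static char resolveLineDelim(char lineDelim, String fieldDelim) {
    if (!fieldDelim.isEmpty() && fieldDelim.charAt(0) == lineDelim) {
      return DEFAULT_LINE_DELIM;
    }
    return lineDelim;
  }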
[root@nobida147 workspace]# cat tuple_in_field_oneline.dat
1#@#CLOUDERA#@#1#2#@#IMPALA#@#2#3#@##@#3##@#id null#@#4#5#@#age null#@#
[root@nobida147 workspace]# cat tuple_in_field_multiline.dat
1#@#CLOUDERA#@#1
2#@#IMPALA#@#2
3#@##@#3
#@#id null#@#4
5#@#age null#@#
[nobida147:21000] > load data inpath 'hdfs://localhost:20500/user/root/tuple_in_field_oneline.dat' into table tuple_in_field;
Query: load data inpath 'hdfs://localhost:20500/user/root/tuple_in_field_oneline.dat' into table tuple_in_field
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Fetched 1 row(s) in 5.49s
[nobida147:21000] > load data inpath 'hdfs://localhost:20500/user/root/tuple_in_field_multiline.dat' into table tuple_in_field;
Query: load data inpath 'hdfs://localhost:20500/user/root/tuple_in_field_multiline.dat' into table tuple_in_field
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 2 |
+----------------------------------------------------------+
Fetched 1 row(s) in 0.23s
[nobida147:21000] > select * from tuple_in_field;
Query: select * from tuple_in_field
+------+----------+------+
| id   | name     | age  |
+------+----------+------+
| 1    | CLOUDERA | NULL |
| 1    | CLOUDERA | 1    |
| 2    | IMPALA   | 2    |
| 3    |          | 3    |
| NULL | id null  | 4    |
| 5    | age null | NULL |
+------+----------+------+
WARNINGS: Error converting column: 2 TO INT (Data is: 1#2)
file: hdfs://localhost:20500/test-warehouse/db3.db/tuple_in_field/tuple_in_field_oneline.dat
record: 1#@#CLOUDERA#@#1#2#@#IMPALA#@#2#3#@##@#3##@#id null#@#4#5#@#age null#@#
Fetched 6 row(s) in 0.74s

For "1#@#CLOUDERA#@#1#2#@#IMPALA#@#2#3#@##@#3##@#id null#@#4#5#@#age null#@#", the '#' tuple delimiter has been replaced with '\n', so when we come to "1#2#@#", the first '#' isn't parsed as a tuple delimiter; therefore, when we try to turn "1#2" into an int column, the warning above occurs.

After adding the constraint, an exception is thrown to warn the user:

[nobida147:21000] > create table tb1(id int) row format delimited fields terminated by '#\043' lines terminated by '#';
Query: create table tb1(id int) row format delimited fields terminated by '#\043' lines terminated by '#'
ERROR: AnalysisException: Line delimiter can't be the first byte of field delimiter, lineDelim: #, fieldDelim: ##

Escape character is the first byte of the field delimiter

If the escape character is the first byte of the field delimiter, then when we read that byte we can't know whether it is the escape character or the beginning of the field delimiter.
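To see the ambiguity concretely, here is a self-contained sketch (hypothetical, not Impala code) in which the same bytes decode to different fields depending on which interpretation the scanner tries first, with escape '#' and field delimiter "#@":

  import java.util.Arrays;
  import java.util.List;

  public class EscapeClash {
    public static void main(String[] args) {
      String data = "a#@b";
      // Reading 1: '#' escapes the next byte, so '@' is literal: one field "a@b".
      System.out.println(decodeEscapeFirst(data));          // [a@b]
      // Reading 2: "#@" is the field delimiter: two fields "a" and "b".
      System.out.println(Arrays.asList(data.split("#@")));  // [a, b]
    }

    static List<String> decodeEscapeFirst(String s) {
      StringBuilder out = new StringBuilder();
      for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) == '#' && i + 1 < s.length()) {
          out.append(s.charAt(++i));  // escape: take the next byte literally
        } else {
          out.append(s.charAt(i));
        }
      }
      return Arrays.asList(out.toString());
    }
  }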
After adding the constraint, an exception is thrown to warn the user:

[nobida147:21000] > create table tb1(id int) row format delimited fields terminated by '#\043' escaped by '#';
Query: create table tb1(id int) row format delimited fields terminated by '#\043' escaped by '#'
ERROR: AnalysisException: Escape character can't be the first byte of field delimiter, escapeChar: #, fieldDelim: ##

Terminators including '\0'

I have tried to create a table with '\0' as the field delimiter, or with '\0' as the escape character, but both failed. The log shows "ERROR: invalid byte sequence for encoding "UTF8": 0x00". Even if I use "--encoding=LATIN1" to init the postgres db, the same error occurs (Postgres does not allow the NUL byte 0x00 in text values regardless of the database encoding, which would explain why LATIN1 doesn't help). I was wondering whether you have tested these corner cases before?

[nobida147:21000] > create table single_null(id int, name string, age int) row format delimited fields terminated by "\u0000";
Query: create table single_null(id int, name string, age int) row format delimited fields terminated by "\u0000"
ERROR: ImpalaRuntimeException: Error making 'createTable' RPC to Hive Metastore:
CAUSED BY: MetaException: javax.jdo.JDODataStoreException: Put request failed : INSERT INTO "SERDE_PARAMS" ("PARAM_VALUE","SERDE_ID","PARAM_KEY") VALUES (?,?,?)
	at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451)
	at org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:732)
	at org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752)
	at org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:902)
	. . .
NestedThrowablesStackTrace:
org.datanucleus.store.rdbms.exceptions.MappedDatastoreException: INSERT INTO "SERDE_PARAMS" ("PARAM_VALUE","SERDE_ID","PARAM_KEY") VALUES (?,?,?)
	. . .
*-------------Attention here---------------------*
*Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00*
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1835)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:500)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:334)
	at com.jolbox.bonecp.PreparedStatementHandle.executeUpdate(PreparedStatementHandle.java:205)
	at org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeUpdate(ParamLoggingPreparedStatement.java:399)
	at org.datanucleus.store.rdbms.SQLController.executeStatementUpdate(SQLController.java:439)
	at org.datanucleus.store.rdbms.scostore.JoinMapStore.internalPut(JoinMapStore.java:1069)
	... 68 more
[nobida147:21000] > create table single_null(id int, name string, age int) row format delimited lines terminated by '\0';
Query: create table single_null(id int, name string, age int) row format delimited lines terminated by '\0'
ERROR: ImpalaRuntimeException: Error making 'createTable' RPC to Hive Metastore:
CAUSED BY: MetaException: javax.jdo.JDODataStoreException: Put request failed : INSERT INTO "SERDE_PARAMS" ("PARAM_VALUE","SERDE_ID","PARAM_KEY") VALUES (?,?,?)
	. . .
Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00
	. . .
	... 68 more
[nobida147:21000] > create table single_null(id int, name string, age int) row format delimited escaped by '\0';
Query: create table single_null(id int, name string, age int) row format delimited escaped by '\0'
ERROR: ImpalaRuntimeException: Error making 'createTable' RPC to Hive Metastore:
CAUSED BY: MetaException: javax.jdo.JDODataStoreException: Put request failed : INSERT INTO "SERDE_PARAMS" ("PARAM_VALUE","SERDE_ID","PARAM_KEY") VALUES (?,?,?)
	. . .
Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00
	. . .
	... 68 more

It seems that the error is caused by postgres. For now, this commit doesn't support delimiters (field, line, or escape character) that include '\0'.
After adding the constraint, an exception is thrown to warn the user:

[nobida147:21000] > create table tb1(id int) row format delimited fields terminated by '\0';
Query: create table tb1(id int) row format delimited fields terminated by '\0'
ERROR: AnalysisException: Terminators can't contains \0
[nobida147:21000] > create table tb1(id int) row format delimited fields terminated by '#\043' lines terminated by '\0';
Query: create table tb1(id int) row format delimited fields terminated by '#\043' lines terminated by '\0'
ERROR: AnalysisException: Terminators can't contains \0

Multi-byte field delimiters are also supported for other file formats, but barely tested:

[nobida147:21000] > create table ccbn_par(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',,' escaped by '\\' lines terminated by '\n' stored as parquet;
Query: create table ccbn_par(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',,' escaped by '\\' lines terminated by '\n' stored as parquet
Fetched 0 row(s) in 0.14s
[nobida147:21000] > insert into ccbn_par select * from ccbn;
Query: insert into ccbn_par select * from ccbn
Inserted 3 row(s) in 5.14s
[nobida147:21000] > select * from ccbn_par;
Query: select * from ccbn_par
+--------------+--------------+------+------+
| col1         | col2         | col3 | col4 |
+--------------+--------------+------+------+
| abc , abc    | xyz \ xyz    | 1    | 2    |
| abc ,,, abc  | xyz \\\ xyz  | 3    | 4    |
| abc \,\, abc | xyz ,\,\ xyz | 5    | 6    |
+--------------+--------------+------+------+
Fetched 3 row(s) in 0.13s
[nobida147:21000] > select * from ccbn;
Query: select * from ccbn
+--------------+--------------+------+------+
| col1         | col2         | col3 | col4 |
+--------------+--------------+------+------+
| abc , abc    | xyz \ xyz    | 1    | 2    |
| abc ,,, abc  | xyz \\\ xyz  | 3    | 4    |
| abc \,\, abc | xyz ,\,\ xyz | 5    | 6    |
+--------------+--------------+------+------+
Fetched 3 row(s) in 0.13s
[nobida147:21000] > create table dhhp_par like dhhp stored as parquet;
Query: create table dhhp_par like dhhp stored as parquet
Fetched 0 row(s) in 0.13s
[nobida147:21000] > insert into dhhp_par select * from dhhp;
Query: insert into dhhp_par select * from dhhp
Inserted 3 row(s) in 0.34s
[nobida147:21000] > select * from dhhp;
Query: select * from dhhp
+--------------+--------------+------+------+
| col1         | col2         | col3 | col4 |
+--------------+--------------+------+------+
| abc $ abc    | xyz # xyz    | 1    | 2    |
| abc $$$ abc  | xyz ### xyz  | 3    | 4    |
| abc #$#$ abc | xyz $#$# xyz | 5    | 6    |
+--------------+--------------+------+------+
Fetched 3 row(s) in 0.13s
[nobida147:21000] > select * from dhhp_par;
Query: select * from dhhp_par
+--------------+--------------+------+------+
| col1         | col2         | col3 | col4 |
+--------------+--------------+------+------+
| abc $ abc    | xyz # xyz    | 1    | 2    |
| abc $$$ abc  | xyz ### xyz  | 3    | 4    |
| abc #$#$ abc | xyz $#$# xyz | 5    | 6    |
+--------------+--------------+------+------+
Fetched 3 row(s) in 0.13s
[nobida147:21000] > create table parquet_commacomma_backslash_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',,' escaped by '\\' lines terminated by '\n' stored as parquet;
Query: create table parquet_commacomma_backslash_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by ',,' escaped by '\\' lines terminated by '\n' stored as parquet
Fetched 0 row(s) in 0.13s
[nobida147:21000] > insert into parquet_commacomma_backslash_newline select * from text_commacomma_backslash_newline;
Query: insert into parquet_commacomma_backslash_newline select * from text_commacomma_backslash_newline
Inserted 5 row(s) in 3.40s
[nobida147:21000] > select * from text_commacomma_backslash_newline;
Query: select * from text_commacomma_backslash_newline
+----------+------+------+------+
| col1     | col2 | col3 | col4 |
+----------+------+------+------+
| one      | two  | 3    | 4    |
| one,one  | two  | 3    | 4    |
| one\     | two  | 3    | 4    |
| one\,one | two  | 3    | 4    |
| one\\    | two  | 3    | 4    |
+----------+------+------+------+
Fetched 5 row(s) in 0.13s
[nobida147:21000] > select * from parquet_commacomma_backslash_newline;
Query: select * from parquet_commacomma_backslash_newline
+----------+------+------+------+
| col1     | col2 | col3 | col4 |
+----------+------+------+------+
| one      | two  | 3    | 4    |
| one,one  | two  | 3    | 4    |
| one\     | two  | 3    | 4    |
| one\,one | two  | 3    | 4    |
| one\\    | two  | 3    | 4    |
+----------+------+------+------+
Fetched 5 row(s) in 0.13s
[nobida147:21000] > create table parquet_hashathash_ecirc_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by '#@#' escaped by '-22' lines terminated by '\n' stored as parquet;
Query: create table parquet_hashathash_ecirc_newline(col1 string, col2 string, col3 int, col4 int) row format delimited fields terminated by '#@#' escaped by '-22' lines terminated by '\n' stored as parquet
Fetched 0 row(s) in 0.12s
[nobida147:21000] > insert into parquet_hashathash_ecirc_newline select * from text_hashathash_ecirc_newline;
Query: insert into parquet_hashathash_ecirc_newline select * from text_hashathash_ecirc_newline
Inserted 5 row(s) in 5.07s
[nobida147:21000] > select * from parquet_hashathash_ecirc_newline;
Query: select * from parquet_hashathash_ecirc_newline
+---------------+------+------+------+
| col1          | col2 | col3 | col4 |
+---------------+------+------+------+
| one           | two  | 3    | 4    |
| one#@#one     | two  | 3    | 4    |
| one?1?7       | two  | 3    | 4    |
| one?1?7#@#one | two  | 3    | 4    |
| one?1?7?1?7   | two  | 3    | 4    |
+---------------+------+------+------+
Fetched 5 row(s) in 0.13s
[nobida147:21000] > select * from text_hashathash_ecirc_newline;
Query: select * from text_hashathash_ecirc_newline
+---------------+------+------+------+
| col1          | col2 | col3 | col4 |
+---------------+------+------+------+
| one           | two  | 3    | 4    |
| one#@#one     | two  | 3    | 4    |
| one?1?7       | two  | 3    | 4    |
| one?1?7#@#one | two  | 3    | 4    |
| one?1?7?1?7   | two  | 3    | 4    |
+---------------+------+------+------+
Fetched 5 row(s) in 0.13s

Field terminator setting

For now, you can use octal escapes or plain ASCII characters to set the field delimiter. For example, if you want to set "##" as the field delimiter, you can use: fields terminated by '\043#'. You can't use unicode, hexadecimal, or decimal notation ('\u0023', '\x23', and '35', respectively). I can't find a solution to un-escape hexadecimal and decimal, and there's a bug in the front-end SqlParser for unicode; I have opened an issue at https://issues.cloudera.org/browse/IMPALA-3777. After fixing this, we can also use unicode to set the field terminator.
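For reference, a hypothetical helper (not the patch's code; the actual un-escaping happens in the SQL scanner) showing how an octal escape such as '\043' maps back to its byte value, '#' (ASCII 35):

  // Replace every backslash followed by exactly three octal digits with the
  // byte it denotes; all other characters pass through unchanged.
  static String unescapeOctal(String s) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '\\' && i + 3 < s.length()
          && isOctal(s.charAt(i + 1)) && isOctal(s.charAt(i + 2)) && isOctal(s.charAt(i + 3))) {
        out.append((char) Integer.parseInt(s.substring(i + 1, i + 4), 8));
        i += 3;
      } else {
        out.append(c);
      }
    }
    return out.toString();  // unescapeOctal("\\043#") returns "##"
  }

  static boolean isOctal(char c) { return c >= '0' && c <= '7'; }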
------------------ Original Message ------------------
From: "Jim Apple" <[email protected]>
Date: Fri, Jul 22, 2016 6:22 PM
To: "dev" <[email protected]>
Cc: <[email protected]>
Subject: Re: IMPALA-2428 Support multiple-character string as the field delimiter

+cc: [email protected], in case they are not on the list.

On Wed, Jul 13, 2016 at 6:19 PM, Jim Apple <[email protected]> wrote:
> Can you please put your design here, rather than just telling us where
> you pasted it?
>
> On Tue, Jul 12, 2016 at 4:33 AM, Yuanhao Luo
> <[email protected]> wrote:
>> Hello, everyone. To fix IMPALA-2428, I have pushed a commit to Gerrit.
>> I have illustrated my design in detail in IMPALA-2428 and there are also
>> some test logs. Please read them carefully.
>>
>> The key point is that there are four constraints on a multi-byte field
>> terminator, as below:
>>
>> The field terminator can't be an empty string.
>>
>> The line terminator can't be the first byte of the field terminator.
>>
>> The escape character can't be the first byte of the field terminator.
>>
>> Terminators can't contain '\0' for text files.
>>
>> As suggested by Jim Apple, I'm starting this discussion on the mailing list
>> to do a design review.
>> Are there any problems in my design?
