firecast opened a new issue #879: Hive Sync Error when creating a table with partition
URL: https://github.com/apache/incubator-hudi/issues/879

I have a bunch of data that I am writing to S3, and I am doing a Hive sync during the write process.

Hudi version - `0.5.0-SNAPSHOT` from the master branch
Hive version - `Hive 2.3.2-amzn-2`
Spark version - `2.4.3`

**Sample Data Frame**
```
+-----------+--------+--------------------+----------------+--------------+---------+---+----------+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------+
| gender|comments| title| cc| ip_address|last_name| id| birthdate| salary| registration_dttm| country| email|first_name| key | timestamp | date |
+-----------+--------+--------------------+----------------+--------------+---------+---+----------+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------+
| Female| 1E+02| Internal Auditor|6759521864920116| 1.197.201.2| Jordan| 1| 3/8/1971| 49756.53|2016-02-03T07:55:29Z| Indonesia| [email protected]| Amanda|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Male| | Accountant IV| |218.111.175.34| Freeman| 2| 1/16/1968|150280.17|2016-02-03T17:04:03Z| Canada| [email protected]| Albert|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Female| | Structural Engineer|6767119071901597| 7.161.136.94| Morgan| 3| 2/1/1960|144972.51|2016-02-03T01:09:31Z| Russia|emorgan2@altervis...| Evelyn|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Female| |Senior Cost Accou...|3576031598965625| 140.35.109.83| Riley| 4| 4/8/1997| 90263.05|2016-02-03T12:36:21Z| China| [email protected]| Denise|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| | | |5602256255204850|169.113.235.40| Burns| 5| | |2016-02-03T05:05:31Z| South Africa|cburns4@miitbeian...| Carlos|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
|Transgender| | Account Executive|3583136326049310|195.131.81.179| White| 6| 2/25/1983| 69227.11|2016-02-03T07:22:34Z| Indonesia| [email protected]| Kathryn|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Unknown| |Senior Financial ...|3582641366974690|232.234.81.197| Holmes| 7|12/18/1987| 14247.62|2016-02-03T08:33:08Z| Portugal|[email protected]| Samuel|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Secret| | Web Developer IV| | 91.235.51.73| Howell| 8| 3/1/1962|186469.43|2016-02-03T06:47:06Z|Bosnia and Herzeg...| [email protected]| Harry|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Obvious| 1E+02|Software Test Eng...| | 132.31.53.61| Foster| 9| 3/27/1992|231067.84|2016-02-03T03:52:53Z| South Korea| [email protected]| Jose|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Female| | Health Coach IV|3574254110301671|143.28.251.245| Stewart| 10| 1/28/1997| 27234.28|2016-02-03T18:29:47Z| Nigeria|estewart9@opensou...| Emily|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
+-----------+--------+--------------------+----------------+--------------+---------+---+----------+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------+
```

**Spark Scala Code**
```scala
cleanedDF
  .write.format("org.apache.hudi")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "date")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
  .option(HoodieWriteConfig.TABLE_NAME, catalogName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "date")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, catalogName)
  .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, sparkConfig.hiveJDBCUri)
  .option("path", basePath)
  .mode(SaveMode.Append)
  .save()
```
**Error**
```
837579 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] WARN com.amazonaws.services.s3.internal.S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
837580 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] INFO org.apache.hudi.hive.HiveSyncTool - Table hudi_test is not found. Creating it
837594 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] INFO org.apache.hudi.hive.HoodieHiveClient - Creating table with CREATE EXTERNAL TABLE IF NOT EXISTS default.hudi_test( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `gender` string, `comments` string, `title` string, `cc` string, `ip_address` string, `last_name` string, `id` bigint, `birthdate` string, `salary` string, `registration_dttm` string, `country` string, `email` string, `first_name` string, `key` string, `timestamp` bigint) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<some-bucket>/catalogs/hudi_test/hudi'
837602 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] INFO org.apache.hudi.hive.HoodieHiveClient - Executing SQL CREATE EXTERNAL TABLE IF NOT EXISTS default.hudi_test( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `gender` string, `comments` string, `title` string, `cc` string, `ip_address` string, `last_name` string, `id` bigint, `birthdate` string, `salary` string, `registration_dttm` string, `country` string, `email` string, `first_name` string, `key` string, `timestamp` bigint) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<some-bucket>/catalogs/hudi_test/hudi'
837698 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] ERROR org.apache.spark.sql.execution.streaming.MicroBatchExecution - Query [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420] terminated with error
org.apache.hudi.hive.HoodieHiveSyncException: Failed in executing SQL CREATE EXTERNAL TABLE IF NOT EXISTS default.hudi_test( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `gender` string, `comments` string, `title` string, `cc` string, `ip_address` string, `last_name` string, `id` bigint, `birthdate` string, `salary` string, `registration_dttm` string, `country` string, `email` string, `first_name` string, `key` string, `timestamp` bigint) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<some-bucket>/catalogs/hudi_test/hudi'
    at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:467)
    at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:265)
    at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
    at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:522 cannot recognize input near 'date' 'string' ')' in column specification
    at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:267)
    at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:253)
    at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:313)
    at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:253)
    at org.apache.hudi.org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
    at org.apache.hudi.org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
    at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:465)
    ... 53 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:522 cannot recognize input near 'date' 'string' ')' in column specification
    at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
    at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:206)
    at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290)
    at org.apache.hive.service.cli.operation.Operation.run(Operation.java:320)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:530)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:517)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
    at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
    at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
    at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
    at com.sun.proxy.$Proxy35.executeStatementAsync(Unknown Source)
    at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:310)
    at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:530)
    at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
    at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.parse.ParseException:line 1:522 cannot recognize input near 'date' 'string' ')' in column specification
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:211)
    at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
    at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
    at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1295)
    at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:204)
    ... 27 more
```

As far as I can tell, Hive is rejecting the `PARTITIONED BY (date string)` clause: `date` is a reserved keyword in Hive, and the DDL generated by the sync does not escape the partition column name with backticks, even though all the non-partition columns above are backtick-quoted.
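To sanity-check that the unescaped keyword is the only problem, the failing clause can be replayed against HiveServer2 directly over JDBC. Below is a minimal, hypothetical sketch (not code Hudi runs): the JDBC URI is a placeholder for your HiveServer2 endpoint, `hudi_test_check` is a throwaway table name I made up, and it assumes the `hive-jdbc` driver is on the classpath.

```scala
import java.sql.DriverManager

// Hypothetical verification sketch: show that the unquoted reserved word `date`
// triggers the same ParseException, while a backticked `date` parses fine.
object VerifyPartitionKeyword {
  def main(args: Array[String]): Unit = {
    // Placeholder URI; substitute the same HiveServer2 endpoint used for the sync.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    val stmt = conn.createStatement()
    try {
      // Unquoted reserved word: expect the ParseException from the log above.
      try {
        stmt.execute(
          "CREATE TABLE hudi_test_check(`id` bigint) PARTITIONED BY (date string) STORED AS PARQUET")
      } catch {
        case e: java.sql.SQLException => println(s"unquoted `date` rejected: ${e.getMessage}")
      }
      // Backticked reserved word: parses and executes.
      stmt.execute(
        "CREATE TABLE hudi_test_check(`id` bigint) PARTITIONED BY (`date` string) STORED AS PARQUET")
      println("backticked `date` accepted")
      stmt.execute("DROP TABLE hudi_test_check")
    } finally {
      stmt.close()
      conn.close()
    }
  }
}
```

Run against the same HiveServer2, this should print the identical `cannot recognize input near 'date' 'string' ')'` message for the unquoted form.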
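Until the sync itself quotes partition fields in the generated DDL, one possible user-side workaround (a sketch, untested here) is to rename the partition column before the write so the DDL never contains a bare `date` identifier. `cleanedDF`, `catalogName`, `basePath`, and `sparkConfig` are as in the snippet above; `partition_date` is a name picked purely for illustration.

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Rename the reserved-word column so the Hive DDL generated by the sync
// contains `partition_date` instead of the bare keyword `date`.
val renamedDF = cleanedDF.withColumnRenamed("date", "partition_date")

renamedDF
  .write.format("org.apache.hudi")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition_date") // renamed column
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
  .option(HoodieWriteConfig.TABLE_NAME, catalogName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "partition_date") // renamed column
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, catalogName)
  .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, sparkConfig.hiveJDBCUri)
  .option("path", basePath)
  .mode(SaveMode.Append)
  .save()
```

The obvious downside is that the Hive table and the partition paths end up with the new column name, so anything downstream that expects `date` would need to change.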
