firecast opened a new issue #879: Hive Sync Error when creating a table with partition
URL: https://github.com/apache/incubator-hudi/issues/879

I have a bunch of data that I am writing to S3, and I am doing a Hive sync during the write process.

Hudi version - `0.5.0-SNAPSHOT` from the master branch
Hive version - `Hive 2.3.2-amzn-2`
Spark version - `2.4.3`

**Sample Data Frame**
```
+-----------+--------+--------------------+----------------+--------------+---------+---+----------+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------+
| gender|comments| title| cc| ip_address|last_name| id| birthdate| salary| registration_dttm| country| email|first_name| key | timestamp | date |
+-----------+--------+--------------------+----------------+--------------+---------+---+----------+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------+
| Female| 1E+02| Internal Auditor|6759521864920116| 1.197.201.2| Jordan| 1| 3/8/1971| 49756.53|2016-02-03T07:55:29Z| Indonesia| [email protected]| Amanda|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Male| | Accountant IV| |218.111.175.34| Freeman| 2| 1/16/1968|150280.17|2016-02-03T17:04:03Z| Canada| [email protected]| Albert|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Female| | Structural Engineer|6767119071901597| 7.161.136.94| Morgan| 3| 2/1/1960|144972.51|2016-02-03T01:09:31Z| Russia|emorgan2@altervis...| Evelyn|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Female| |Senior Cost Accou...|3576031598965625| 140.35.109.83| Riley| 4| 4/8/1997| 90263.05|2016-02-03T12:36:21Z| China| [email protected]| Denise|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| | | |5602256255204850|169.113.235.40| Burns| 5| | |2016-02-03T05:05:31Z| South Africa|cburns4@miitbeian...| Carlos|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
|Transgender| | Account Executive|3583136326049310|195.131.81.179| White| 6| 2/25/1983| 69227.11|2016-02-03T07:22:34Z| Indonesia| [email protected]| Kathryn|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Unknown| |Senior Financial ...|3582641366974690|232.234.81.197| Holmes| 7|12/18/1987| 14247.62|2016-02-03T08:33:08Z| Portugal|[email protected]| Samuel|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Secret| | Web Developer IV| | 91.235.51.73| Howell| 8| 3/1/1962|186469.43|2016-02-03T06:47:06Z|Bosnia and Herzeg...| [email protected]| Harry|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Obvious| 1E+02|Software Test Eng...| | 132.31.53.61| Foster| 9| 3/27/1992|231067.84|2016-02-03T03:52:53Z| South Korea| [email protected]| Jose|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
| Female| | Health Coach IV|3574254110301671|143.28.251.245| Stewart| 10| 1/28/1997| 27234.28|2016-02-03T18:29:47Z| Nigeria|estewart9@opensou...| Emily|aHVkaV8wXzRfN190Z...|2019-09-04 14:28:...| 2019/09/04|
+-----------+--------+--------------------+----------------+--------------+---------+---+----------+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+--------------+
```

**Spark Scala Code**
```scala
cleanedDF
  .write.format("org.apache.hudi")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "date")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
  .option(HoodieWriteConfig.TABLE_NAME, catalogName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "date")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, catalogName)
  .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, sparkConfig.hiveJDBCUri)
  .option("path", basePath)
  .mode(SaveMode.Append)
  .save()
```
**Error**
```
837579 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] WARN com.amazonaws.services.s3.internal.S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
837580 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] INFO org.apache.hudi.hive.HiveSyncTool - Table hudi_test is not found. Creating it
837594 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] INFO org.apache.hudi.hive.HoodieHiveClient - Creating table with CREATE EXTERNAL TABLE IF NOT EXISTS default.hudi_test( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `gender` string, `comments` string, `title` string, `cc` string, `ip_address` string, `last_name` string, `id` bigint, `birthdate` string, `salary` string, `registration_dttm` string, `country` string, `email` string, `first_name` string, `key` string, `timestamp` bigint) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<some-bucket>/catalogs/hudi_test/hudi'
837602 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] INFO org.apache.hudi.hive.HoodieHiveClient - Executing SQL CREATE EXTERNAL TABLE IF NOT EXISTS default.hudi_test( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `gender` string, `comments` string, `title` string, `cc` string, `ip_address` string, `last_name` string, `id` bigint, `birthdate` string, `salary` string, `registration_dttm` string, `country` string, `email` string, `first_name` string, `key` string, `timestamp` bigint) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<some-bucket>/catalogs/hudi_test/hudi'
837698 [stream execution thread for [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420]] ERROR org.apache.spark.sql.execution.streaming.MicroBatchExecution - Query [id = 2bb17a39-4a86-4952-8378-0d431b1ba74f, runId = a6d04a3e-1316-407f-a41e-1b24822b3420] terminated with error
org.apache.hudi.hive.HoodieHiveSyncException: Failed in executing SQL CREATE EXTERNAL TABLE IF NOT EXISTS default.hudi_test( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `gender` string, `comments` string, `title` string, `cc` string, `ip_address` string, `last_name` string, `id` bigint, `birthdate` string, `salary` string, `registration_dttm` string, `country` string, `email` string, `first_name` string, `key` string, `timestamp` bigint) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<some-bucket>/catalogs/hudi_test/hudi'
    at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:467)
    at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:265)
    at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
    at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:522 cannot recognize input near 'date' 'string' ')' in column specification
    at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:267)
    at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:253)
    at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:313)
    at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:253)
    at org.apache.hudi.org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
    at org.apache.hudi.org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)
    at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:465)
    ... 53 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:522 cannot recognize input near 'date' 'string' ')' in column specification
    at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
    at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:206)
    at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290)
    at org.apache.hive.service.cli.operation.Operation.run(Operation.java:320)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:530)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:517)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
    at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
    at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
    at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
    at com.sun.proxy.$Proxy35.executeStatementAsync(Unknown Source)
    at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:310)
    at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:530)
    at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
    at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.parse.ParseException:line 1:522 cannot recognize input near 'date' 'string' ')' in column specification
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:211)
    at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
    at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
    at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1295)
    at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:204)
    ... 27 more
```

As far as I can tell, Hive is rejecting the `PARTITIONED BY (date string)` clause: `date` is a reserved keyword in Hive, and the DDL generated by the sync does not escape the partition column name with backticks, even though all the non-partition columns above are backtick-quoted.
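To sanity-check that the unescaped keyword is the only problem, the failing clause can be replayed against HiveServer2 directly over JDBC. Below is a minimal, hypothetical sketch (not code Hudi runs): the JDBC URI is a placeholder for your HiveServer2 endpoint, `hudi_test_check` is a throwaway table name I made up, and it assumes the `hive-jdbc` driver is on the classpath.

```scala
import java.sql.DriverManager

// Hypothetical verification sketch: show that the unquoted reserved word `date`
// triggers the same ParseException, while a backticked `date` parses fine.
object VerifyPartitionKeyword {
  def main(args: Array[String]): Unit = {
    // Placeholder URI; substitute the same HiveServer2 endpoint used for the sync.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    val stmt = conn.createStatement()
    try {
      // Unquoted reserved word: expect the ParseException from the log above.
      try {
        stmt.execute(
          "CREATE TABLE hudi_test_check(`id` bigint) PARTITIONED BY (date string) STORED AS PARQUET")
      } catch {
        case e: java.sql.SQLException => println(s"unquoted `date` rejected: ${e.getMessage}")
      }
      // Backticked reserved word: parses and executes.
      stmt.execute(
        "CREATE TABLE hudi_test_check(`id` bigint) PARTITIONED BY (`date` string) STORED AS PARQUET")
      println("backticked `date` accepted")
      stmt.execute("DROP TABLE hudi_test_check")
    } finally {
      stmt.close()
      conn.close()
    }
  }
}
```

Run against the same HiveServer2, this should print the identical `cannot recognize input near 'date' 'string' ')'` message for the unquoted form.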
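Until the sync itself quotes partition fields in the generated DDL, one possible user-side workaround (a sketch, untested here) is to rename the partition column before the write so the DDL never contains a bare `date` identifier. `cleanedDF`, `catalogName`, `basePath`, and `sparkConfig` are as in the snippet above; `partition_date` is a name picked purely for illustration.

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Rename the reserved-word column so the Hive DDL generated by the sync
// contains `partition_date` instead of the bare keyword `date`.
val renamedDF = cleanedDF.withColumnRenamed("date", "partition_date")

renamedDF
  .write.format("org.apache.hudi")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition_date") // renamed column
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
  .option(HoodieWriteConfig.TABLE_NAME, catalogName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "partition_date") // renamed column
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, catalogName)
  .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, sparkConfig.hiveJDBCUri)
  .option("path", basePath)
  .mode(SaveMode.Append)
  .save()
```

The obvious downside is that the Hive table and the partition paths end up with the new column name, so anything downstream that expects `date` would need to change.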
