novakov-alexey opened a new issue #3617:
URL: https://github.com/apache/hudi/issues/3617
**To Reproduce**
I am getting an exception during Hive sync to the AWS Glue Catalog when writing an empty DataFrame (with a defined schema) to S3.
Steps to reproduce the behavior:
1. Create an empty DataFrame with a schema, for example:
```scala
val schema: StructType = ???
val df = session.createDataFrame(session.sparkContext.emptyRDD[Row], schema)
```
2. Write this DataFrame to S3 with Hive sync enabled, for example:
```scala
df.write.format("hudi")
  .options(options)
  .mode(SaveMode.Overwrite)
  .save("s3://my-datalake/my-table...")
```
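For a self-contained reproduction, a concrete schema can be substituted for the `???` placeholder above. The fields below are purely illustrative; any non-empty `StructType` exercises the same code path:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val session = SparkSession.builder().getOrCreate()

// Hypothetical schema for illustration only.
val schema: StructType = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// Zero rows, but a fully defined schema.
val df = session.createDataFrame(session.sparkContext.emptyRDD[Row], schema)
```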
My Hudi writer options:
```scala
val initLoadConfig = Map(
  BULKINSERT_PARALLELISM -> "4",
  INSERT_PARALLELISM -> "4",
  UPSERT_PARALLELISM -> "4",
  DELETE_PARALLELISM -> "4"
)

val unpartitionDataConfig = Map(
  HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> "org.apache.hudi.hive.NonPartitionedExtractor",
  KEYGENERATOR_CLASS_PROP -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator"
)

def writerOptions(table: String, primaryKey: String, database: String) = {
  Map(
    OPERATION_OPT_KEY -> BULK_INSERT_OPERATION_OPT_VAL,
    PRECOMBINE_FIELD_PROP -> "some field here",
    RECORDKEY_FIELD_OPT_KEY -> primaryKey,
    TABLE_NAME -> table,
    "hoodie.consistency.check.enabled" -> "true",
    ENABLE_ROW_WRITER_OPT_KEY -> "true",
    HIVE_USE_JDBC_OPT_KEY -> "true",
    HIVE_SYNC_ENABLED_OPT_KEY -> "true",
    HIVE_SUPPORT_TIMESTAMP -> "true",
    HIVE_DATABASE_OPT_KEY -> database,
    HIVE_TABLE_OPT_KEY -> table
  ) ++ initLoadConfig ++ unpartitionDataConfig
}
```
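As a workaround I can skip the write for empty input, which also skips the failing Hive sync. This guard is my own sketch, not a Hudi API; `df`, `options`, and the target path are as in the snippets above:

```scala
// Workaround sketch: skip the Hudi write (and hence Hive sync) for empty input.
// The stack trace suggests the sync fails while reading the data schema from a
// commit that contains no data files.
if (df.isEmpty) {
  println("Skipping write: input DataFrame is empty")
} else {
  df.write.format("hudi")
    .options(options)
    .mode(SaveMode.Overwrite)
    .save("s3://my-datalake/my-table...")
}
```

`Dataset.isEmpty` is available since Spark 2.4, so it works on Spark 3.1.1.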
**Expected behavior**
The Hudi table is registered in the AWS Glue Catalog as an external table.
**Environment Description**
* Hudi version : 0.7
* Spark version : 3.1.1
* Hive version : AWS Glue Catalog
* Hadoop version : EMR 6.3.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Stacktrace**
```bash
21/09/07 09:48:47 WARN HiveSyncTool: Set partitionFields to empty, since the NonPartitionedExtractor is used
21/09/07 09:48:47 ERROR HiveSyncTool: Got runtime exception when hive syncing
org.apache.hudi.sync.common.HoodieSyncException: Failed to read data schema
    at org.apache.hudi.sync.common.AbstractSyncHoodieClient.getDataSchema(AbstractSyncHoodieClient.java:121)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:134)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
    at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:355)
    at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$4(HoodieSparkSqlWriter.scala:403)
    at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$4$adapted(HoodieSparkSqlWriter.scala:399)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:399)
    at org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:311)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:127)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
```