SamarthRaval opened a new issue, #11277:
URL: https://github.com/apache/hudi/issues/11277
**Describe the problem you faced**

I ran a bulk-insert operation for my initial data, which completed fine. For the incoming files I then ran an insert operation (the incoming data was missing a few columns and added a few new ones), but as per my understanding Hudi should have been able to handle that.
**To Reproduce**

Steps to reproduce the behavior:

1. Loaded the dataset with a bulk-insert operation.
2. Ran an insert operation, which failed with `Failed insert schema compatibility check`.
3. Running step 2 with bulk-insert instead ran fine and expanded the schema.
4. The configurations I used are below.
```java
ImmutableMap.Builder<String, String> hudiOptions =
    ImmutableMap.<String, String>builder()
        .put("hoodie.table.name", tableName)
        .put("hoodie.datasource.write.recordkey.field", "uniqueId")
        .put("hoodie.datasource.write.precombine.field", "version")
        .put("hoodie.datasource.write.table.type", HoodieTableType.COPY_ON_WRITE.name())
        .put("hoodie.datasource.write.operation", operation)
        .put("hoodie.combine.before.insert", "true")
        .put("hoodie.datasource.write.keygenerator.class", SimpleKeyGenerator.class.getName())
        .put("hoodie.bulkinsert.sort.mode", "GLOBAL_SORT")
        .put("hoodie.copyonwrite.record.size.estimate", "50")
        .put("hoodie.parquet.small.file.limit", "104857600")
        .put("hoodie.parquet.max.file.size", "125829120")
        .put("hoodie.write.set.null.for.missing.columns", "true")
        .put("hoodie.datasource.write.reconcile.schema", "true")
        .put("hoodie.datasource.write.partitionpath.field", PARTITION_COLUMN_NAME)
        .put("hoodie.datasource.hive_sync.partition_fields", PARTITION_COLUMN_NAME)
        .put("hoodie.datasource.hive_sync.enable", "true")
        .put("hoodie.datasource.write.hive_style_partitioning", "true")
        .put("hoodie.datasource.hive_sync.table", tableName)
        .put("hoodie.datasource.hive_sync.database", hudiDatabase)
        .put("hoodie.datasource.hive_sync.auto_create_database", "true")
        .put("hoodie.datasource.hive_sync.support_timestamp", "true")
        .put("hoodie.datasource.hive_sync.use_jdbc", "false")
        .put("hoodie.datasource.hive_sync.mode", "hms")
        .put("hoodie.datasource.hive_sync.partition_extractor_class", MultiPartKeysValueExtractor.class.getName())
        .put("hoodie.metadata.enable", "true")
        .put("hoodie.meta.sync.metadata_file_listing", "true")
        .put("hoodie.clean.automatic", "true")
        .put("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS")
        .put("hoodie.cleaner.commits.retained", "30")
        .put("hoodie.cleaner.parallelism", "1000")
        .put("hoodie.archive.merge.enable", "true")
        .put("hoodie.commits.archival.batch", "30")
        .put("hoodie.write.concurrency.mode", "OPTIMISTIC_CONCURRENCY_CONTROL")
        .put("hoodie.cleaner.policy.failed.writes", "LAZY")
        .put("hoodie.write.concurrency.early.conflict.detection.enable", "true")
        .put("hoodie.write.lock.provider", "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider")
        .put("hoodie.write.lock.dynamodb.table", hudiLockTable)
        .put("hoodie.write.lock.dynamodb.partition_key", warehouseTableName)
        .put("hoodie.write.lock.dynamodb.region", AWSUtils.getCurrentRegion().getName())
        .put("hoodie.write.lock.dynamodb.endpoint_url",
            String.format("dynamodb.%s.amazonaws.com", AWSUtils.getCurrentRegion().getName()))
        .put("hoodie.write.lock.dynamodb.billing_mode", "PAY_PER_REQUEST");

if (operation.equals("insert")) {
    hudiOptions.put("hoodie.datasource.write.insert.drop.duplicates", "true");
}
```
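As an illustration of the behavior the configs above request: `hoodie.write.set.null.for.missing.columns` is meant to pad columns absent from the incoming batch with nulls, while `hoodie.datasource.write.reconcile.schema` keeps newly added columns. A conceptual Python sketch of that reconciliation (hypothetical helper and column names, not Hudi code):

```python
# Conceptual sketch (not Hudi code): what "set null for missing columns"
# plus schema reconciliation is expected to do for an incoming record.

def reconcile_record(table_columns, record):
    """Pad an incoming record with nulls for table columns it is missing,
    while keeping any brand-new columns it introduces."""
    out = {col: record.get(col) for col in table_columns}  # missing -> None
    for col, val in record.items():
        if col not in out:  # new column added by the incoming data
            out[col] = val
    return out

# Hypothetical example: incoming data lacks "name"/"address", adds "email".
table_cols = ["uniqueId", "version", "name", "address"]
incoming = {"uniqueId": "1", "version": 2, "email": "[email protected]"}
print(reconcile_record(table_cols, incoming))
# -> {'uniqueId': '1', 'version': 2, 'name': None, 'address': None, 'email': '[email protected]'}
```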
**Expected behavior**

The insert operation should succeed with the schema evolved automatically: columns missing from the incoming data set to null (per `hoodie.write.set.null.for.missing.columns`) and new columns added to the table schema, just as happens when the same data is written with bulk-insert.
**Environment Description**
* Hudi version : 0.14.0
* Spark version : 3.4.1
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : s3
* Running on Docker? (yes/no) : No
**Stacktrace**
```
24/05/22 19:44:10 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.hudi.exception.HoodieInsertException: Failed insert schema compatibility check
	at org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:868)
	at org.apache.hudi.client.SparkRDDWriteClient.insert(SparkRDDWriteClient.java:165)
	at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:218)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.liftedTree1$1(HoodieSparkSqlWriter.scala:504)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:502)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:204)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:121)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:113)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:255)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:129)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:165)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:255)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:165)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:276)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:164)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:70)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:503)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:503)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:33)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:33)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:33)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:479)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:101)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:88)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:86)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:151)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:859)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:388)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:361)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:240)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1075)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1167)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1176)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hudi.exception.HoodieException: Failed to read schema/check compatibility for base path <S3 path>
	at org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:844)
	at org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:866)
	... 60 more
Caused by: org.apache.hudi.exception.SchemaCompatibilityException: Column dropping is not allowed
all schema comparisions
	at org.apache.hudi.avro.AvroSchemaUtils.checkSchemaCompatible(AvroSchemaUtils.java:373)
	at org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:842)
	... 61 more
```
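For context, the check that throws this exception (`AvroSchemaUtils.checkSchemaCompatible`) appears to treat table-schema columns that are absent from the incoming write schema as *dropped* columns and rejects the write. A simplified, hypothetical Python model of that rule (not Hudi's actual implementation, and field names are illustrative):

```python
# Simplified, hypothetical model of the rule behind
# "Column dropping is not allowed": if the table schema contains columns
# the incoming write schema lacks, the write is rejected.

def check_schema_compatible(table_fields, incoming_fields, allow_column_drop=False):
    dropped = set(table_fields) - set(incoming_fields)
    if dropped and not allow_column_drop:
        raise ValueError(f"Column dropping is not allowed: {sorted(dropped)}")

# Adding a column is fine:
check_schema_compatible(["uniqueId", "version"], ["uniqueId", "version", "email"])

# A column missing from the incoming data trips the check:
try:
    check_schema_compatible(["uniqueId", "version", "name"], ["uniqueId", "version"])
except ValueError as e:
    print(e)  # -> Column dropping is not allowed: ['name']
```

This matches the reported behavior: the insert with missing columns fails validation even though `hoodie.write.set.null.for.missing.columns` is set, while bulk-insert takes a path that does not enforce this check.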
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]