jamb2024 opened a new issue, #11144:
URL: https://github.com/apache/hudi/issues/11144
Hi,
I am developing a process to ingest data from HDFS using Hudi. I want to partition the data using a custom key generator class, where the partition path field is a tuple of the form columnName@numPartitions; my custom key generator then applies a modulo function to route each row to one of the partitions.
The initial load is the following:
spark.read.option("mergeSchema", "true").parquet("PATH")
  .withColumn("_hoodie_is_deleted", lit(false))
  .write.format("hudi")
  .option(OPERATION_OPT_KEY, "upsert")
  .option(CDC_ENABLED.key(), "true")
  .option(TABLE_NAME, tableName)
  .option("hoodie.datasource.write.payload.class", "CustomOverwriteWithLatestAvroPayload")
  .option("hoodie.avro.schema.validate", "false")
  .option("hoodie.datasource.write.recordkey.field", "CID")
  .option("hoodie.datasource.write.precombine.field", "sequential_total")
  .option("hoodie.datasource.write.new.columns.nullable", "true")
  .option("hoodie.datasource.write.reconcile.schema", "true")
  .option("hoodie.metadata.enable", "false")
  .option("hoodie.index.type", "SIMPLE")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.keygenerator.class", "CustomKeyGenerator")
  .option("hoodie.datasource.write.partitionpath.field", "CID@12")
  .option("hoodie.datasource.write.drop.partition.columns", "true")
  .mode(Overwrite)
  .save("/tmp/hudi2")
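To make the intended routing concrete, here is a minimal, self-contained sketch of the "columnName@numPartitions" parsing and the modulo bucketing (names are hypothetical; the real key generator would extend org.apache.hudi.keygen.KeyGenerator and return a HoodieKey per record):

```scala
// Hypothetical helper showing the bucketing logic only; the real Hudi
// KeyGenerator subclass would wrap this and build HoodieKey objects.
object BucketPartitioner {
  /** Parse a "column@N" spec, e.g. "CID@12" -> ("CID", 12). */
  def parseSpec(spec: String): (String, Int) = {
    val Array(col, n) = spec.split("@")
    (col, n.toInt)
  }

  /** Route a key value to one of N partitions via a stable, non-negative modulo hash. */
  def bucketFor(value: String, numPartitions: Int): String =
    Math.floorMod(value.hashCode, numPartitions).toString
}
```

With this logic, a row whose CID hashes to bucket k lands under partition path "k", so the table always has at most numPartitions partitions regardless of key cardinality.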
I added the property hoodie.datasource.write.drop.partition.columns because when I read the final path, Hudi throws the error: Cannot find columns: 'CID@12' in the schema.
But with this property set, it does not work either. The error that appears is the following:
org.apache.hudi.internal.schema.HoodieSchemaException: Failed to fetch schema from the table
  at org.apache.hudi.HoodieBaseRelation.$anonfun$x$2$10(HoodieBaseRelation.scala:179)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.hudi.HoodieBaseRelation.x$2$lzycompute(HoodieBaseRelation.scala:175)
  at org.apache.hudi.HoodieBaseRelation.x$2(HoodieBaseRelation.scala:151)
  at org.apache.hudi.HoodieBaseRelation.internalSchemaOpt$lzycompute(HoodieBaseRelation.scala:151)
  at org.apache.hudi.HoodieBaseRelation.internalSchemaOpt(HoodieBaseRelation.scala:151)
  at org.apache.hudi.BaseFileOnlyRelation.<init>(BaseFileOnlyRelation.scala:69)
  at org.apache.hudi.DefaultSource$.resolveBaseFileOnlyRelation(DefaultSource.scala:321)
  at org.apache.hudi.DefaultSource$.createRelation(DefaultSource.scala:262)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:118)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:74)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
  ... 63 elided