[https://issues.apache.org/jira/browse/HUDI-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821639#comment-17821639]
Aditya Goenka edited comment on HUDI-4818 at 2/28/24 12:32 PM:
---------------------------------------------------------------
This is still open -
One more GitHub issue reported -
[https://github.com/apache/hudi/issues/10678]
Reproducible code -
```
import time

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Assumes an active SparkSession `spark` with the Hudi bundle on the classpath,
# and PATH set to the table base path.
data = [("id1", "name1", time.time_ns()),
        ("id2", "name2", time.time_ns() + 1)]
schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("ts", LongType(), True)
])
df = spark.createDataFrame(data, schema=schema)
(df.write.format("hudi")
    .option("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.keygen.CustomKeyGenerator")
    .option("hoodie.datasource.write.partitionpath.field", "ts:timestamp")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "name")
    .option("hoodie.table.name", "hudi_cow2")
    .option("hoodie.keygen.timebased.timestamp.type", "EPOCHMILLISECONDS")
    .option("hoodie.keygen.timebased.output.dateformat", "yyyyMMdd-HH")
    .mode("overwrite")
    .save(PATH))

# Reading the raw parquet files directly succeeds...
df_read_parquet = spark.read.parquet(PATH + "/5*")
df_read_parquet.show()

# ...but reading through the Hudi file index fails with the cast error.
df_read_hudi = (spark.read.format("hudi")
    .option("hoodie.schema.on.read.enable", "true")
    .load(PATH))
df_read_hudi.show()
```
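The failure mode can be shown outside Hudi with a minimal sketch (plain Python with a hypothetical example value, not Hudi internals): the key generator formats the long epoch value into a string partition path, while the file index later tries to cast that string back to the source column's LongType.

```python
# Minimal illustration of the type mismatch (not Hudi code): the writer
# formats the long `ts` value into a string partition value, but the reader
# later tries to cast that string back to LongType.
from datetime import datetime, timezone

ts_ms = 1652227200000  # hypothetical source value: epoch milliseconds (LongType)

# Writer side: "yyyyMMdd-HH" formatting produces a string partition value.
partition_value = datetime.fromtimestamp(
    ts_ms / 1000, tz=timezone.utc).strftime("%Y%m%d-%H")
print(partition_value)  # 20220511-00

# Reader side: the file index assumes the partition column is still LongType,
# so casting the formatted string fails.
try:
    int(partition_value)
except ValueError as exc:
    print("cast failed:", exc)
```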
was (Author: JIRAUSER299651):
This is still open -
One more GitHub issue reported with reproducible code -
[https://github.com/apache/hudi/issues/10678]
> Using CustomKeyGenerator fails w/ SparkHoodieTableFileIndex
> -----------------------------------------------------------
>
> Key: HUDI-4818
> URL: https://issues.apache.org/jira/browse/HUDI-4818
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.1.0
>
>
> Currently, using `CustomKeyGenerator` with the partition-path config
> `hoodie.datasource.write.partitionpath.field=ts:timestamp` fails with:
> {code:java}
> Caused by: java.lang.RuntimeException: Failed to cast value `2022-05-11` to `LongType` for partition column `ts_ms`
>   at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:72)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$1(Spark3ParsePartitionUtil.scala:65)
>   at scala.Option.map(Option.scala:230)
>   at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:63)
>   at org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionPath(SparkHoodieTableFileIndex.scala:274)
>   at org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionColumnValues(SparkHoodieTableFileIndex.scala:258)
>   at org.apache.hudi.BaseHoodieTableFileIndex.lambda$getAllQueryPartitionPaths$3(BaseHoodieTableFileIndex.java:190)
>   at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>   at org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:193)
> {code}
>
> This occurs because SparkHoodieTableFileIndex produces an incorrect partition
> schema at XXX, where it properly handles only `TimestampBasedKeyGenerator`
> but not other key generators that may change the data type of the partition
> value relative to the source partition column (in this case, `ts` is a long
> in the source table schema, but the partition value is produced as a string).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)