[
https://issues.apache.org/jira/browse/HUDI-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Guo updated HUDI-7485:
----------------------------
Fix Version/s: 0.16.0
> Can't read the Hudi Table if using TimestampBasedKeyGenerator to write table
> ----------------------------------------------------------------------------
>
> Key: HUDI-7485
> URL: https://issues.apache.org/jira/browse/HUDI-7485
> Project: Apache Hudi
> Issue Type: Bug
> Components: reader-core
> Reporter: Aditya Goenka
> Priority: Critical
> Fix For: 0.15.0, 0.16.0
>
>
> Reading a Hudi table written with the TimestampBasedKeyGenerator and the date
> format 'yyyy-MM-dd' fails with the following exception:
> ```
> Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
>     at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)
>     at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String$(rows.scala:46)
>     at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
>     at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:72)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:245)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:264)
>     at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:314)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
> ```
>
> Reproduction code:
> ```
> # Assumes a running SparkSession with the Hudi bundle on the classpath,
> # and that `tableName` and `basePath` are already defined.
> from pyspark.sql.functions import expr
>
> columns = ["ts", "uuid", "rider", "driver", "fare", "dt"]
> data = [
>     (1695159649087, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "2012-01-01"),
>     (1695091554788, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-B", "driver-L", 27.70, "2012-01-01"),
>     (1695046462179, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-C", "driver-M", 33.90, "2012-01-01"),
>     (1695516137016, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-C", "driver-N", 34.15, "2012-01-01"),
> ]
> inserts = spark.createDataFrame(data).toDF(*columns)
>
> hudi_options = {
>     'hoodie.table.name': tableName,
>     'hoodie.datasource.write.recordkey.field': 'uuid',
>     'hoodie.datasource.write.precombine.field': 'ts',
>     'hoodie.datasource.write.partitionpath.field': 'dt',
>     'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': 'true',
>     'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.TimestampBasedKeyGenerator',
>     'hoodie.keygen.timebased.timestamp.type': 'SCALAR',
>     'hoodie.keygen.timebased.timestamp.scalar.time.unit': 'DAYS',
>     'hoodie.keygen.timebased.input.dateformat': 'yyyy-MM-dd',
>     'hoodie.keygen.timebased.output.dateformat': 'yyyy-MM-dd',
>     'hoodie.keygen.timebased.timezone': 'GMT+8:00',
>     'hoodie.datasource.write.hive_style_partitioning': 'true',
> }
>
> # Insert data, casting the partition column to a date first
> inserts.withColumn("dt", expr("CAST(dt as date)")) \
>     .write.format("hudi") \
>     .options(**hudi_options) \
>     .mode("overwrite") \
>     .save(basePath)
>
> # Reading the table back raises the ClassCastException above
> deleteDF = spark.read.format("hudi").load(basePath)
> deleteDF.show()
> ```
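A minimal, Hudi-free sketch of the type mismatch behind the trace: Spark represents a `DateType` value internally as an integer count of days since the Unix epoch, while the hive-style partition path carries the formatted string (`dt=2012-01-01`). The ClassCastException suggests an integer date value is being supplied where the reader expects a `UTF8String` partition value. The helper below is illustrative only and not part of Hudi or Spark:

```python
from datetime import date

def days_since_epoch(d: date) -> int:
    """Spark's internal representation of a DateType value (epoch days)."""
    return (d - date(1970, 1, 1)).days

# The partition path stores the formatted string...
path_value = "2012-01-01"

# ...while a DateType column holds the epoch-day integer, so the two
# representations of the same partition value have incompatible types.
internal_value = days_since_epoch(date.fromisoformat(path_value))
print(internal_value)  # 15340
```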
--
This message was sent by Atlassian Jira
(v8.20.10#820010)