[
https://issues.apache.org/jira/browse/HUDI-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Guo updated HUDI-7485:
----------------------------
Fix Version/s: 0.16.0
> Can't read the Hudi Table if using TimestampBasedKeyGenerator to write table
> ----------------------------------------------------------------------------
>
> Key: HUDI-7485
> URL: https://issues.apache.org/jira/browse/HUDI-7485
> Project: Apache Hudi
> Issue Type: Bug
> Components: reader-core
> Reporter: Aditya Goenka
> Priority: Critical
> Fix For: 0.15.0, 0.16.0
>
>
> Reading a Hudi table written with the TimestampBasedKeyGenerator and the date
> format 'yyyy-MM-dd' fails with the following exception:
> ```
> Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
>     at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)
>     at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String$(rows.scala:46)
>     at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
>     at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:72)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:245)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:264)
>     at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:314)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
> ```
>
> Reproduction code:
> ```
> # Assumes a running SparkSession with the Hudi bundle on the classpath,
> # and that `tableName` and `basePath` are already defined.
> from pyspark.sql.functions import expr
>
> columns = ["ts", "uuid", "rider", "driver", "fare", "dt"]
> data = [
>     (1695159649087, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "2012-01-01"),
>     (1695091554788, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-B", "driver-L", 27.70, "2012-01-01"),
>     (1695046462179, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-C", "driver-M", 33.90, "2012-01-01"),
>     (1695516137016, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-C", "driver-N", 34.15, "2012-01-01"),
> ]
> inserts = spark.createDataFrame(data).toDF(*columns)
>
> hudi_options = {
>     'hoodie.table.name': tableName,
>     'hoodie.datasource.write.recordkey.field': 'uuid',
>     'hoodie.datasource.write.precombine.field': 'ts',
>     'hoodie.datasource.write.partitionpath.field': 'dt',
>     'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': 'true',
>     'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.TimestampBasedKeyGenerator',
>     'hoodie.keygen.timebased.timestamp.type': 'SCALAR',
>     'hoodie.keygen.timebased.timestamp.scalar.time.unit': 'DAYS',
>     'hoodie.keygen.timebased.input.dateformat': 'yyyy-MM-dd',
>     'hoodie.keygen.timebased.output.dateformat': 'yyyy-MM-dd',
>     'hoodie.keygen.timebased.timezone': 'GMT+8:00',
>     'hoodie.datasource.write.hive_style_partitioning': 'true',
> }
>
> # Insert data, casting the partition column to a date first
> inserts.withColumn("dt", expr("CAST(dt as date)")) \
>     .write.format("hudi") \
>     .options(**hudi_options) \
>     .mode("overwrite") \
>     .save(basePath)
>
> # Reading the table back raises the ClassCastException above
> deleteDF = spark.read.format("hudi").load(basePath)
> deleteDF.show()
> ```
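A minimal, Hudi-free sketch of the type mismatch behind the trace: Spark represents a `DateType` value internally as an integer count of days since the Unix epoch, while the hive-style partition path carries the formatted string (`dt=2012-01-01`). The ClassCastException suggests an integer date value is being supplied where the reader expects a `UTF8String` partition value. The helper below is illustrative only and not part of Hudi or Spark:

```python
from datetime import date

def days_since_epoch(d: date) -> int:
    """Spark's internal representation of a DateType value (epoch days)."""
    return (d - date(1970, 1, 1)).days

# The partition path stores the formatted string...
path_value = "2012-01-01"

# ...while a DateType column holds the epoch-day integer, so the two
# representations of the same partition value have incompatible types.
internal_value = days_since_epoch(date.fromisoformat(path_value))
print(internal_value)  # 15340
```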
--
This message was sent by Atlassian Jira
(v8.20.10#820010)