harichandanp opened a new issue, #11140: URL: https://github.com/apache/hudi/issues/11140
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

I am getting a class cast exception (`java.lang.ClassCastException: class java.lang.Long cannot be cast to class org.apache.spark.unsafe.types.UTF8String`) when I try to read a Hudi dataset with a timestamp partition field.

**To Reproduce**

Steps to reproduce the behavior:

I generated dummy data as follows:

```
case class A(id: Long, ts: Long)

scala> List(A(1L, 1713033644000L), A(2L, 1713033887000L), A(3L, 1713033890000L)).toDF
         .write
         .format("hudi")
         .option("hoodie.table.name", "dummy3")
         .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator")
         .option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyyMMddHH")
         .option("hoodie.keygen.timebased.timestamp.type", "EPOCHMILLISECONDS")
         .option("hoodie.datasource.write.partitionpath.field", "ts")
         .option("hoodie.datasource.write.precombine.field", "ts")
         .option("hoodie.datasource.write.recordkey.field", "id")
         .mode("overwrite")
         .save("/tmp/dummy3")
```
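For reference, here is a small sketch (my own, not from the original report) of how the `yyyyMMddHH` output format configured above buckets these epoch-millisecond values into a single hourly partition. The time zone is an assumption: the key generator's output zone is configurable, and the partition value shown later (`2024041311`) implies the writer's local zone was used.

```
// Hypothetical sketch: format the three epoch-millis ts values with the same
// yyyyMMddHH pattern the key generator was configured with. ZoneId.systemDefault()
// is an assumption about the generator's effective output time zone.
import java.time.Instant
import java.time.ZoneId
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyyMMddHH").withZone(ZoneId.systemDefault())
Seq(1713033644000L, 1713033887000L, 1713033890000L)
  .map(ms => fmt.format(Instant.ofEpochMilli(ms)))
  .foreach(println) // all three timestamps fall into one hourly partition bucket
```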
When I try to read this data using Hudi 0.13.0+ (0.13.0 - 0.14.1), I get the following error:

```
$ /usr/local/spark-3.3.0-bin-hadoop3/bin/spark-shell --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

scala> spark.read.format("hudi").load("/tmp/dummy3").show
24/05/02 10:54:51 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
24/05/02 10:54:51 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
24/05/02 10:54:56 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.ClassCastException: class java.lang.Long cannot be cast to class org.apache.spark.unsafe.types.UTF8String (java.lang.Long is in module java.base of loader 'bootstrap'; org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app')
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String$(rows.scala:46)
	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
	at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:72)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:269)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:280)
	at org.apache.spark.sql.execution.datasources.parquet.Spark33LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark33LegacyHoodieParquetFileFormat.scala:316)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:553)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
```

The schema of the data is:

```
scala> spark.read.format("hudi").load("/tmp/dummy3").printSchema
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- id: long (nullable = false)
 |-- ts: string (nullable = true)
```
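Note that the failure surfaces inside Spark's vectorized Parquet reader (`VectorizedParquetRecordReader.initBatch`, via `ColumnVectorUtils.populate`). A minimal diagnostic sketch follows, assuming the mismatch is specific to that code path; this is an experiment for triage, not a confirmed workaround.

```
// Diagnostic only: disable Spark's vectorized Parquet reader and retry the read.
// If the ClassCastException disappears, the Long-vs-UTF8String mismatch is being
// hit while partition-column values are populated into the column batch.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.read.format("hudi").load("/tmp/dummy3").show()
```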
The field is stored as long in the parquet file:

```
scala> spark.read.format("parquet").load("/tmp/dummy3/2024041311").printSchema
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- id: long (nullable = true)
 |-- ts: long (nullable = true)
```
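To make the divergence explicit, here is a compact check (my own sketch, not part of the original report) comparing the type the Hudi relation reports for `ts` against the type physically stored in the Parquet file. Per the report, reading the schemas succeeds on 0.13.0+; only materializing rows fails.

```
// Sketch: the Hudi table schema reports `ts` as string, while the underlying
// Parquet file stores it as long; that string-vs-long boundary is where the
// ClassCastException occurs when rows are actually materialized.
val hudiType    = spark.read.format("hudi").load("/tmp/dummy3").schema("ts").dataType
val parquetType = spark.read.format("parquet").load("/tmp/dummy3/2024041311").schema("ts").dataType
println(s"hudi: $hudiType, parquet: $parquetType") // hudi: StringType, parquet: LongType
```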
**Expected behavior**

I am able to read the data when using Hudi 0.12.3 or below (tested down to 0.9.0):

```
$ /usr/local/spark-3.3.0-bin-hadoop3/bin/spark-shell --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.3 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

scala> spark.read.format("hudi").load("/tmp/dummy3").show
24/05/02 10:55:26 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
24/05/02 10:55:26 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
+-------------------+--------------------+------------------+----------------------+--------------------+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|        ts|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----------+
|  20240502105345612|20240502105345612...|                 2|            2024041311|cfcdc5ab-77da-4eb...|  2|2024041311|
|  20240502105345612|20240502105345612...|                 3|            2024041311|cfcdc5ab-77da-4eb...|  3|2024041311|
|  20240502105345612|20240502105345612...|                 1|            2024041311|cfcdc5ab-77da-4eb...|  1|2024041311|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----------+
```

**Environment Description**

* Hudi version : 0.13.0 and later (0.13.0, 0.13.1, 0.14.0, 0.14.1)
* Spark version : 3.3.0 and 3.4.1
* Hive version : -
* Hadoop version : -
* Storage (HDFS/S3/GCS..) : tested with local disk and S3
* Running on Docker? (yes/no) : no

**Additional context**

-

**Stacktrace**

```
Caused by: java.lang.ClassCastException: class java.lang.Long cannot be cast to class org.apache.spark.unsafe.types.UTF8String (java.lang.Long is in module java.base of loader 'bootstrap'; org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app')
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String$(rows.scala:46)
	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
	at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:72)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:269)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:280)
	at org.apache.spark.sql.execution.datasources.parquet.Spark33LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark33LegacyHoodieParquetFileFormat.scala:316)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:553)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
```
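One further triage experiment (a sketch under my own assumptions, not a verified fix): Hudi exposes the read option `hoodie.datasource.read.extract.partition.values.from.path`, which controls whether the partition column is materialized from the partition path rather than read from the data file. Since that is exactly the string-vs-long boundary the exception sits on, toggling it may change the behavior.

```
// Hedged experiment: change where the partition column's value comes from.
// With this option set, Hudi extracts partition values from the path string
// instead of reading the long column out of the Parquet data file.
spark.read.format("hudi")
  .option("hoodie.datasource.read.extract.partition.values.from.path", "true")
  .load("/tmp/dummy3")
  .show()
```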
