ofinchuk-bloomberg opened a new issue, #10678:
URL: https://github.com/apache/hudi/issues/10678

   
Can't read a table that was created using TimestampBasedKeyGenerator or CustomKeyGenerator with a timestamp partition.
   The issue is that `ts` remains a Long type while `_hoodie_partition_path` is formed as a String, so a simple read of the table fails and throws an exception.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}

   object SparkDemo {

       def main(args: Array[String]): Unit = {

           val spark = SparkSession.builder()
               .master("local[1]")
               .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
               .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
               .appName("SparkByExample")
               .getOrCreate()

           import spark.implicits._

           // Write two rows, partitioning on the Long `ts` column, which
           // CustomKeyGenerator formats into a date-based partition path.
           spark.createDataset(List(
               ("id1", "name1", System.currentTimeMillis()),
               ("id2", "name2", System.currentTimeMillis() + 1)))
               .toDF("id", "name", "ts")
               .write
               .format("hudi")
               .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.CustomKeyGenerator")
               .option("hoodie.datasource.write.partitionpath.field", "ts:timestamp")
               .option("hoodie.datasource.write.recordkey.field", "id")
               .option("hoodie.datasource.write.precombine.field", "name")
               .option("hoodie.table.name", "hudi_cow2")
               .option("hoodie.keygen.timebased.timestamp.type", "EPOCHMILLISECONDS")
               .option("hoodie.keygen.timebased.output.dateformat", "yyyyMMdd-HH")
               .mode(SaveMode.Overwrite)
               .save("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2")

           // Reading the raw parquet files directly works fine.
           spark.read.parquet("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2/2*")
               .show()

           // Reading through the Hudi datasource throws the exception below.
           spark.read.format("hudi")
               .option("hoodie.schema.on.read.enable", "true")
               .load("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2/")
               .show()

       }
   }
   ```
When reading the parquet files directly, I see the following data:
   ```
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id| name|           ts|            date|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
   |  20240214184652987|20240214184652987...|               id1|           20240214-18|9d4eb7eb-847a-4e1...|id1|name1|1707954411089|2024-02-14 15:00|
   |  20240214184652987|20240214184652987...|               id2|           20240214-18|9d4eb7eb-847a-4e1...|id2|name2|1707954411090|2024-02-14 15:01|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
   ```
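
   The mismatch is visible above: `ts` holds a Long epoch-milliseconds value, while `_hoodie_partition_path` holds the same instant formatted as a String. Below is a minimal sketch (plain Scala, no Hudi dependency) of that formatting, assuming the key generator applies the configured `hoodie.keygen.timebased.output.dateformat` pattern via a SimpleDateFormat-style formatter; the explicit timezone is an assumption inferred from the output above.

   ```scala
   import java.text.SimpleDateFormat
   import java.util.{Date, TimeZone}

   object PartitionPathSketch extends App {
     // Epoch-millis value taken from the `ts` column in the output above (LongType).
     val ts = 1707954411089L

     // Assumption: the key generator formats the epoch value with the configured
     // output pattern, producing a String partition path.
     val fmt = new SimpleDateFormat("yyyyMMdd-HH")
     // Assumption: a UTC-5 local timezone, inferred from the "20240214-18" path above.
     fmt.setTimeZone(TimeZone.getTimeZone("America/New_York"))

     // Prints "20240214-18" -- a String, while the `ts` column stays LongType,
     // which is exactly the type disagreement the read trips over.
     println(fmt.format(new Date(ts)))
   }
   ```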
   
   
   
   **Expected behavior**
   
The table should be read successfully into a Spark DataFrame.
   
   **Environment Description**
   
I use Spark 3.3.3 and hudi-spark3.3-bundle_2.12:0.14.1 in a local environment.

   * Running on Docker? (yes/no): no
   
   
   **Stacktrace**
   
   ```
Exception in thread "main" java.lang.RuntimeException: Failed to cast value '20240214-18' to 'LongType' for partition column 'ts'
       at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.$anonfun$parsePartition$3(Spark3ParsePartitionUtil.scala:78)
       at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
       at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
       at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
       at scala.collection.TraversableLike.map(TraversableLike.scala:286)
       at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
       at scala.collection.AbstractTraversable.map(Traversable.scala:108)
       at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:71)
       at scala.Option.map(Option.scala:230)
       at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.parsePartition(Spark3ParsePartitionUtil.scala:69)
       at org.apache.hudi.HoodieSparkUtils$.parsePartitionPath(HoodieSparkUtils.scala:280)
       at org.apache.hudi.HoodieSparkUtils$.parsePartitionColumnValues(HoodieSparkUtils.scala:264)
       at org.apache.hudi.SparkHoodieTableFileIndex.doParsePartitionColumnValues(SparkHoodieTableFileIndex.scala:401)
       at org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:364)
       at org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$7(BaseHoodieTableFileIndex.java:333)
       at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
       at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
       at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
       at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
       at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
       at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
       at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
       at org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:336)
       at org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:216)
       at org.apache.hudi.SparkHoodieTableFileIndex.listMatchingPartitionPaths(SparkHoodieTableFileIndex.scala:219)
       at org.apache.hudi.HoodieFileIndex.getFileSlicesForPrunedPartitions(HoodieFileIndex.scala:282)
       at org.apache.hudi.HoodieFileIndex.filterFileSlices(HoodieFileIndex.scala:211)
       at org.apache.hudi.HoodieFileIndex.listFiles(HoodieFileIndex.scala:151)
   ```
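
   One possible workaround, as a sketch only (an assumption, not a confirmed fix): derive the formatted partition value into an explicit String column and partition on that with SimpleKeyGenerator, so the partition column's type matches the String parsed back from `_hoodie_partition_path`. The `part` column name, the table name, and the `/tmp` path below are hypothetical; the remaining options mirror the reproduction above.

   ```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}
   import org.apache.spark.sql.functions.{col, from_unixtime}

   object WorkaroundSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[1]")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate()
       import spark.implicits._

       val df = List(
           ("id1", "name1", System.currentTimeMillis()),
           ("id2", "name2", System.currentTimeMillis() + 1))
         .toDF("id", "name", "ts")

       // Derive the partition value as a String column up front ("part" is a
       // hypothetical name). from_unixtime expects seconds, hence the /1000.
       df.withColumn("part", from_unixtime(col("ts") / 1000, "yyyyMMdd-HH"))
         .write
         .format("hudi")
         .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.SimpleKeyGenerator")
         .option("hoodie.datasource.write.partitionpath.field", "part")
         .option("hoodie.datasource.write.recordkey.field", "id")
         .option("hoodie.datasource.write.precombine.field", "name")
         .option("hoodie.table.name", "hudi_cow2_workaround")
         .mode(SaveMode.Overwrite)
         .save("/tmp/hudi_cow2_workaround")
     }
   }
   ```

   With the partition column typed as String, the value parsed back from the path no longer needs a cast to Long, so the read that fails above should succeed; whether this fits the original table design is a separate question.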
   
   

