ofinchuk-bloomberg opened a new issue, #10678:
URL: https://github.com/apache/hudi/issues/10678

   
Can't read a table that was created using TimestampBasedKeyGenerator or CustomKeyGenerator with a timestamp partition.
   The issue is that `ts` remains a Long type while `_hoodie_partition_path` is formed as a String, so a simple read of the table fails and throws an exception.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}

   object SparkDemo {

       def main(args: Array[String]): Unit = {

           val spark = SparkSession.builder()
               .master("local[1]")
               .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
               .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
               .appName("SparkByExample")
               .getOrCreate()

           import spark.implicits._

           // Write two rows, partitioning on the Long `ts` column, which
           // CustomKeyGenerator formats into a date-based partition path.
           spark.createDataset(List(
               ("id1", "name1", System.currentTimeMillis()),
               ("id2", "name2", System.currentTimeMillis() + 1)))
               .toDF("id", "name", "ts")
               .write
               .format("hudi")
               .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.CustomKeyGenerator")
               .option("hoodie.datasource.write.partitionpath.field", "ts:timestamp")
               .option("hoodie.datasource.write.recordkey.field", "id")
               .option("hoodie.datasource.write.precombine.field", "name")
               .option("hoodie.table.name", "hudi_cow2")
               .option("hoodie.keygen.timebased.timestamp.type", "EPOCHMILLISECONDS")
               .option("hoodie.keygen.timebased.output.dateformat", "yyyyMMdd-HH")
               .mode(SaveMode.Overwrite)
               .save("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2")

           // Reading the raw parquet files directly works fine.
           spark.read.parquet("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2/2*")
               .show()

           // Reading through the Hudi datasource throws the exception below.
           spark.read.format("hudi")
               .option("hoodie.schema.on.read.enable", "true")
               .load("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2/")
               .show()

       }
   }
   ```
When reading the parquet files directly, I see the following data:
   ```
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id| name|           ts|            date|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
   |  20240214184652987|20240214184652987...|               id1|           20240214-18|9d4eb7eb-847a-4e1...|id1|name1|1707954411089|2024-02-14 15:00|
   |  20240214184652987|20240214184652987...|               id2|           20240214-18|9d4eb7eb-847a-4e1...|id2|name2|1707954411090|2024-02-14 15:01|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
   ```
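
   The mismatch is visible above: `ts` holds a Long epoch-milliseconds value, while `_hoodie_partition_path` holds the same instant formatted as a String. Below is a minimal sketch (plain Scala, no Hudi dependency) of that formatting, assuming the key generator applies the configured `hoodie.keygen.timebased.output.dateformat` pattern via a SimpleDateFormat-style formatter; the explicit timezone is an assumption inferred from the output above.

   ```scala
   import java.text.SimpleDateFormat
   import java.util.{Date, TimeZone}

   object PartitionPathSketch extends App {
     // Epoch-millis value taken from the `ts` column in the output above (LongType).
     val ts = 1707954411089L

     // Assumption: the key generator formats the epoch value with the configured
     // output pattern, producing a String partition path.
     val fmt = new SimpleDateFormat("yyyyMMdd-HH")
     // Assumption: a UTC-5 local timezone, inferred from the "20240214-18" path above.
     fmt.setTimeZone(TimeZone.getTimeZone("America/New_York"))

     // Prints "20240214-18" -- a String, while the `ts` column stays LongType,
     // which is exactly the type disagreement the read trips over.
     println(fmt.format(new Date(ts)))
   }
   ```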
   
   
   
   **Expected behavior**
   
The table should be read successfully into a Spark DataFrame.
   
   **Environment Description**
   
I use Spark 3.3.3 and hudi-spark3.3-bundle_2.12:0.14.1 in a local environment.

   * Running on Docker? (yes/no): no
   
   
   **Stacktrace**
   
   ```
Exception in thread "main" java.lang.RuntimeException: Failed to cast value '20240214-18' to 'LongType' for partition column 'ts'
       at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.$anonfun$parsePartition$3(Spark3ParsePartitionUtil.scala:78)
       at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
       at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
       at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
       at scala.collection.TraversableLike.map(TraversableLike.scala:286)
       at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
       at scala.collection.AbstractTraversable.map(Traversable.scala:108)
       at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:71)
       at scala.Option.map(Option.scala:230)
       at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.parsePartition(Spark3ParsePartitionUtil.scala:69)
       at org.apache.hudi.HoodieSparkUtils$.parsePartitionPath(HoodieSparkUtils.scala:280)
       at org.apache.hudi.HoodieSparkUtils$.parsePartitionColumnValues(HoodieSparkUtils.scala:264)
       at org.apache.hudi.SparkHoodieTableFileIndex.doParsePartitionColumnValues(SparkHoodieTableFileIndex.scala:401)
       at org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:364)
       at org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$7(BaseHoodieTableFileIndex.java:333)
       at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
       at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
       at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
       at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
       at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
       at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
       at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
       at org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:336)
       at org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:216)
       at org.apache.hudi.SparkHoodieTableFileIndex.listMatchingPartitionPaths(SparkHoodieTableFileIndex.scala:219)
       at org.apache.hudi.HoodieFileIndex.getFileSlicesForPrunedPartitions(HoodieFileIndex.scala:282)
       at org.apache.hudi.HoodieFileIndex.filterFileSlices(HoodieFileIndex.scala:211)
       at org.apache.hudi.HoodieFileIndex.listFiles(HoodieFileIndex.scala:151)
   ```
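
   One possible workaround, as a sketch only (an assumption, not a confirmed fix): derive the formatted partition value into an explicit String column and partition on that with SimpleKeyGenerator, so the partition column's type matches the String parsed back from `_hoodie_partition_path`. The `part` column name, the table name, and the `/tmp` path below are hypothetical; the remaining options mirror the reproduction above.

   ```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}
   import org.apache.spark.sql.functions.{col, from_unixtime}

   object WorkaroundSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[1]")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate()
       import spark.implicits._

       val df = List(
           ("id1", "name1", System.currentTimeMillis()),
           ("id2", "name2", System.currentTimeMillis() + 1))
         .toDF("id", "name", "ts")

       // Derive the partition value as a String column up front ("part" is a
       // hypothetical name). from_unixtime expects seconds, hence the /1000.
       df.withColumn("part", from_unixtime(col("ts") / 1000, "yyyyMMdd-HH"))
         .write
         .format("hudi")
         .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.SimpleKeyGenerator")
         .option("hoodie.datasource.write.partitionpath.field", "part")
         .option("hoodie.datasource.write.recordkey.field", "id")
         .option("hoodie.datasource.write.precombine.field", "name")
         .option("hoodie.table.name", "hudi_cow2_workaround")
         .mode(SaveMode.Overwrite)
         .save("/tmp/hudi_cow2_workaround")
     }
   }
   ```

   With the partition column typed as String, the value parsed back from the path no longer needs a cast to Long, so the read that fails above should succeed; whether this fits the original table design is a separate question.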
   
   

