liangchen-datanerd opened a new issue, #11002:
URL: https://github.com/apache/hudi/issues/11002
**Problem**
The requirement is to extract a date value from the `event_time` column and use it as the partition path. According to the official Hudi docs, the ingestion config would look like this:
```
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
--hoodie-conf hoodie.keygen.timebased.timestamp.type="DATE_STRING" \
--hoodie-conf hoodie.keygen.timebased.input.dateformat="yyyy-MM-dd HH:mm:ss" \
--hoodie-conf hoodie.keygen.timebased.output.dateformat="yyyy-MM-dd" \
```
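For clarity, the input/output dateformat pair is meant to map each `event_time` to its partition value. A plain-Python sketch of that DATE_STRING conversion (using the `strptime`/`strftime` patterns equivalent to the Java date formats above — my translation, not Hudi code):

```python
from datetime import datetime

def expected_partition(event_time: str) -> str:
    """Mirror input.dateformat 'yyyy-MM-dd HH:mm:ss' -> output.dateformat 'yyyy-MM-dd'."""
    parsed = datetime.strptime(event_time, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%d")

print(expected_partition("2023-01-01 12:00:00"))  # 2023-01-01
```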
The partition value is computed correctly, but when I query the Hudi table, the partition source column contains the partition value instead of the original value. For example, when `event_time` is '2023-01-01 12:00:00', the partition value is 2023-01-01; querying the Hudi table then shows `event_time` as 2023-01-01, not the original value. However, when I read the underlying parquet file directly, `event_time` still has the original value.
**To Reproduce**
Steps to reproduce the behavior:
Using the pyspark shell:
```
pyspark \
  --master spark://node1:7077 \
  --packages 'org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk:1.11.469' \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar
```
```
# Create a DataFrame
data = [("James", "Sales", "2023-01-02 12:12:23"),
        ("Michael", "Sales", "2023-01-01 12:12:23"),
        ("Robert", "Sales", "2023-01-02 01:12:23"),
        ("Maria", "Finance", "2023-01-01 01:15:23")]
df = spark.createDataFrame(data, ["employee_name", "department", "time"])

# Define Hudi options
hudi_options = {
    "hoodie.table.name": "employee_hudi",
    "hoodie.datasource.write.operation": "insert_overwrite_table",
    "hoodie.datasource.write.recordkey.field": "employee_name",
    "hoodie.datasource.write.partitionpath.field": "time:TIMESTAMP",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
    "hoodie.keygen.timebased.timestamp.type": "DATE_STRING",
    "hoodie.keygen.timebased.input.dateformat": "yyyy-MM-dd HH:mm:ss",
    "hoodie.keygen.timebased.output.dateformat": "yyyy-MM-dd"
}

# Write DataFrame to Hudi
df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("overwrite") \
    .save("s3a://hudi-warehouse/test/")

# Query the Hudi table
spark.read.format("hudi") \
    .option("hoodie.schema.on.read.enable", "true") \
    .load("s3a://hudi-warehouse/test/") \
    .show(truncate=False)

# Read the parquet file directly
spark.read.format("parquet") \
    .load("s3a://hudi-warehouse/test/2023-01-01/ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet") \
    .show(truncate=False)
```
When I query the Hudi table, the result is:
```
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |employee_name|department|time      |
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
|20240411142532923  |20240411142532923_1_0|James             |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|James        |Sales     |2023-01-02|
|20240411142532923  |20240411142532923_1_1|Robert            |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|Robert       |Sales     |2023-01-02|
|20240411142532923  |20240411142532923_0_0|Michael           |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Michael      |Sales     |2023-01-01|
|20240411142532923  |20240411142532923_0_1|Maria             |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Maria        |Finance   |2023-01-01|
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
```
When I read the parquet file directly, the result is:
```
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+-------------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |employee_name|department|time               |
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+-------------------+
|20240411142532923  |20240411142532923_0_0|Michael           |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Michael      |Sales     |2023-01-01 12:12:23|
|20240411142532923  |20240411142532923_0_1|Maria             |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Maria        |Finance   |2023-01-01 01:15:23|
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+-------------------+
```
**Expected behavior**
Querying the Hudi table should return the original value of the partition source column (`time`), not the transformed partition value.
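As a possible workaround until this is resolved (an untested sketch, not from the Hudi docs): precompute the date into a dedicated partition column and partition on that instead, so the original `time` field is stored and read back untouched. The `event_date` column name and the switch to `SimpleKeyGenerator` are my assumptions; the derivation below is plain Python standing in for `df.withColumn(...)`:

```python
from datetime import datetime

# Derive a dedicated partition column instead of letting the key generator
# rewrite the source field. (Pure-Python stand-in for a df.withColumn call.)
data = [("James", "Sales", "2023-01-02 12:12:23"),
        ("Maria", "Finance", "2023-01-01 01:15:23")]
with_event_date = [
    (name, dept, t,
     datetime.strptime(t, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d"))
    for name, dept, t in data
]

# Partition on the derived column; "time" keeps its full value.
hudi_options = {
    "hoodie.table.name": "employee_hudi",
    "hoodie.datasource.write.operation": "insert_overwrite_table",
    "hoodie.datasource.write.recordkey.field": "employee_name",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
}

print(with_event_date[0])  # ('James', 'Sales', '2023-01-02 12:12:23', '2023-01-02')
```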
**Environment Description**
I'm using a local standalone Spark cluster.
* Hudi version : 0.14.1.
* Spark version : 3.4
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
**Additional context**
I searched and found similar reports; it seems the issue has not been resolved:
- [#10678](https://github.com/apache/hudi/issues/10678)
- [HUDI-3204](https://issues.apache.org/jira/browse/HUDI-3204)