liangchen-datanerd opened a new issue, #11002:
URL: https://github.com/apache/hudi/issues/11002
**Problem**
The requirement is to extract a date value from the `event_time` column and use it as the partition path. According to the official Hudi docs, the ingestion config would look like this:
```
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
--hoodie-conf hoodie.keygen.timebased.timestamp.type="DATE_STRING" \
--hoodie-conf hoodie.keygen.timebased.input.dateformat="yyyy-MM-dd HH:mm:ss" \
--hoodie-conf hoodie.keygen.timebased.output.dateformat="yyyy-MM-dd" \
```
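For clarity, the input/output dateformat pair is meant to map each `event_time` to its partition value. A plain-Python sketch of that DATE_STRING conversion (using the `strptime`/`strftime` patterns equivalent to the Java date formats above — my translation, not Hudi code):

```python
from datetime import datetime

def expected_partition(event_time: str) -> str:
    """Mirror input.dateformat 'yyyy-MM-dd HH:mm:ss' -> output.dateformat 'yyyy-MM-dd'."""
    parsed = datetime.strptime(event_time, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%d")

print(expected_partition("2023-01-01 12:00:00"))  # 2023-01-01
```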
The partition value is computed correctly, but when I query the Hudi table, the partition source column contains the partition value instead of the original value. For example, when `event_time` is '2023-01-01 12:00:00', the partition value is 2023-01-01; querying the Hudi table then shows `event_time` as 2023-01-01, not the original value. However, when I read the underlying parquet file directly, `event_time` still has the original value.
**To Reproduce**
Steps to reproduce the behavior:
Using the pyspark shell:
```
pyspark \
  --master spark://node1:7077 \
  --packages 'org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk:1.11.469' \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar
```
```
# Create a DataFrame
data = [("James", "Sales", "2023-01-02 12:12:23"),
        ("Michael", "Sales", "2023-01-01 12:12:23"),
        ("Robert", "Sales", "2023-01-02 01:12:23"),
        ("Maria", "Finance", "2023-01-01 01:15:23")]
df = spark.createDataFrame(data, ["employee_name", "department", "time"])

# Define Hudi options
hudi_options = {
    "hoodie.table.name": "employee_hudi",
    "hoodie.datasource.write.operation": "insert_overwrite_table",
    "hoodie.datasource.write.recordkey.field": "employee_name",
    "hoodie.datasource.write.partitionpath.field": "time:TIMESTAMP",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
    "hoodie.keygen.timebased.timestamp.type": "DATE_STRING",
    "hoodie.keygen.timebased.input.dateformat": "yyyy-MM-dd HH:mm:ss",
    "hoodie.keygen.timebased.output.dateformat": "yyyy-MM-dd"
}

# Write DataFrame to Hudi
df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("overwrite") \
    .save("s3a://hudi-warehouse/test/")

# Query the Hudi table
spark.read.format("hudi") \
    .option("hoodie.schema.on.read.enable", "true") \
    .load("s3a://hudi-warehouse/test/") \
    .show(truncate=False)

# Read the parquet file directly
spark.read.format("parquet") \
    .load("s3a://hudi-warehouse/test/2023-01-01/ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet") \
    .show(truncate=False)
```
When I query the Hudi table, the result is:
```
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |employee_name|department|time      |
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
|20240411142532923  |20240411142532923_1_0|James             |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|James        |Sales     |2023-01-02|
|20240411142532923  |20240411142532923_1_1|Robert            |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|Robert       |Sales     |2023-01-02|
|20240411142532923  |20240411142532923_0_0|Michael           |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Michael      |Sales     |2023-01-01|
|20240411142532923  |20240411142532923_0_1|Maria             |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Maria        |Finance   |2023-01-01|
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
```
When I read the parquet file directly, the result is:
```
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+-------------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |employee_name|department|time               |
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+-------------------+
|20240411142532923  |20240411142532923_0_0|Michael           |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Michael      |Sales     |2023-01-01 12:12:23|
|20240411142532923  |20240411142532923_0_1|Maria             |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Maria        |Finance   |2023-01-01 01:15:23|
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+-------------------+
```
**Expected behavior**
Querying the Hudi table should return the original value of the partition source column (`time`), not the transformed partition value.
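As a possible workaround until this is resolved (an untested sketch, not from the Hudi docs): precompute the date into a dedicated partition column and partition on that instead, so the original `time` field is stored and read back untouched. The `event_date` column name and the switch to `SimpleKeyGenerator` are my assumptions; the derivation below is plain Python standing in for `df.withColumn(...)`:

```python
from datetime import datetime

# Derive a dedicated partition column instead of letting the key generator
# rewrite the source field. (Pure-Python stand-in for a df.withColumn call.)
data = [("James", "Sales", "2023-01-02 12:12:23"),
        ("Maria", "Finance", "2023-01-01 01:15:23")]
with_event_date = [
    (name, dept, t,
     datetime.strptime(t, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d"))
    for name, dept, t in data
]

# Partition on the derived column; "time" keeps its full value.
hudi_options = {
    "hoodie.table.name": "employee_hudi",
    "hoodie.datasource.write.operation": "insert_overwrite_table",
    "hoodie.datasource.write.recordkey.field": "employee_name",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
}

print(with_event_date[0])  # ('James', 'Sales', '2023-01-02 12:12:23', '2023-01-02')
```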
**Environment Description**
I'm using a local standalone Spark cluster.
* Hudi version : 0.14.1.
* Spark version : 3.4
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
**Additional context**
I searched and found similar reports; it seems the issue has not been resolved:
- [#10678](https://github.com/apache/hudi/issues/10678)
- [HUDI-3204](https://issues.apache.org/jira/browse/HUDI-3204)