Aditya Goenka created HUDI-7306:
-----------------------------------

             Summary: CustomKeyGenerator with TIMESTAMP skipping the microseconds part
                 Key: HUDI-7306
                 URL: https://issues.apache.org/jira/browse/HUDI-7306
             Project: Apache Hudi
          Issue Type: Bug
          Components: writer-core
            Reporter: Aditya Goenka
             Fix For: 1.1.0


CustomKeyGenerator is not able to parse the microseconds part of a timestamp.
Below is code that reproduces the issue. The output shows
"2023-03-04 14:44:42.046000" instead of the input value
"2023-03-04 14:44:42.046661".
```
# Assumes `spark` is an existing SparkSession with the Hudi bundle on its
# classpath and `PATH` is the base path of the target table; the original
# report defines neither.
from faker import Faker
import pandas as pd

fake = Faker()
# Five fake records that all share one EventTime with microsecond precision.
data = [{"ID": fake.uuid4(), "EventTime": "2023-03-04 14:44:42.046661",
"FullName": fake.name(), "Address": fake.address(),
"CompanyName": fake.company(), "JobTitle": fake.job(),
"EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
"RandomText": fake.sentence(), "City": fake.city(),
"State": fake.state(), "Country": fake.country()} for _ in range(5)]
pandas_df = pd.DataFrame(data)

hoodie_properties = {
'hoodie.table.name': "pj_poc",
'hoodie.datasource.write.recordkey.field': 'ID',
# EventTime is partitioned through the TIMESTAMP key type of CustomKeyGenerator.
'hoodie.datasource.write.partitionpath.field':
'State:SIMPLE,Country:SIMPLE,EventTime:TIMESTAMP',
'hoodie.datasource.write.table.name': "pj_poc",
'hoodie.datasource.write.precombine.field': 'EventTime',
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
# Input and output formats both ask for six fractional-second digits.
'hoodie.keygen.timebased.input.dateformat': 'yyyy-MM-dd HH:mm:ss.SSSSSS',
'hoodie.keygen.timebased.output.dateformat': 'yyyy-MM-dd HH:mm:ss.SSSSSS',
'hoodie.keygen.timebased.timestamp.type': 'DATE_STRING',
'hoodie.keygen.timebased.timestamp.scalar.time.unit': 'MICROSECONDS',
'hoodie.parquet.outputtimestamptype': 'TIMESTAMP_MICROS',
}

spark.sparkContext.setLogLevel("WARN")

# Write the table, then read it back to compare the partition path with EventTime.
df = spark.createDataFrame(pandas_df)
df.write.format("hudi").options(**hoodie_properties).mode("overwrite").save(PATH)
(spark.read.options(**hoodie_properties).format("hudi").load(PATH)
    .select("_hoodie_partition_path", "EventTime")
    .show(10, False))
```
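The reported value is what you would get if the timestamp were cut down to millisecond precision somewhere between parsing and formatting. That is only an assumption about the cause; the report does not confirm where the digits are lost. The sketch below uses plain Python, not Hudi, and the helper name `truncate_to_millis` is hypothetical. It shows that dropping the sub-millisecond digits of the input reproduces the observed string exactly.

```python
from datetime import datetime

# Python equivalent of the 'yyyy-MM-dd HH:mm:ss.SSSSSS' pattern in the report.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def truncate_to_millis(ts: str) -> str:
    """Parse a microsecond timestamp, drop the sub-millisecond digits, re-format.

    Hypothetical helper: it imitates a millisecond-precision round trip and is
    not taken from the Hudi code base.
    """
    parsed = datetime.strptime(ts, FMT)
    millis_only = parsed.replace(microsecond=(parsed.microsecond // 1000) * 1000)
    return millis_only.strftime(FMT)


event_time = "2023-03-04 14:44:42.046661"   # value written in the reproduction
observed = truncate_to_millis(event_time)

print("expected:", event_time)
print("observed:", observed)                # 2023-03-04 14:44:42.046000
```

If this is what happens, the partition value keeps only the first three fractional digits (046) and pads the rest with zeros, so every record written within the same millisecond lands in the same partition.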



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
