Aditya Goenka created HUDI-7306:
-----------------------------------
Summary: CustomKeyGenerator with TIMESTAMP skipping the microseconds part
Key: HUDI-7306
URL: https://issues.apache.org/jira/browse/HUDI-7306
Project: Apache Hudi
Issue Type: Bug
Components: writer-core
Reporter: Aditya Goenka
Fix For: 1.1.0
CustomKeyGenerator is not able to parse the microseconds part of a TIMESTAMP
partition field. Below is code that reproduces the issue. The input value is
"2023-03-04 14:44:42.046661", but the output shows "2023-03-04 14:44:42.046000".
```
import pandas as pd
from faker import Faker

# Assumes an active SparkSession `spark` with the Hudi bundle available,
# and PATH set to the target table location.
fake = Faker()

# Five fake records that all share one EventTime with microsecond precision.
data = [{"ID": fake.uuid4(), "EventTime": "2023-03-04 14:44:42.046661",
         "FullName": fake.name(), "Address": fake.address(),
         "CompanyName": fake.company(), "JobTitle": fake.job(),
         "EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
         "RandomText": fake.sentence(), "City": fake.city(),
         "State": fake.state(), "Country": fake.country()} for _ in range(5)]
pandas_df = pd.DataFrame(data)

hoodie_properties = {
    'hoodie.table.name': "pj_poc",
    'hoodie.datasource.write.recordkey.field': 'ID',
    # EventTime is the TIMESTAMP partition field handled by CustomKeyGenerator.
    'hoodie.datasource.write.partitionpath.field':
        'State:SIMPLE,Country:SIMPLE,EventTime:TIMESTAMP',
    'hoodie.datasource.write.table.name': "pj_poc",
    'hoodie.datasource.write.precombine.field': 'EventTime',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    # Input and output formats both ask for six fractional digits.
    'hoodie.keygen.timebased.input.dateformat': 'yyyy-MM-dd HH:mm:ss.SSSSSS',
    'hoodie.keygen.timebased.output.dateformat': 'yyyy-MM-dd HH:mm:ss.SSSSSS',
    'hoodie.keygen.timebased.timestamp.type': 'DATE_STRING',
    'hoodie.keygen.timebased.timestamp.scalar.time.unit': 'MICROSECONDS',
    'hoodie.parquet.outputtimestamptype': 'TIMESTAMP_MICROS',
}

spark.sparkContext.setLogLevel("WARN")
df = spark.createDataFrame(pandas_df)
df.write.format("hudi").options(**hoodie_properties).mode("overwrite").save(PATH)

# Compare the generated partition path with the original EventTime value.
spark.read.options(**hoodie_properties).format("hudi").load(PATH) \
    .select("_hoodie_partition_path", "EventTime").show(10, False)
```
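The observed value is what one would get if the timestamp passed through a millisecond-precision representation somewhere between parsing and formatting. That is an assumption about the cause, not something confirmed in this report. A minimal pure-Python sketch of the effect follows; the helper `truncate_to_millis` is hypothetical and is not part of Hudi.

```python
from datetime import datetime

# Python equivalent of the 'yyyy-MM-dd HH:mm:ss.SSSSSS' pattern used above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def truncate_to_millis(value: str) -> str:
    """Round-trip a timestamp string while keeping only whole milliseconds.

    Hypothetical helper: it mimics what a millisecond-precision type would
    do to the value. It does not call any Hudi code.
    """
    parsed = datetime.strptime(value, FMT)
    # Drop the sub-millisecond digits (661 microseconds in the example).
    millis_only = parsed.replace(microsecond=(parsed.microsecond // 1000) * 1000)
    return millis_only.strftime(FMT)


event_time = "2023-03-04 14:44:42.046661"
print(truncate_to_millis(event_time))  # 2023-03-04 14:44:42.046000
```

The printed value matches the string reported above, which is why millisecond truncation is a reasonable first place to look.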