[ 
https://issues.apache.org/jira/browse/HUDI-7306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Y Ethan Guo updated HUDI-7306:
------------------------------
        Parent: HUDI-8724
    Issue Type: Sub-task  (was: Bug)

> CustomKeyGenerator with TIMESTAMP skipping the microseconds part
> ----------------------------------------------------------------
>
>                 Key: HUDI-7306
>                 URL: https://issues.apache.org/jira/browse/HUDI-7306
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: writer-core
>            Reporter: Aditya Goenka
>            Priority: Critical
>             Fix For: 1.0.1
>
>
> CustomKeyGenerator does not parse the microseconds part of a timestamp. The 
> code below reproduces the issue. For the input "2023-03-04 14:44:42.046661", 
> the output shows "2023-03-04 14:44:42.046000".
> ```
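> from faker import Faker
> import pandas as pd
>
> # Assumes an active SparkSession `spark` and a target table path `PATH`.
> fake = Faker()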
> data = [{"ID": fake.uuid4(), "EventTime": "2023-03-04 14:44:42.046661",
> "FullName": fake.name(), "Address": fake.address(),
> "CompanyName": fake.company(), "JobTitle": fake.job(),
> "EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
> "RandomText": fake.sentence(), "City": fake.city(),
> "State": fake.state(), "Country": fake.country()} for _ in range(5)]
> pandas_df = pd.DataFrame(data)
> hoodie_properties = {
> 'hoodie.table.name': "pj_poc",
> 'hoodie.datasource.write.recordkey.field': 'ID',
> 'hoodie.datasource.write.partitionpath.field': 
> 'State:SIMPLE,Country:SIMPLE,EventTime:TIMESTAMP',
> 'hoodie.datasource.write.table.name': "pj_poc",
> 'hoodie.datasource.write.precombine.field': 'EventTime',
> 'hoodie.datasource.write.hive_style_partitioning':'true',
> 'hoodie.datasource.write.keygenerator.class':'org.apache.hudi.keygen.CustomKeyGenerator',
> 'hoodie.keygen.timebased.input.dateformat':'yyyy-MM-dd HH:mm:ss.SSSSSS',
> 'hoodie.keygen.timebased.output.dateformat':'yyyy-MM-dd HH:mm:ss.SSSSSS',
> 'hoodie.keygen.timebased.timestamp.type' : 'DATE_STRING',
> 'hoodie.keygen.timebased.timestamp.scalar.time.unit': 'MICROSECONDS',
> 'hoodie.parquet.outputtimestamptype': 'TIMESTAMP_MICROS',
> }
> spark.sparkContext.setLogLevel("WARN")
> df = spark.createDataFrame(pandas_df)
> df.write.format("hudi").options(**hoodie_properties).mode("overwrite").save(PATH)
> spark.read.options(**hoodie_properties).format("hudi").load(PATH).select("_hoodie_partition_path",
>  "EventTime").show(10, False)
> ```
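
The reported output matches what a millisecond-precision round trip would produce. The sketch below is only an illustration of that hypothesis, not Hudi's actual code path: it assumes (unconfirmed by the report) that the timestamp is held at millisecond precision between parsing and formatting. The helper name `roundtrip_via_millis` is hypothetical, and `%f` stands in for the `SSSSSS` pattern.

```python
from datetime import datetime

# Python analogue of the 'yyyy-MM-dd HH:mm:ss.SSSSSS' pattern from the report.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def roundtrip_via_millis(value: str) -> str:
    """Parse a timestamp, keep only millisecond precision, and format it again.

    Assumption: this mimics an intermediate representation that stores
    milliseconds only; it is not taken from the Hudi source.
    """
    parsed = datetime.strptime(value, FMT)
    # Drop everything below one millisecond (661 microseconds in the example).
    millis_only = (parsed.microsecond // 1000) * 1000
    return parsed.replace(microsecond=millis_only).strftime(FMT)


observed = roundtrip_via_millis("2023-03-04 14:44:42.046661")
print(observed)  # 2023-03-04 14:44:42.046000
```

If this hypothesis holds, the six-digit output format cannot restore the lost digits; it only pads the millisecond value with zeros.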



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
