Istvan Darvas created HUDI-3490:
-----------------------------------

             Summary: Timestamp conversion (parquet)
                 Key: HUDI-3490
                 URL: https://issues.apache.org/jira/browse/HUDI-3490
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Istvan Darvas


Hi Guys!

 

My environment is Hudi 0.8.0 on AWS EMR 6.4.

 

Timestamp conversion seems very confusing and inconsistent across the 
tools.

1.) DeltaStreamer seems to default to TIMESTAMP_MILLIS

2.) The PySpark/Hudi API seems to default to TIMESTAMP_MICROS

 

But the real issue for me is that I cannot control this.

 

Neither in DeltaStreamer:

 --hoodie-conf hoodie.parquet.outputtimestamptype=TIMESTAMP_MICROS

Nor in PySpark:

{"hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS"}
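For reference, here is a minimal sketch of how the PySpark attempt above would be handed to the writer. The helper name hudi_write_options, the table name "my_table", and the value check are illustrative assumptions and not part of Hudi; only the hoodie.parquet.outputtimestamptype key comes from this report. It shows how the option is supplied, not that it takes effect.

```python
# The two values this report tries for hoodie.parquet.outputtimestamptype.
ALLOWED_TIMESTAMP_TYPES = {"TIMESTAMP_MILLIS", "TIMESTAMP_MICROS"}


def hudi_write_options(table_name: str, timestamp_type: str) -> dict:
    """Build an option dict for a Hudi write (hypothetical helper)."""
    if timestamp_type not in ALLOWED_TIMESTAMP_TYPES:
        raise ValueError(f"unsupported timestamp type: {timestamp_type}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.parquet.outputtimestamptype": timestamp_type,
    }


opts = hudi_write_options("my_table", "TIMESTAMP_MILLIS")
# With a real DataFrame `df` and a base path, the options would be passed as:
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
# (a real write also needs record key and precombine options, omitted here)
```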

 

So I am not able to set a common default across systems. Of course I can 
convert it myself, and I will do so as a workaround, but it would be great to 
have this convenient feature.
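A rough sketch of what the manual conversion could look like, assuming the timestamps are available as epoch integers or Python datetime values before writing. The function names are hypothetical, and truncating (rather than rounding) to milliseconds is an assumption made here so that both precisions agree on the same value.

```python
from datetime import datetime, timezone

MICROS_PER_MILLI = 1_000


def micros_to_millis(epoch_micros: int) -> int:
    """Convert epoch microseconds to epoch milliseconds, flooring the remainder."""
    return epoch_micros // MICROS_PER_MILLI


def millis_to_micros(epoch_millis: int) -> int:
    """Convert epoch milliseconds to epoch microseconds (lossless)."""
    return epoch_millis * MICROS_PER_MILLI


def truncate_to_millis(ts: datetime) -> datetime:
    """Drop sub-millisecond digits so MILLIS and MICROS storage hold the same value."""
    return ts.replace(microsecond=(ts.microsecond // 1000) * 1000)


sample = datetime(2022, 2, 16, 12, 0, 0, 123456, tzinfo=timezone.utc)
normalized = truncate_to_millis(sample)
```

Normalizing to the coarser precision before writing means a value read back from either tool compares equal, at the cost of losing the sub-millisecond digits.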

 

One more suggestion / idea:

I do not know whether it is possible, but maybe this parameter 
(hoodie.parquet.outputtimestamptype) could be removed everywhere, and the 
framework could use the high-level contract from the Spark framework, which is

   spark.sql.parquet.outputTimestampType = TIMESTAMP_MILLIS / TIMESTAMP_MICROS

   Spark's default for this setting is INT96, which is not compatible with 
Avro, but here I think you could do some automatic conversion, which would be 
well documented :)
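To make the idea concrete, here is a small sketch of the kind of resolution rule being suggested. It is not how Hudi behaves today; resolve_timestamp_type is a hypothetical function, and mapping INT96 to TIMESTAMP_MICROS is only an assumed choice for the "automatic conversion".

```python
SPARK_KEY = "spark.sql.parquet.outputTimestampType"
AVRO_COMPATIBLE = {"TIMESTAMP_MILLIS", "TIMESTAMP_MICROS"}


def resolve_timestamp_type(spark_conf: dict,
                           int96_fallback: str = "TIMESTAMP_MICROS") -> str:
    """Pick the Parquet timestamp type from the Spark setting (hypothetical rule)."""
    # Spark falls back to INT96 when the setting is not given.
    requested = spark_conf.get(SPARK_KEY, "INT96")
    if requested in AVRO_COMPATIBLE:
        return requested
    if requested == "INT96":
        # Assumed conversion: INT96 has no Avro equivalent, so substitute.
        return int96_fallback
    raise ValueError(f"unknown value for {SPARK_KEY}: {requested}")


from_millis = resolve_timestamp_type({SPARK_KEY: "TIMESTAMP_MILLIS"})
from_default = resolve_timestamp_type({})
```

With a single rule like this, DeltaStreamer and the PySpark writer would end up with the same type as long as they see the same Spark configuration.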

 

In summary, I am confused and unable to rely on automatic timestamp 
conversion across the systems, so this should be standardized.

 

Thanks,

 Darvi

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
