Istvan Darvas created HUDI-3490:
-----------------------------------
Summary: Timestamp conversion (parquet)
Key: HUDI-3490
URL: https://issues.apache.org/jira/browse/HUDI-3490
Project: Apache Hudi
Issue Type: Bug
Reporter: Istvan Darvas
Hi guys,
My environment is Hudi 0.8.0 on AWS EMR 6.4.
Timestamp conversion seems very confusing and is not consistent across
the tools.
1.) The DeltaStreamer default appears to be TIMESTAMP_MILLIS
2.) The PySpark/Hudi API default appears to be TIMESTAMP_MICROS
But the real issue for me is that I cannot control this.
Neither in DeltaStreamer:
--hoodie-conf hoodie.parquet.outputtimestamptype=TIMESTAMP_MICROS
Nor in PySpark:
{"hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS"}
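To make the two attempts above easier to compare, here is a minimal sketch that derives both forms from a single value. The helper names (hudi_timestamp_conf, as_deltastreamer_args) are hypothetical and not part of Hudi; only the key hoodie.parquet.outputtimestamptype and the --hoodie-conf flag come from this report. It only keeps the two invocations in sync; it does not change whether Hudi actually honors the setting.

```python
# Hypothetical helpers; only the config key and the --hoodie-conf flag
# are taken from the report above.
TIMESTAMP_TYPE_KEY = "hoodie.parquet.outputtimestamptype"
ALLOWED_TYPES = {"TIMESTAMP_MILLIS", "TIMESTAMP_MICROS"}


def hudi_timestamp_conf(output_type):
    """Return the option dict to pass to a PySpark Hudi write."""
    if output_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported timestamp type: {output_type!r}")
    return {TIMESTAMP_TYPE_KEY: output_type}


def as_deltastreamer_args(conf):
    """Render the same options as DeltaStreamer command-line arguments."""
    args = []
    for key, value in conf.items():
        args += ["--hoodie-conf", f"{key}={value}"]
    return args


conf = hudi_timestamp_conf("TIMESTAMP_MICROS")
print(conf)
print(" ".join(as_deltastreamer_args(conf)))
```

The dict can be merged into the options passed to the PySpark writer, and the argument list appended to the DeltaStreamer command line.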
So I am not able to set one default across systems. Of course I can
convert the values myself, and I will do that as a workaround, but it would
be great to have this convenient feature.
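The manual workaround mentioned above could look roughly like the sketch below. It assumes the timestamps are available as integer epoch values (milliseconds or microseconds since 1970-01-01 UTC); the function names are illustrative and not taken from Hudi or Spark.

```python
from datetime import datetime, timedelta, timezone

MICROS_PER_MILLI = 1_000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_micros(ts_millis):
    """Widen an epoch-millisecond value to epoch microseconds (lossless)."""
    return ts_millis * MICROS_PER_MILLI


def micros_to_millis(ts_micros):
    """Narrow epoch microseconds to milliseconds.

    Floor division drops the sub-millisecond part, so values before the
    epoch round toward the earlier instant.
    """
    return ts_micros // MICROS_PER_MILLI


def micros_to_datetime(ts_micros):
    """Convert epoch microseconds to a timezone-aware UTC datetime."""
    return UNIX_EPOCH + timedelta(microseconds=ts_micros)


sample_micros = 1_645_000_000_123_456
print(micros_to_millis(sample_micros))
print(millis_to_micros(micros_to_millis(sample_micros)))
print(micros_to_datetime(sample_micros).isoformat())
```

Note that the millisecond form cannot represent the last three digits of a microsecond value, so converting to millis and back is not a round trip.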
One more suggestion / idea:
I do not know whether it is possible, but maybe this parameter
(hoodie.parquet.outputtimestamptype) could be removed everywhere, and the
framework could use the high-level contract from the Spark framework, which is
spark.sql.parquet.outputTimestampType = TIMESTAMP_MILLIS / TIMESTAMP_MICROS.
The default storage there is INT96, which is not compatible with Avro, but
here I think you could do some automatic conversion, which would be well
documented :)
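As an illustration of what such an automatic conversion involves, here is a minimal sketch that decodes a Parquet INT96 timestamp into epoch microseconds. It assumes the commonly used layout (12 little-endian bytes: 8 bytes of nanoseconds within the day, then a 4-byte Julian day number); the function names are hypothetical, and this is not how Hudi or Spark implement it.

```python
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
MICROS_PER_DAY = 86_400_000_000
NANOS_PER_MICRO = 1_000


def int96_to_epoch_micros(raw):
    """Decode a 12-byte INT96 timestamp into epoch microseconds.

    Assumed layout: little-endian int64 nanoseconds-of-day followed by a
    little-endian int32 Julian day. Sub-microsecond precision is dropped.
    """
    if len(raw) != 12:
        raise ValueError(f"INT96 value must be 12 bytes, got {len(raw)}")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days_since_epoch * MICROS_PER_DAY + nanos_of_day // NANOS_PER_MICRO


def epoch_micros_to_int96(ts_micros):
    """Encode epoch microseconds as a 12-byte INT96 value (same layout)."""
    days_since_epoch, micros_of_day = divmod(ts_micros, MICROS_PER_DAY)
    return struct.pack(
        "<qi",
        micros_of_day * NANOS_PER_MICRO,
        days_since_epoch + JULIAN_DAY_OF_UNIX_EPOCH,
    )


encoded = epoch_micros_to_int96(1_645_000_000_123_456)
print(encoded.hex())
print(int96_to_epoch_micros(encoded))
```

The resulting microsecond value fits an Avro timestamp-micros long; dividing by 1,000 would give the timestamp-millis form.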
In summary, I am confused, and I am not able to use the automatic conversion
of timestamps across the systems, so this should be standardized.
Thanks,
Darvi