[
https://issues.apache.org/jira/browse/HUDI-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496713#comment-17496713
]
Istvan Darvas edited comment on HUDI-3490 at 2/23/22, 12:24 PM:
----------------------------------------------------------------
DeltaStreamer from Kafka/Json => S3/Hudi table
config:
hoodie.parquet.outputtimestamptype=TIMESTAMP_MILLIS
file based target schema:
{
"name": "report_time",
"type": {
"type": "long",
"logicalType": "timestamp-millis"
}
},
---
Schema of the written Parquet file, as inspected with parquet-tools:
############ Column(receive_time) ############
name: receive_time
path: receive_time
max_definition_level: 0
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds,
is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
It seems the writer respects neither the Hudi config nor the Avro target
schema: the column is written as TIMESTAMP_MICROS.
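To illustrate why the unit mismatch above matters, here is a minimal sketch using only the Python standard library. It is not Hudi or Parquet code; `decode_epoch` and the sample instant are hypothetical names and values chosen for illustration. It shows what happens when an INT64 value written as epoch milliseconds is read as if it were microseconds.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Ticks per second for the two INT64 timestamp units discussed in this issue.
TICKS_PER_SECOND = {"millis": 1_000, "micros": 1_000_000}


def decode_epoch(value: int, unit: str) -> datetime:
    """Interpret an INT64 epoch value in the given unit as a UTC datetime."""
    ticks = TICKS_PER_SECOND[unit]
    seconds, remainder = divmod(value, ticks)
    # Scale the sub-second remainder up to microseconds for timedelta.
    micros = remainder * (1_000_000 // ticks)
    return EPOCH + timedelta(seconds=seconds, microseconds=micros)


# Hypothetical sample instant, chosen only for illustration.
instant = datetime(2022, 2, 23, 12, 24, tzinfo=timezone.utc)
as_millis = (instant - EPOCH) // timedelta(milliseconds=1)

right = decode_epoch(as_millis, "millis")  # unit matches how the value was produced
wrong = decode_epoch(as_millis, "micros")  # unit taken from a mismatched file schema

print(right.isoformat())
print(wrong.isoformat())
```

With the matching unit the original instant comes back unchanged; with the mismatched unit the same number decodes to a date in January 1970, which is why the unit recorded in the file schema has to agree with what readers expect.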
> Timestamp conversion (parquet)
> ------------------------------
>
> Key: HUDI-3490
> URL: https://issues.apache.org/jira/browse/HUDI-3490
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Istvan Darvas
> Priority: Major
>
> Hi Guys!
>
> My environment is Hudi 0.8.0 on AWS EMR 6.4.
>
> Timestamp conversion seems confusing and inconsistent across the tools:
> 1.) the DeltaStreamer default seems to be TIMESTAMP_MILLIS
> 2.) the PySpark/Hudi API default is TIMESTAMP_MICROS
>
> But the real issue for me is that I cannot control this.
>
> Neither in DeltaStreamer:
> --hoodie-conf hoodie.parquet.outputtimestamptype=TIMESTAMP_MICROS
> nor in PySpark:
> {"hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS"}
>
> So I am not able to set one default across systems. Of course I can
> convert the values myself, and I will do that as a workaround, but it would
> be great to have this convenient feature.
>
> One more suggestion / idea:
> I do not know whether it is possible, but maybe this parameter
> (hoodie.parquet.outputtimestamptype) could be removed everywhere, and
> the framework could use the high-level setting from Spark instead,
> which is
> spark.sql.parquet.outputTimestampType = TIMESTAMP_MILLIS / TIMESTAMP_MICROS
> The default storage there is INT96, which is not compatible with Avro, but
> I think you could do some automatic conversion here, which would be well
> documented :)
>
> In summary, I am confused and I am not able to use the automatic conversion
> of timestamps across the systems, so this should be standardized.
>
> Thanks,
> Darvi
>
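The manual conversion mentioned as a workaround in the description could look roughly like the sketch below. It is plain Python with hypothetical helper names (`micros_to_millis`, `millis_to_micros`), not a Hudi or Spark API, and it assumes the timestamps are available as integer epoch values.

```python
def micros_to_millis(value_micros: int) -> int:
    """Convert epoch microseconds to epoch milliseconds.

    Floor division drops the sub-millisecond digits and rounds toward
    negative infinity, so instants before 1970 stay on or before the
    original instant.
    """
    return value_micros // 1_000


def millis_to_micros(value_millis: int) -> int:
    """Convert epoch milliseconds to epoch microseconds (lossless)."""
    return value_millis * 1_000


# Hypothetical sample value in epoch microseconds.
sample_micros = 1_645_619_040_123_456
sample_millis = micros_to_millis(sample_micros)
print(sample_millis)
print(millis_to_micros(sample_millis))
```

Note that going from microseconds to milliseconds loses precision, so a round trip does not restore the original value; that is one reason a single, documented default across the tools would be preferable to ad hoc conversion.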
--
This message was sent by Atlassian Jira
(v8.20.1#820001)