[ 
https://issues.apache.org/jira/browse/HUDI-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496713#comment-17496713
 ] 

Istvan Darvas edited comment on HUDI-3490 at 2/23/22, 12:24 PM:
----------------------------------------------------------------

DeltaStreamer from Kafka (JSON) => Hudi table on S3

config:

  hoodie.parquet.outputtimestamptype=TIMESTAMP_MILLIS

file based target schema:

{
 "name": "report_time",
  "type": {
  "type": "long",
  "logicalType": "timestamp-millis"
 }
},

---

Schema of the written Parquet file, as inspected with parquet-tools:

 
############ Column(receive_time) ############
name: receive_time
path: receive_time
max_definition_level: 0
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, 
is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS

 

It seems the writer does not respect the config: neither the hoodie conf nor the 
logical type in the Avro target schema.
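
One way to make the mismatch explicit is to compare the unit declared in the Avro target schema with the unit parquet-tools reports. The following is only a minimal sketch and not part of Hudi: the helper names `expected_unit` and `reported_unit` are hypothetical, and the parsing assumes the `logical_type: Timestamp(... timeUnit=...)` format shown in the output above.

```python
import json
import re

# Avro timestamp logical type -> Parquet time unit it corresponds to.
AVRO_TO_PARQUET_UNIT = {
    "timestamp-millis": "milliseconds",
    "timestamp-micros": "microseconds",
}


def expected_unit(avro_field_json: str) -> str:
    """Return the Parquet time unit implied by an Avro field definition."""
    field = json.loads(avro_field_json)
    return AVRO_TO_PARQUET_UNIT[field["type"]["logicalType"]]


def reported_unit(parquet_tools_output: str) -> str:
    """Extract the timeUnit value from a parquet-tools column dump."""
    match = re.search(r"timeUnit=(\w+)", parquet_tools_output)
    if match is None:
        raise ValueError("no timeUnit found in parquet-tools output")
    return match.group(1)


# Field definition from the target schema in this comment.
avro_field = (
    '{"name": "report_time", '
    '"type": {"type": "long", "logicalType": "timestamp-millis"}}'
)

# logical_type line copied from the parquet-tools output in this comment.
inspect_output = (
    "logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, "
    "is_from_converted_type=true, force_set_converted_type=false)"
)

# True when the written file disagrees with the declared schema.
mismatch = expected_unit(avro_field) != reported_unit(inspect_output)
print(mismatch)
```

For the schema and output quoted here, the declared unit is milliseconds while the file reports microseconds, so the check flags a mismatch.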


was (Author: JIRAUSER282551):
DeltaStreamer from Kafka/Json => S3/Hudi table

config:

  hoodie.parquet.outputtimestamptype=TIMESTAMP_MILLIS

file based target schema:

{
 "name": "report_time",
  "type": {
  "type": "long",
  "logicalType": "timestamp-millis"
 }
},

---

sinked parquet file schema inspect: (parquet tools)

 

############ Column(receive_time) ############
name: receive_time
path: receive_time
max_definition_level: 0
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, 
is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS

 

it seems it does not respect the config

> Timestamp conversion (parquet)
> ------------------------------
>
>                 Key: HUDI-3490
>                 URL: https://issues.apache.org/jira/browse/HUDI-3490
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Istvan Darvas
>            Priority: Major
>
> Hi Guys!
>  
> My environment is Hudi 0.8.0 on AWS EMR 6.4.
>  
> Timestamp conversion seems very confusing and inconsistent across 
> the tools:
> 1.) the DeltaStreamer default seems to be TIMESTAMP_MILLIS
> 2.) the PySpark/Hudi API default seems to be TIMESTAMP_MICROS
>  
> But the real issue for me is that I cannot control this.
>  
> Neither in DeltaStreamer:
>  --hoodie-conf hoodie.parquet.outputtimestamptype=TIMESTAMP_MICROS
> nor in PySpark:
> {"hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS"}
>  
> So I am not able to set one default across systems. Of course I can 
> convert it myself, and I will do that as a workaround, but it would be great to 
> have this convenient feature.
>  
> One more suggestion / idea:
> I do not know whether it is possible, but maybe this parameter 
> (hoodie.parquet.outputtimestamptype) could be removed everywhere, and 
> the framework could use the high-level contract from the Spark framework, 
> which is
>    spark.sql.parquet.outputTimestampType = TIMESTAMP_MILLIS / TIMESTAMP_MICROS
>    The storage there is INT96, which is not compatible with Avro, but here I think 
> you could do some automatic conversion, which would be well documented :)
>  
> In summary, I am confused and unable to use the automatic conversion 
> of timestamps across the systems, so this should be standardized.
>  
> Thanks,
>  Darvi
>  
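
The manual workaround mentioned in the description (converting the values yourself) could look roughly like the sketch below. It assumes the column holds epoch microseconds as 64-bit integers; the function names are hypothetical and not part of Hudi or Spark.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware datetime, used as the base for conversions.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_millis(ts_micros: int) -> int:
    """Convert epoch microseconds to epoch milliseconds.

    Integer floor division drops the sub-millisecond part without any
    floating-point rounding.
    """
    return ts_micros // 1000


def millis_to_datetime(ts_millis: int) -> datetime:
    """Turn epoch milliseconds into a UTC datetime for inspection."""
    return EPOCH + timedelta(milliseconds=ts_millis)


# Example value: an epoch-microsecond timestamp as it might be read back.
sample_micros = 1645619040123456
sample_millis = micros_to_millis(sample_micros)
print(sample_millis, millis_to_datetime(sample_millis).isoformat())
```

Keeping the conversion in integer arithmetic avoids precision loss, which matters when the same values are later compared across tools that write different units.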


