fireking77 opened a new issue #4205:
URL: https://github.com/apache/hudi/issues/4205


   **Describe the problem you faced**
   Hi Guys,
   
   My config:
    Amazon EMR 6.4 with Hudi 0.8
   
   I am new to Hudi. When I checked the schema of the materialized Parquet 
files, I found something interesting in the field metadata.
   This may not be Hudi-related; it may come from the Parquet writer class 
that is used.
   
   example field in the HUDI materialized parquet file:
   `{
         "Field_id": 0,
         "Name": "correlation_id",
         "Type": "BYTE_ARRAY",
         "Type_length": 0,
         "LogicalType": null,
         "Scale": 0,
         "Precision": 0,
         "Repetition_type": "REQUIRED",
         "Converted_type": "UTF8"
       }`
   
   And I think this would be the right one:
   
   `    {
         "Field_id": 0,
         "Name": "correlation_id",
         "Type": "BYTE_ARRAY",
         "Type_length": 0,
         "LogicalType": "LogicalType(STRING: StringType())",
         "Scale": 0,
         "Precision": 0,
         "Repetition_type": "REQUIRED",
         "Converted_type": "UTF8"
       }`
   
   It seems LogicalType is always _null_, but the Avro schema, which is also 
included in the Parquet file, is OK.
   https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
   
   parquet schema with a time field:
   `{
         "Field_id": 0,
         "Name": "report_time",
         "Type": "INT64",
         "Type_length": 0,
         "LogicalType": "LogicalType(STRING: StringType())",
         "Scale": 0,
         "Precision": 0,
         "Repetition_type": "REQUIRED",
         "Converted_type": "TIMESTAMP_MICROS"
       }`
   
   Avro schema included in the Parquet file metadata:
   `{
    "type":"long",
    "logicalType":"timestamp-micros"
    }`
   
   which would translate into this
   `{
         "Field_id": 0,
         "Name": "report_time",
         "Type": "INT64",
         "Type_length": 0,
         "LogicalType": "LogicalType(TimestampType(isAdjustedToUTC = true, unit = MICROS))",
         "Scale": 0,
         "Precision": 0,
         "Repetition_type": "REQUIRED",
         "Converted_type": "TIMESTAMP_MICROS"
       }`
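
The translation described above can be sketched in a few lines of Python. This is only an illustration, not code from Hudi or any Parquet writer: the function name `expected_logical_type` is hypothetical, the field dictionaries mimic the schema dumps shown in this issue, and the legacy-to-new pairs follow the backward-compatibility notes in the LogicalTypes.md document linked earlier.

```python
from typing import Optional

# Legacy ConvertedType -> newer LogicalType annotation, written as plain strings.
# Assumption: only the handful of types relevant to this issue are listed.
LEGACY_TO_LOGICAL = {
    "UTF8": "STRING",
    "DATE": "DATE",
    "TIME_MILLIS": "TIME(isAdjustedToUTC=true, unit=MILLIS)",
    "TIME_MICROS": "TIME(isAdjustedToUTC=true, unit=MICROS)",
    "TIMESTAMP_MILLIS": "TIMESTAMP(isAdjustedToUTC=true, unit=MILLIS)",
    "TIMESTAMP_MICROS": "TIMESTAMP(isAdjustedToUTC=true, unit=MICROS)",
}


def expected_logical_type(field: dict) -> Optional[str]:
    """Return the LogicalType a reader could infer for one schema field.

    Hypothetical helper: prefers an explicit LogicalType and otherwise
    falls back to the legacy Converted_type annotation.
    """
    if field.get("LogicalType"):
        return field["LogicalType"]
    return LEGACY_TO_LOGICAL.get(field.get("Converted_type"))


# A field shaped like the ones in the schema dump, with LogicalType left null.
report_time = {
    "Name": "report_time",
    "Type": "INT64",
    "LogicalType": None,
    "Converted_type": "TIMESTAMP_MICROS",
}
print(expected_logical_type(report_time))
```

A reader that only looks at `LogicalType` sees nothing for this field, while one that falls back to `Converted_type` can still recover the timestamp annotation.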
   
   If the logical type is set correctly (not null), SQL engines can use it, and 
there is no need for from_unixtime or other conversions in the query. 
   If this is not an issue you can handle, just close this one.
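
For illustration, this is roughly the manual conversion a consumer has to do when the column is exposed as a plain INT64 of microseconds since the Unix epoch. It is a minimal sketch in Python rather than SQL; the helper name `micros_to_datetime` and the sample value are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware UTC datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_datetime(micros: int) -> datetime:
    """Convert microseconds since the epoch to a UTC datetime.

    Hypothetical helper; integer timedelta arithmetic avoids the
    rounding that a float division by 1e6 could introduce.
    """
    return EPOCH + timedelta(microseconds=micros)


# Sample raw INT64 value as it would appear without a timestamp annotation.
raw_report_time = 1_638_316_800_000_000
print(micros_to_datetime(raw_report_time).isoformat())
```

With a proper timestamp annotation on the column, an engine could return the datetime directly and this step would not be needed.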
   
   Thanks,
    Darvi
   

