cdmikechen commented on PR #3391:
URL: https://github.com/apache/hudi/pull/3391#issuecomment-1128892203

   @xiarixiaoyao 
   Hi~ These are very good questions!
    
   > 1. Do we really need HudiAvroParquetInputFormat; how about modify 
HoodieRealtimeRecordReaderUtils.avroToArrayWritable directly
   
   `org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat`, which 
`HoodieParquetInputFormatBase` extends, and `ParquetRecordReaderWrapper` can 
only read Parquet files without Avro logical types. So we need to create a new 
`RecordReader` to transform Parquet rows into Avro rows, and then we can use 
the Avro schema to parse those rows.
   There is another way to solve this problem: create a new SerDe class to read 
parquet-avro files. But that approach is more expensive and requires changing 
hive-site.xml. At the same time, we may not be able to modify the Hive schema.
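   To illustrate why the Avro logical type matters, here is a minimal sketch (not the actual PR code; `fromTimestampMillis` is a hypothetical helper): a plain Parquet reader sees a `timestamp-millis` column only as a raw long, while a schema-aware reader can convert the same value into a real timestamp.

   ```java
   import java.sql.Timestamp;
   import java.time.Instant;

   public class TimestampMillisDemo {
       // Hypothetical helper: what a schema-aware RecordReader would do with a
       // timestamp-millis logical type that a plain Parquet reader leaves as long.
       static Timestamp fromTimestampMillis(long epochMillis) {
           return Timestamp.from(Instant.ofEpochMilli(epochMillis));
       }

       public static void main(String[] args) {
           long raw = 1652918400000L; // the value as a plain Parquet reader sees it
           System.out.println(fromTimestampMillis(raw).toInstant()); // 2022-05-19T00:00:00Z
       }
   }
   ```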
   
https://github.com/apache/hudi/blob/99555c897acf9bdd576e7ab233dc448d537e7aea/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java#L273-L278
   I submitted a PR to Hive to solve the problem of reading the timestamp 
type, but perhaps because I don't know how the Hive community handles issues, 
that PR was closed. https://github.com/apache/hive/pull/871 
   https://issues.apache.org/jira/projects/HIVE/issues/HIVE-22224 
   I also submitted a related PR to the Drill community to solve the problem of 
reading timestamps. https://github.com/apache/drill/pull/2431
   After I found this problem, and after several different attempts, I found 
that the current method is a simple and easy-to-understand way of handling it 
(it may not be the purest method, because many methods in the Hive code are 
final, so we cannot extend them).
   
   > 2. Do we support time zone?
   
   There is a [TimestampTZ 
class](https://github.com/apache/hive/blob/branch-3.1/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZ.java)
 in Hive. However, in past tests I found that even when a column is not 
explicitly declared with a time-zone type, Hive displays the timestamp 
according to the current time zone in query results (I will run another test 
tomorrow to verify this).
   In fact, when it is converted to a long, it is only expressed as UTC 
time. In some specific cases, we can use Hive's functions to do our own 
transformation.
   If the time zone turns out to be a meaningful type, I think we can add 
support for it in Spark and Flink together later.
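   A minimal sketch of this point (illustration only, not Hive internals): the stored long identifies an instant on the UTC timeline, and only the rendering depends on the reader's time zone. The `render` helper below is hypothetical.

   ```java
   import java.time.Instant;
   import java.time.ZoneId;
   import java.time.format.DateTimeFormatter;

   public class ZoneDisplayDemo {
       // The stored long is zone-agnostic (an instant in UTC);
       // only the formatted output depends on the chosen time zone.
       static String render(long epochMillis, String zone) {
           return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                   .withZone(ZoneId.of(zone))
                   .format(Instant.ofEpochMilli(epochMillis));
       }

       public static void main(String[] args) {
           long stored = 0L; // 1970-01-01T00:00:00Z, however it is displayed
           System.out.println(render(stored, "UTC"));           // 1970-01-01 00:00:00
           System.out.println(render(stored, "Asia/Shanghai")); // 1970-01-01 08:00:00
       }
   }
   ```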

