cdmikechen commented on PR #3391:
URL: https://github.com/apache/hudi/pull/3391#issuecomment-1128892203

   @xiarixiaoyao 
   Hi~ These are very good questions!
    
   > 1. Do we really need HudiAvroParquetInputFormat; how about modify 
HoodieRealtimeRecordReaderUtils.avroToArrayWritable directly
   
   `org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat`, which 
`HoodieParquetInputFormatBase` extends, and `ParquetRecordReaderWrapper` can 
only read Parquet files without Avro logical types. So we need to create a new 
`RecordReader` to transform Parquet rows into Avro rows, and then we can use 
the Avro schema to parse those rows.
   There is another way to solve this problem: create a new SerDe class to read 
parquet-avro files. But that approach is more expensive and requires changing 
hive-site.xml. At the same time, we may not be able to modify the Hive schema.
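   To illustrate why the Avro logical type matters, here is a minimal sketch (not the actual PR code; `fromTimestampMillis` is a hypothetical helper): a plain Parquet reader sees a `timestamp-millis` column only as a raw long, while a schema-aware reader can convert the same value into a real timestamp.

   ```java
   import java.sql.Timestamp;
   import java.time.Instant;

   public class TimestampMillisDemo {
       // Hypothetical helper: what a schema-aware RecordReader would do with a
       // timestamp-millis logical type that a plain Parquet reader leaves as long.
       static Timestamp fromTimestampMillis(long epochMillis) {
           return Timestamp.from(Instant.ofEpochMilli(epochMillis));
       }

       public static void main(String[] args) {
           long raw = 1652918400000L; // the value as a plain Parquet reader sees it
           System.out.println(fromTimestampMillis(raw).toInstant()); // 2022-05-19T00:00:00Z
       }
   }
   ```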
   
https://github.com/apache/hudi/blob/99555c897acf9bdd576e7ab233dc448d537e7aea/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java#L273-L278
   I submitted a PR to Hive to solve the problem of reading the timestamp 
type, but perhaps because I don't know how the Hive community handles issues, 
that PR was closed. https://github.com/apache/hive/pull/871 
   https://issues.apache.org/jira/projects/HIVE/issues/HIVE-22224 
   I also submitted a related PR to the Drill community to solve the problem of 
reading timestamps. https://github.com/apache/drill/pull/2431
   After I found this problem, and after several different attempts, I found 
that the current method is a simple and easy-to-understand way of handling it 
(it may not be the purest method, because many methods in the Hive code are 
final, so we cannot extend them).
   
   > 2. Do we support time zone?
   
   There is a [TimestampTZ 
class](https://github.com/apache/hive/blob/branch-3.1/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZ.java)
 in Hive. However, in past tests I found that even when a column is not 
explicitly declared with a time-zone type, Hive displays the timestamp 
according to the current time zone in query results (I will run another test 
tomorrow to verify this).
   In fact, when it is converted to a long, it is only expressed as UTC 
time. In some specific cases, we can use Hive's functions to do our own 
transformation.
   If the time zone turns out to be a meaningful type, I think we can add 
support for it in Spark and Flink together later.
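   A minimal sketch of this point (illustration only, not Hive internals): the stored long identifies an instant on the UTC timeline, and only the rendering depends on the reader's time zone. The `render` helper below is hypothetical.

   ```java
   import java.time.Instant;
   import java.time.ZoneId;
   import java.time.format.DateTimeFormatter;

   public class ZoneDisplayDemo {
       // The stored long is zone-agnostic (an instant in UTC);
       // only the formatted output depends on the chosen time zone.
       static String render(long epochMillis, String zone) {
           return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                   .withZone(ZoneId.of(zone))
                   .format(Instant.ofEpochMilli(epochMillis));
       }

       public static void main(String[] args) {
           long stored = 0L; // 1970-01-01T00:00:00Z, however it is displayed
           System.out.println(render(stored, "UTC"));           // 1970-01-01 00:00:00
           System.out.println(render(stored, "Asia/Shanghai")); // 1970-01-01 08:00:00
       }
   }
   ```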

