[
https://issues.apache.org/jira/browse/HUDI-7955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Danny Chen closed HUDI-7955.
----------------------------
Resolution: Fixed
Fixed via master branch: b0e0efcc5f9da5449a5b6310a7de7996b01318cb
> Account for WritableTimestampObjectInspector#getPrimitiveJavaObject Hive3 and
> Hive2 discrepancies
> -------------------------------------------------------------------------------------------------
>
> Key: HUDI-7955
> URL: https://issues.apache.org/jira/browse/HUDI-7955
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Assignee: voon
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
> Attachments: image-2024-07-05-18-11-33-420.png,
> image-2024-07-05-18-13-28-135.png
>
>
> The invocation of *getPrimitiveJavaObject* returns a different timestamp implementation in Hive2 and Hive3:
> - Hive2: *java.sql.Timestamp*
> - Hive3: *org.apache.hadoop.hive.common.type.Timestamp*
> hudi-common is compiled against Hive2, but Trino runs with Hive3, causing a
> discrepancy between the compile-time and runtime signatures. When execution
> reaches this section of the code, i.e. when all of the trigger conditions
> below hold:
> 1. A MOR table is used
> 2. The user is querying the _rt table
> 3. The table has a *TIMESTAMP* column and the query selects it
> 4. A merge is required because the record is present in both the Parquet and the log file
> the error below will be thrown:
> {code:java}
> Query 20240704_075218_05052_yfmfc failed: 'java.sql.Timestamp org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(java.lang.Object)'
> java.lang.NoSuchMethodError: 'java.sql.Timestamp org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(java.lang.Object)'
>     at org.apache.hudi.hadoop.utils.HiveAvroSerializer.serializePrimitive(HiveAvroSerializer.java:304)
>     at org.apache.hudi.hadoop.utils.HiveAvroSerializer.serialize(HiveAvroSerializer.java:212)
>     at org.apache.hudi.hadoop.utils.HiveAvroSerializer.setUpRecordFieldFromWritable(HiveAvroSerializer.java:121)
>     at org.apache.hudi.hadoop.utils.HiveAvroSerializer.serialize(HiveAvroSerializer.java:108)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.convertArrayWritableToHoodieRecord(RealtimeCompactedRecordReader.java:185)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.mergeRecord(RealtimeCompactedRecordReader.java:172)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:114)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:49)
>     at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:88)
>     at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
>     at io.trino.plugin.hive.GenericHiveRecordCursor.advanceNextPosition(GenericHiveRecordCursor.java:215)
>     at io.trino.spi.connector.RecordPageSource.getNextPage(RecordPageSource.java:88)
>     at io.trino.plugin.hudi.HudiPageSource.getNextPage(HudiPageSource.java:120){code}
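The NoSuchMethodError above arises because JVM linkage resolves the full method descriptor, including the return type: Hudi's bytecode references the Hive2 descriptor (returning java.sql.Timestamp), which does not exist in Hive3's class. A reflective probe along these lines (a hedged, self-contained sketch; the class and method here are illustrative, not part of Hudi, and the probe degrades to "unknown" when Hive is not on the classpath) can detect at runtime which variant is present:

```java
import java.lang.reflect.Method;

public class HiveTimestampProbe {

  /**
   * Returns "hive2", "hive3", or "unknown" based on the runtime return type of
   * WritableTimestampObjectInspector#getPrimitiveJavaObject(Object).
   */
  public static String probe() {
    try {
      Class<?> oi = Class.forName(
          "org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector");
      Method m = oi.getMethod("getPrimitiveJavaObject", Object.class);
      String ret = m.getReturnType().getName();
      if ("java.sql.Timestamp".equals(ret)) {
        return "hive2";
      }
      if ("org.apache.hadoop.hive.common.type.Timestamp".equals(ret)) {
        return "hive3";
      }
      return "unknown";
    } catch (ReflectiveOperationException e) {
      // Hive jars are not on the classpath at all.
      return "unknown";
    }
  }

  public static void main(String[] args) {
    System.out.println(probe());
  }
}
```

Branching on such a probe (or shimming the call behind it) is what keeps a single Hudi artifact linkable against both Hive lines.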
> h1. Hive3
> !image-2024-07-05-18-11-33-420.png|width=509,height=572!
> h1. Hive2
> !image-2024-07-05-18-13-28-135.png|width=507,height=501!
>
> h1. How to reproduce
>
> *Note*: Unless otherwise specified in the comments, the SQL statements should
> be executed in Spark SQL.
> {code:java}
> CREATE TABLE dev_hudi.hudi_7955__hive3_timestamp_issue (
> id INT,
> name STRING,
> timestamp_col TIMESTAMP,
> grass_region STRING
> ) USING hudi
> PARTITIONED BY (grass_region)
> tblproperties (
> primaryKey = 'id',
> type = 'mor',
> precombineField = 'id',
> hoodie.index.type = 'BUCKET',
> hoodie.index.bucket.engine = 'CONSISTENT_HASHING',
> hoodie.compact.inline = 'true'
> )
> LOCATION 'hdfs://path/to/hudi_tables/hudi_7955__hive3_timestamp_issue';
> -- 5 separate commits to trigger compaction
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (1, 'alex1',
> now(), 'SG');
> -- No error here as no MERGE is required between the Parquet and log file
> SELECT _hoodie_file_name, id, timestamp_col FROM
> dev_hudi.hudi_7955__hive3_timestamp_issue_rt WHERE _hoodie_file_name NOT LIKE
> '%parquet%';
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (2, 'alex2',
> now(), 'SG');
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (3, 'alex3',
> now(), 'SG');
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (4, 'alex4',
> now(), 'SG');
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (5, 'alex5',
> now(), 'SG');
> -- Should contain no rows here as the table is compacted and all rows are in the parquet file (RUN IN PRESTO/TRINO/HIVE3)
> SELECT _hoodie_file_name, id, timestamp_col FROM
> dev_hudi.hudi_7955__hive3_timestamp_issue_rt WHERE _hoodie_file_name NOT LIKE
> '%parquet%';
> -- 2 separate commits which will be written to logs (update)
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (4, 'alex4_1',
> now(), 'SG');
> INSERT INTO dev_hudi.hudi_7955__hive3_timestamp_issue VALUES (5, 'alex5_1',
> now(), 'SG');
> -- Not querying any TIMESTAMP columns, so no error will be thrown (RUN IN PRESTO/TRINO/HIVE3)
> SELECT _hoodie_file_name, id FROM
> dev_hudi.hudi_7955__hive3_timestamp_issue_rt WHERE _hoodie_file_name NOT LIKE
> '%parquet%';
> -- Error should be thrown here as we are including the timestamp column (RUN IN PRESTO/TRINO/HIVE3)
> SELECT _hoodie_file_name, id, timestamp_col FROM
> dev_hudi.hudi_7955__hive3_timestamp_issue_rt WHERE _hoodie_file_name NOT LIKE
> '%parquet%';{code}
>
> h1. Solution
> Hive shimming should also be applied when invoking *getPrimitiveJavaObject*, so that the Hive-version-specific return type is resolved at runtime instead of being linked at compile time.
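A reflection-based shim along the following lines could normalize either return type without binding to a Hive-version-specific signature. This is a hedged, self-contained sketch, not Hudi's actual shim classes; in particular, the toEpochMilli() method on the Hive3 Timestamp type is an assumption here:

```java
import java.sql.Timestamp;

public class TimestampShim {

  /**
   * Converts the object returned by
   * WritableTimestampObjectInspector#getPrimitiveJavaObject to epoch
   * milliseconds, handling both the Hive2 type (java.sql.Timestamp) and the
   * Hive3 type (org.apache.hadoop.hive.common.type.Timestamp) without a
   * compile-time dependency on the Hive3 class.
   */
  public static long toEpochMillis(Object hiveTimestamp) {
    if (hiveTimestamp instanceof Timestamp) {
      // Hive2 path: plain java.sql.Timestamp.
      return ((Timestamp) hiveTimestamp).getTime();
    }
    String cls = hiveTimestamp.getClass().getName();
    if ("org.apache.hadoop.hive.common.type.Timestamp".equals(cls)) {
      try {
        // Hive3 path: invoke toEpochMilli() reflectively so this class still
        // loads when only Hive2 is on the classpath.
        return (Long) hiveTimestamp.getClass()
            .getMethod("toEpochMilli").invoke(hiveTimestamp);
      } catch (ReflectiveOperationException e) {
        throw new IllegalStateException("Unsupported Hive3 Timestamp", e);
      }
    }
    throw new IllegalArgumentException("Unexpected timestamp type: " + cls);
  }

  public static void main(String[] args) {
    Timestamp ts = new Timestamp(1720080000000L);
    System.out.println(toEpochMillis(ts)); // prints 1720080000000
  }
}
```

With such a shim, HiveAvroSerializer.serializePrimitive would call the shim rather than the inspector's raw method, avoiding the NoSuchMethodError regardless of which Hive line is on the runtime classpath.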
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)