[GitHub] [hudi] rmpifer opened a new issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

GitBox Wed, 12 Aug 2020 14:44:23 -0700


rmpifer opened a new issue #1958:
URL: https://github.com/apache/hudi/issues/1958



   **Describe the problem you faced**
   
   When using global indexes (GLOBAL_BLOOM, HBASE) and a records partition 
value has been updated during an upsert the record gets written to the previous 
partition folder. This causes issues when querying Hive _ro and _rt tables as 
it displays the new data but with the previous partition value. We can see this 
by looking at the result of the query for 'ad' partition column and the Hudi 
data stored
   ```
   scala> spark.sql("select * from gb_partition_update_ro").show()
   
+-------------------+--------------------+------------------+----------------------+--------------------+--------+---------+----------------+-------------+-------------+
   
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|
   _hoodie_file_name|     wbn|    cs_ss|     action_date|   ad_updated|         
  ad|
   
+-------------------+--------------------+------------------+----------------------+--------------------+--------+---------+----------------+-------------+-------------+
   |     20200812204404|  20200812204404_0_1|          12345678|         
2020-08-06-12|5ecf97ef-b9d9-4e3...|12345678|Delivered|1596716921000605|2020-08-06-13|2020-08-06-12|
   
   scala> 
spark.read.format("parquet").load("s3://ryanpife-emr-dev/hudi/data/delhivery_data_hudi_mor_global_bloom/1/2020-08-06-12/5ecf97ef-b9d9-4e3f-b586-0140f6ebdd3a-0_0-97-39036_20200812204404.parquet").show()
   
+-------------------+--------------------+------------------+----------------------+--------------------+--------+---------+----------------+-------------+-------------+
   
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|
   _hoodie_file_name|     wbn|    cs_ss|     action_date|           ad|   
ad_updated|
   
+-------------------+--------------------+------------------+----------------------+--------------------+--------+---------+----------------+-------------+-------------+
   |     20200812204404|  20200812204404_0_1|          12345678|         
2020-08-06-12|5ecf97ef-b9d9-4e3...|12345678|Delivered|1596716921000605|2020-08-06-13|2020-08-06-13|
   ```
   
   This is probably due to the table definition in hive reading in partition 
value from folder location
   
   ```
   hive> SHOW CREATE TABLE gb_partition_update_ro;
   OK
   CREATE EXTERNAL TABLE `gb_partition_update_ro`(
     `_hoodie_commit_time` string, 
     `_hoodie_commit_seqno` string, 
     `_hoodie_record_key` string, 
     `_hoodie_partition_path` string, 
     `_hoodie_file_name` string, 
     `wbn` int, 
     `cs_ss` string, 
     `action_date` bigint, 
     `ad_updated` string)
   PARTITIONED BY ( 
     `ad` string)
   ROW FORMAT SERDE 
     'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
   STORED AS INPUTFORMAT 
     'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
   OUTPUTFORMAT 
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION
     's3://ryanpife-emr-dev/hudi/data/delhivery_data_hudi_mor_global_bloom/1'
   TBLPROPERTIES (
     'last_commit_time_sync'='20200812204404', 
     'transient_lastDdlTime'='1597264976')
   ```
   
   This does not seem like correct behavior as the query results return new 
data with stale partition column value
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create table with global index
   2. Insert a record
   3. Upsert same record key with new partition value
   4. Query Hive table (_ro or _rt) for record key and previous partition value 
will be displayed
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : 0.5.2
   
   * EMR version : 5.30.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] rmpifer opened a new issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

Reply via email to