vicuna96 opened a new issue, #5942:
URL: https://github.com/apache/hudi/issues/5942

   
   **Describe the problem you faced**
   
Case 1.
We are currently trying to build a partial upsert pipeline with a global index (GLOBAL_BLOOM). The issue we face is that when setting `HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key() -> "true"`, we notice that **the columns not touched by the partial update are dropped / nullified**.
   
Case 2.
As an alternative, we are also exploring `HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key() -> "false"`. However, in this case we notice that while the metadata column `_hoodie_partition_path` does not get updated, our partition field does.
In our unit testing, this means that the record in question ends up with `_hoodie_partition_path: partitionField=2022-05-07` but `partitionField: 2022-05-08`. We are wondering whether there are any implications to this. For example, if any pruning is applied on `_hoodie_partition_path`, is a record with mismatched partition column info prone to inconsistencies?
   
   **To Reproduce**
   Case 1.
Define

```scala
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.functions.{col, to_date}
import org.apache.spark.sql.types.TimestampType

// Assumes an active SparkSession named `spark` for the toDF implicits.
import spark.implicits._

case class TestHudiTable(keyField: String, stringField: String, numberField: Int,
                         precombineField: Timestamp, partitionField: Date)

// Column-name constants used throughout (defined like this in our pipeline).
val keyField = "keyField"
val stringField = "stringField"
val numberField = "numberField"
val precombineField = "precombineField"
val partitionField = "partitionField"

val targetGlobalPartition = "2022-05-08"

val insertRecords = Seq(
  TestHudiTable("key1", "value3", 55, Timestamp.valueOf("2022-05-07 08:00:00"), Date.valueOf("2022-05-07")),
  TestHudiTable("key2", "value4", 66, Timestamp.valueOf("2022-05-07 09:00:00"), Date.valueOf("2022-05-07")),
  TestHudiTable("key3", "value4", 77, Timestamp.valueOf("2022-05-07 10:00:00"), Date.valueOf("2022-05-07")))

val insertDF = insertRecords.toDF(keyField, stringField, numberField, precombineField, partitionField)
  .withColumn(precombineField, col(precombineField).cast(TimestampType))
  .withColumn(partitionField, to_date(col(partitionField)))
```
Then run an initial insert of these records. Finally, test the partial upsert with the following records, using `org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload` as the payload class:
```scala
val partialUpdates = Seq(
  ("key1", "value5", "2022-05-07T11:00:00", "2022-05-07"),
  ("key3", "value6", "2022-05-07T12:00:00", targetGlobalPartition))
  .toDF(keyField, stringField, precombineField, partitionField)
  .withColumn(precombineField, col(precombineField).cast(TimestampType))
  .withColumn(partitionField, to_date(col(partitionField)))
```
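For reference, a minimal sketch of the Hudi write options we use in Case 1 (the table name and base path here are hypothetical; the spelled-out keys are what we believe the config constants above resolve to in Hudi 0.11):

```scala
// Sketch of the write options for Case 1. Values marked below are assumptions /
// placeholders, not our exact production settings.
val hudiOptions = Map(
  "hoodie.table.name"                           -> "test_hudi_table",  // placeholder
  "hoodie.datasource.write.recordkey.field"     -> "keyField",
  "hoodie.datasource.write.partitionpath.field" -> "partitionField",
  "hoodie.datasource.write.precombine.field"    -> "precombineField",
  "hoodie.datasource.write.payload.class"       -> "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload",
  "hoodie.index.type"                           -> "GLOBAL_BLOOM",
  // HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key(); "false" in Case 2
  "hoodie.bloom.index.update.partition.path"    -> "true")

// partialUpdates.write.format("hudi").options(hudiOptions).mode("append").save(basePath)
```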
   
Hence, we are testing a partial update that updates every column except numberField, which is left unset (null) in the incoming records.
**Before partial update to records corresponding to key1 and key3.**

```
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path   |_hoodie_file_name                                                         |keyField|stringField|numberField|precombineField    |partitionField|
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
|20220622111807610  |20220622111807610_0_3|keyField:key1     |partitionField=2022-05-07|80700093-395f-4915-b39a-f97ba7527688-0_0-220-196_20220622111813282.parquet|key1    |value3     |55         |2022-05-07 08:00:00|2022-05-07    |
|20220622111807610  |20220622111807610_0_4|keyField:key3     |partitionField=2022-05-07|80700093-395f-4915-b39a-f97ba7527688-0_0-220-196_20220622111813282.parquet|key3    |value4     |77         |2022-05-07 10:00:00|2022-05-07    |
|20220622111813282  |20220622111813282_0_4|keyField:key2     |partitionField=2022-05-07|80700093-395f-4915-b39a-f97ba7527688-0_0-220-196_20220622111813282.parquet|key2    |value4     |66         |2022-05-07 09:00:00|2022-05-07    |
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
```

**After partial update to key1 and key3, with the latter also updating the partition column.**

```
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path   |_hoodie_file_name                                                         |keyField|stringField|numberField|precombineField    |partitionField|
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
|20220622111818415  |20220622111818415_1_6|keyField:key3     |partitionField=2022-05-08|bce85553-9df0-4d02-a62c-0b2b81e58969-0_1-273-239_20220622111818415.parquet|key3    |value6     |null       |2022-05-07 12:00:00|2022-05-08    |
|20220622111818415  |20220622111818415_0_5|keyField:key1     |partitionField=2022-05-07|80700093-395f-4915-b39a-f97ba7527688-0_0-273-238_20220622111818415.parquet|key1    |value5     |55         |2022-05-07 11:00:00|2022-05-07    |
|20220622111813282  |20220622111813282_0_4|keyField:key2     |partitionField=2022-05-07|80700093-395f-4915-b39a-f97ba7527688-0_0-273-238_20220622111818415.parquet|key2    |value4     |66         |2022-05-07 09:00:00|2022-05-07    |
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
```
   We notice that the value for numberField is dropped upon the update.
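Our working theory for why this happens (an assumption on our part, not confirmed against the Hudi code): with update-partition-path enabled, a record whose partition changes is routed by the global index as a delete in the old partition plus an insert in the new one, so the payload never sees the stored record to merge with on the insert path. A toy sketch of the merge semantics in plain Scala Options (not the actual Avro payload API; field names are illustrative):

```scala
// Toy model of OverwriteNonDefaultsWithLatestAvroPayload-style merging.
case class Rec(stringField: Option[String], numberField: Option[Int])

// Partial merge: take each incoming value unless it is unset, else keep the stored one.
def overwriteNonDefaults(stored: Rec, incoming: Rec): Rec =
  Rec(incoming.stringField.orElse(stored.stringField),
      incoming.numberField.orElse(stored.numberField))

val stored   = Rec(Some("value4"), Some(77))  // key3 as originally inserted
val incoming = Rec(Some("value6"), None)      // key3 partial update, numberField unset

// Same-partition update: the payload merges with the stored record, so numberField survives.
val merged = overwriteNonDefaults(stored, incoming)  // Rec(Some(value6),Some(77))

// Partition change with update-partition-path enabled: routed as delete + insert,
// so there is no stored record to merge with and unset fields remain null.
val insertedAsNew = incoming                         // Rec(Some(value6),None)
```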
   
Case 2.
For the second case, the setup and procedure are exactly the same, but we instead use `HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key() -> "false"`.
   
All the fields are updated correctly -- in particular, numberField is not dropped. However, we see the inconsistency between `partitionField` and `_hoodie_partition_path`. The results before and after the partial update are shown below.
   
**Before partial update.**

```
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path   |_hoodie_file_name                                                         |keyField|stringField|numberField|precombineField    |partitionField|
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
|20220622114109089  |20220622114109089_0_3|keyField:key1     |partitionField=2022-05-07|3166a763-28e3-4352-a9a5-5403e2599bc5-0_0-220-196_20220622114117004.parquet|key1    |value3     |55         |2022-05-07 08:00:00|2022-05-07    |
|20220622114109089  |20220622114109089_0_4|keyField:key3     |partitionField=2022-05-07|3166a763-28e3-4352-a9a5-5403e2599bc5-0_0-220-196_20220622114117004.parquet|key3    |value4     |77         |2022-05-07 10:00:00|2022-05-07    |
|20220622114117004  |20220622114117004_0_4|keyField:key2     |partitionField=2022-05-07|3166a763-28e3-4352-a9a5-5403e2599bc5-0_0-220-196_20220622114117004.parquet|key2    |value4     |66         |2022-05-07 09:00:00|2022-05-07    |
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
```

**After partial update.**

```
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path   |_hoodie_file_name                                                         |keyField|stringField|numberField|precombineField    |partitionField|
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
|20220622114126357  |20220622114126357_0_5|keyField:key1     |partitionField=2022-05-07|3166a763-28e3-4352-a9a5-5403e2599bc5-0_0-273-237_20220622114126357.parquet|key1    |value5     |55         |2022-05-07 11:00:00|2022-05-07    |
|20220622114126357  |20220622114126357_0_6|keyField:key3     |partitionField=2022-05-07|3166a763-28e3-4352-a9a5-5403e2599bc5-0_0-273-237_20220622114126357.parquet|key3    |value6     |77         |2022-05-07 12:00:00|2022-05-08    |
|20220622114117004  |20220622114117004_0_4|keyField:key2     |partitionField=2022-05-07|3166a763-28e3-4352-a9a5-5403e2599bc5-0_0-273-237_20220622114126357.parquet|key2    |value4     |66         |2022-05-07 09:00:00|2022-05-07    |
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
```
   
   
   **Expected behavior**
   
Case 1.
In the first case, the hope is that the partial update overwrites any columns which are non-null in the incremental load, and that any columns which are null in the incremental load but non-null in storage are preserved rather than nullified (numberField is nullified in this case).
   
Case 2.
In this case, we want to confirm that this is expected behavior. In particular, we did not expect `partitionField` to be updated, so that it would remain consistent with `_hoodie_partition_path`. If this is indeed expected behavior, we would like to know how it may affect any partition pruning on reads from the dataset based on the partition column.

Currently, we see that a read filtered on `partitionField` returns the updated record, but we are unsure whether there is any other pruning on the backend which instead uses `_hoodie_partition_path` and which we should be aware of.
   
```
readGlobalDataset().where(col(partitionField) === targetGlobalPartition).show(false)

+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path   |_hoodie_file_name                                                         |keyField|stringField|numberField|precombineField    |partitionField|
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
|20220622100847473  |20220622100847473_0_6|keyField:key3     |partitionField=2022-05-07|1f801a9d-5284-4e89-be48-c64273a4af79-0_0-273-237_20220622100847473.parquet|key3    |value6     |77         |2022-05-07 12:00:00|2022-05-08    |
+-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
```
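To make the concern concrete, here is a toy model in plain Scala (not Hudi APIs): a filter on the data column finds the logically moved record, while anything keyed on the physical `_hoodie_partition_path` would not.

```scala
// Each row carries the physical partition (_hoodie_partition_path) and the
// logical data column (partitionField), mirroring the Case 2 output above.
case class Row(key: String, hoodiePartitionPath: String, partitionField: String)

val rows = Seq(
  Row("key1", "partitionField=2022-05-07", "2022-05-07"),
  Row("key2", "partitionField=2022-05-07", "2022-05-07"),
  Row("key3", "partitionField=2022-05-07", "2022-05-08"))  // moved logically, not physically

// Filtering on the data column finds the updated record:
val byDataColumn = rows.filter(_.partitionField == "2022-05-08").map(_.key)  // List(key3)

// Filtering on the physical partition path does not:
val byPartitionPath =
  rows.filter(_.hoodiePartitionPath == "partitionField=2022-05-08").map(_.key)  // List()
```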
   
   
   
   **Environment Description**
   
   * Hudi version : 0.11.0
   
   * Spark version : 2.4.8
   
   * Hive version : 2.3.7
   
   * Hadoop version : 2.10.1
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : No, running on Dataproc 1.5
   
   

