[GitHub] [iceberg] hililiwei commented on pull request #4316: Core:Remove unnecessary row filtering in deleted manifest file

GitBox Tue, 15 Mar 2022 02:07:46 -0700


hililiwei commented on pull request #4316:
URL: https://github.com/apache/iceberg/pull/4316#issuecomment-1067731623



   prerequisite:
   * Table:test_upsert_query
   ```SQL
        id INT NOT NULL, 
        province STRING NOT NULL, 
        dt DATE, 
        PRIMARY KEY(id,province) NOT ENFORCED
        PARTITIONED BY (province)
        WITH
        ('write.format.default'='PARQUET',
         'write.upsert.enabled'='true',
         'format-version'='2')
   ```
   
   1. Execute SQL:
   ```SQL
   INSERT INTO test_upsert_query 
   VALUES 
   (1, 'a', TO_DATE('2022-03-01')),
   (2, 'b', TO_DATE('2022-03-01')),
   (1, 'b', TO_DATE('2022-03-01'))
   ```
   
   Two manifest files are generated：
   * 5385d1b0-82d6-4328-8115-eb9761f68aef-m0.avro
   * 5385d1b0-82d6-4328-8115-eb9761f68aef-m1.avro
   
   m1.avo is a delete manifest file, view it:
   ```
   java -jar ~/plat/iceberg-tools-1.0-SNAPSHOT.jar manifest2json 
5385d1b0-82d6-4328-8115-eb9761f68aef-m1.avro
   
   {
                "status": 1,
                "snapshot_id": {
                        "long": 8012546586091949255
                },
                "sequence_number": null,
                "data_file": {
                        "content": 2,
                        "file_path": 
"file:/tmp/junit3813726988943192507/db.db/test_upsert_query/data/province=a/00000-0-5dc08ee2-d7e7-4348-a726-99ac0041c24f-00002.parquet",
                        "file_format": "PARQUET",
                        "partition": {
                                "province": {
                                        "string": "a"
                                }
                        },
                        …………
                        ……omitted non-critical information……
                        …………
                        "sort_order_id": {
                                "int": 0
                        }
                }
        }
        {
                "status": 1,
                "snapshot_id": {
                        "long": 8012546586091949255
                },
                "sequence_number": null,
                "data_file": {
                        "content": 2,
                        "file_path": 
"file:/tmp/junit3813726988943192507/db.db/test_upsert_query/data/province=b/00000-0-5dc08ee2-d7e7-4348-a726-99ac0041c24f-00004.parquet",
                        "file_format": "PARQUET",
                        "partition": {
                                "province": {
                                        "string": "b"
                                }
                        },
                        "record_count": 2,
                        "file_size_in_bytes": 1053,
                        "column_sizes": {
                                "array": [{
                                        "key": 1,
                                        "value": 49
                                }, {
                                        "key": 2,
                                        "value": 89
                                }, {
                                        "key": 3,
                                        "value": 94
                                }]
                        },
                        "value_counts": {
                                "array": [{
                                        "key": 1,
                                        "value": 2
                                }, {
                                        "key": 2,
                                        "value": 2
                                }, {
                                        "key": 3,
                                        "value": 2
                                }]
                        },
                        "null_value_counts": {
                                "array": [{
                                        "key": 1,
                                        "value": 0
                                }, {
                                        "key": 2,
                                        "value": 0
                                }, {
                                        "key": 3,
                                        "value": 0
                                }]
                        },
                        "nan_value_counts": {
                                "array": []
                        },
                        "lower_bounds": {
                                "array": [{
                                        "key": 1,
                                        "value": "1"
                                }, {
                                        "key": 2,
                                        "value": "b"
                                }, {
                                        "key": 3,
                                        "value": "19052"
                                }]
                        },
                        "upper_bounds": {
                                "array": [{
                                        "key": 1,
                                        "value": "2"
                                }, {
                                        "key": 2,
                                        "value": "b"
                                }, {
                                        "key": 3,
                                        "value": "19052"
                                }]
                        },
                        "key_metadata": null,
                        "split_offsets": null,
                        "equality_ids": {
                                "array": [1, 2]
                        },
                        "sort_order_id": {
                                "int": 0
                        }
                }
        }
   
   ```
   
   Notice `upper_bounds`\ `lower_bounds`. We can see that the upper/lower value 
of the dt field(key=3) is 19052. 
   **Remember this 19052.**
   
   Execute upsert：
   ```SQL
   INSERT INTO test_upsert_query 
   VALUES 
   (4, 'a', TO_DATE('2022-03-02')),
   (5, 'b', TO_DATE('2022-03-02')),
   (1, 'b', TO_DATE('2022-03-02'))
   ```
   **The dt of (1,'b') is updated to '2022-03-02' (:19053) .**
   
   Check again. The following two manifest files are displayed:：
   * c3fd1626-d26f-4067-b4b0-a245d59a0615-m0.avro
   * c3fd1626-d26f-4067-b4b0-a245d59a0615-m1.avro
   
   This time, we're still just looking at 
c3fd1626-d26f-4067-b4b0-a245d59a0615-m1.avro:
   ```
   java -jar ~/plat/iceberg-tools-1.0-SNAPSHOT.jar manifest2json 
c3fd1626-d26f-4067-b4b0-a245d59a0615-m1.avro
   {
                "status": 1,
                "snapshot_id": {
                        "long": 4408783096627784376
                },
                "sequence_number": null,
                "data_file": {
                        "content": 2,
                        "file_path": 
"file:/tmp/junit3813726988943192507/db.db/test_upsert_query/data/province=a/00000-0-9749ef4a-b579-43f5-acbb-ad41e32be3c8-00002.parquet",
                        "file_format": "PARQUET",
                        "partition": {
                                "province": {
                                        "string": "a"
                                }
                        },
                        "record_count": 1,
                        …………
                        ……omitted non-critical information……
                        …………
                        "sort_order_id": {
                                "int": 0
                        }
                }
        }
        {
                "status": 1,
                "snapshot_id": {
                        "long": 4408783096627784376
                },
                "sequence_number": null,
                "data_file": {
                        "content": 2,
                        "file_path": 
"file:/tmp/junit3813726988943192507/db.db/test_upsert_query/data/province=b/00000-0-9749ef4a-b579-43f5-acbb-ad41e32be3c8-00004.parquet",
                        "file_format": "PARQUET",
                        "partition": {
                                "province": {
                                        "string": "b"
                                }
                        },
                        "record_count": 2,
                        "file_size_in_bytes": 1053,
                        "column_sizes": {
                                "array": [{
                                        "key": 1,
                                        "value": 49
                                }, {
                                        "key": 2,
                                        "value": 89
                                }, {
                                        "key": 3,
                                        "value": 94
                                }]
                        },
                        "value_counts": {
                                "array": [{
                                        "key": 1,
                                        "value": 2
                                }, {
                                        "key": 2,
                                        "value": 2
                                }, {
                                        "key": 3,
                                        "value": 2
                                }]
                        },
                        "null_value_counts": {
                                "array": [{
                                        "key": 1,
                                        "value": 0
                                }, {
                                        "key": 2,
                                        "value": 0
                                }, {
                                        "key": 3,
                                        "value": 0
                                }]
                        },
                        "nan_value_counts": {
                                "array": []
                        },
                        "lower_bounds": {
                                "array": [{
                                        "key": 1,
                                        "value": "1"
                                }, {
                                        "key": 2,
                                        "value": "b"
                                }, {
                                        "key": 3,
                                        "value": "19053"
                                }]
                        },
                        "upper_bounds": {
                                "array": [{
                                        "key": 1,
                                        "value": "5"
                                }, {
                                        "key": 2,
                                        "value": "b"
                                }, {
                                        "key": 3,
                                        "value": "19053"
                                }]
                        },
                        "key_metadata": null,
                        "split_offsets": null,
                        "equality_ids": {
                                "array": [1, 2]
                        },
                        "sort_order_id": {
                                "int": 0
                        }
                }
        }
   
   ```
   
   Now, the value of dt (key=3) in `lower_bounds` is **19053**.
   
   If we  query data at this time:
   ```SQL
   SELECT * FROM test_upsert_query WHERE dt < '2022-03-02'
   ```
   
   During the query, the manifest file is filtered based on the values of 
metric `lower_bounds`\ `upper_bounds`\ etc.
   In the second delete manifest file, `lower_bounds` show us that the min of 
dt is **19053**, which is equal to '2022-03-02'. As a result, the file is 
ignored.
   
   So in the result, it will contain `(1,'b',TO_DATE('2022-03-01'))`.
   
   In this PR, I tried to trim the filter predicate, and only the field in 
`equality_ids` would take effect.
   
   If the process and result of our analysis are wrong, please do not hesitate 
to tell me.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] hililiwei commented on pull request #4316: Core:Remove unnecessary row filtering in deleted manifest file

Reply via email to