hililiwei commented on pull request #4316: URL: https://github.com/apache/iceberg/pull/4316#issuecomment-1067731623
prerequisite:
* Table:test_upsert_query
```SQL
id INT NOT NULL,
province STRING NOT NULL,
dt DATE,
PRIMARY KEY(id,province) NOT ENFORCED
PARTITIONED BY (province)
WITH
('write.format.default'='PARQUET',
'write.upsert.enabled'='true',
'format-version'='2')
```
1. Execute SQL:
```SQL
INSERT INTO test_upsert_query
VALUES
(1, 'a', TO_DATE('2022-03-01')),
(2, 'b', TO_DATE('2022-03-01')),
(1, 'b', TO_DATE('2022-03-01'))
```
Two manifest files are generated:
* 5385d1b0-82d6-4328-8115-eb9761f68aef-m0.avro
* 5385d1b0-82d6-4328-8115-eb9761f68aef-m1.avro
m1.avo is a delete manifest file, view it:
```
java -jar ~/plat/iceberg-tools-1.0-SNAPSHOT.jar manifest2json
5385d1b0-82d6-4328-8115-eb9761f68aef-m1.avro
{
"status": 1,
"snapshot_id": {
"long": 8012546586091949255
},
"sequence_number": null,
"data_file": {
"content": 2,
"file_path":
"file:/tmp/junit3813726988943192507/db.db/test_upsert_query/data/province=a/00000-0-5dc08ee2-d7e7-4348-a726-99ac0041c24f-00002.parquet",
"file_format": "PARQUET",
"partition": {
"province": {
"string": "a"
}
},
…………
……omitted non-critical information……
…………
"sort_order_id": {
"int": 0
}
}
}
{
"status": 1,
"snapshot_id": {
"long": 8012546586091949255
},
"sequence_number": null,
"data_file": {
"content": 2,
"file_path":
"file:/tmp/junit3813726988943192507/db.db/test_upsert_query/data/province=b/00000-0-5dc08ee2-d7e7-4348-a726-99ac0041c24f-00004.parquet",
"file_format": "PARQUET",
"partition": {
"province": {
"string": "b"
}
},
"record_count": 2,
"file_size_in_bytes": 1053,
"column_sizes": {
"array": [{
"key": 1,
"value": 49
}, {
"key": 2,
"value": 89
}, {
"key": 3,
"value": 94
}]
},
"value_counts": {
"array": [{
"key": 1,
"value": 2
}, {
"key": 2,
"value": 2
}, {
"key": 3,
"value": 2
}]
},
"null_value_counts": {
"array": [{
"key": 1,
"value": 0
}, {
"key": 2,
"value": 0
}, {
"key": 3,
"value": 0
}]
},
"nan_value_counts": {
"array": []
},
"lower_bounds": {
"array": [{
"key": 1,
"value": "1"
}, {
"key": 2,
"value": "b"
}, {
"key": 3,
"value": "19052"
}]
},
"upper_bounds": {
"array": [{
"key": 1,
"value": "2"
}, {
"key": 2,
"value": "b"
}, {
"key": 3,
"value": "19052"
}]
},
"key_metadata": null,
"split_offsets": null,
"equality_ids": {
"array": [1, 2]
},
"sort_order_id": {
"int": 0
}
}
}
```
Notice `upper_bounds`\ `lower_bounds`. We can see that the upper/lower value
of the dt field(key=3) is 19052.
**Remember this 19052.**
Execute upsert:
```SQL
INSERT INTO test_upsert_query
VALUES
(4, 'a', TO_DATE('2022-03-02')),
(5, 'b', TO_DATE('2022-03-02')),
(1, 'b', TO_DATE('2022-03-02'))
```
**The dt of (1,'b') is updated to '2022-03-02' (:19053) .**
Check again. The following two manifest files are displayed::
* c3fd1626-d26f-4067-b4b0-a245d59a0615-m0.avro
* c3fd1626-d26f-4067-b4b0-a245d59a0615-m1.avro
This time, we're still just looking at
c3fd1626-d26f-4067-b4b0-a245d59a0615-m1.avro:
```
java -jar ~/plat/iceberg-tools-1.0-SNAPSHOT.jar manifest2json
c3fd1626-d26f-4067-b4b0-a245d59a0615-m1.avro
{
"status": 1,
"snapshot_id": {
"long": 4408783096627784376
},
"sequence_number": null,
"data_file": {
"content": 2,
"file_path":
"file:/tmp/junit3813726988943192507/db.db/test_upsert_query/data/province=a/00000-0-9749ef4a-b579-43f5-acbb-ad41e32be3c8-00002.parquet",
"file_format": "PARQUET",
"partition": {
"province": {
"string": "a"
}
},
"record_count": 1,
…………
……omitted non-critical information……
…………
"sort_order_id": {
"int": 0
}
}
}
{
"status": 1,
"snapshot_id": {
"long": 4408783096627784376
},
"sequence_number": null,
"data_file": {
"content": 2,
"file_path":
"file:/tmp/junit3813726988943192507/db.db/test_upsert_query/data/province=b/00000-0-9749ef4a-b579-43f5-acbb-ad41e32be3c8-00004.parquet",
"file_format": "PARQUET",
"partition": {
"province": {
"string": "b"
}
},
"record_count": 2,
"file_size_in_bytes": 1053,
"column_sizes": {
"array": [{
"key": 1,
"value": 49
}, {
"key": 2,
"value": 89
}, {
"key": 3,
"value": 94
}]
},
"value_counts": {
"array": [{
"key": 1,
"value": 2
}, {
"key": 2,
"value": 2
}, {
"key": 3,
"value": 2
}]
},
"null_value_counts": {
"array": [{
"key": 1,
"value": 0
}, {
"key": 2,
"value": 0
}, {
"key": 3,
"value": 0
}]
},
"nan_value_counts": {
"array": []
},
"lower_bounds": {
"array": [{
"key": 1,
"value": "1"
}, {
"key": 2,
"value": "b"
}, {
"key": 3,
"value": "19053"
}]
},
"upper_bounds": {
"array": [{
"key": 1,
"value": "5"
}, {
"key": 2,
"value": "b"
}, {
"key": 3,
"value": "19053"
}]
},
"key_metadata": null,
"split_offsets": null,
"equality_ids": {
"array": [1, 2]
},
"sort_order_id": {
"int": 0
}
}
}
```
Now, the value of dt (key=3) in `lower_bounds` is **19053**.
If we query data at this time:
```SQL
SELECT * FROM test_upsert_query WHERE dt < '2022-03-02'
```
During the query, the manifest file is filtered based on the values of
metric `lower_bounds`\ `upper_bounds`\ etc.
In the second delete manifest file, `lower_bounds` show us that the min of
dt is **19053**, which is equal to '2022-03-02'. As a result, the file is
ignored.
So in the result, it will contain `(1,'b',TO_DATE('2022-03-01'))`.
In this PR, I tried to trim the filter predicate, and only the field in
`equality_ids` would take effect.
If the process and result of our analysis are wrong, please do not hesitate
to tell me.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
