aokolnychyi commented on code in PR #8346:
URL: https://github.com/apache/iceberg/pull/8346#discussion_r1301800846
##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -123,7 +170,19 @@ public boolean canMerge(ScanTask other) {
@Override
public SplitScanTask merge(ScanTask other) {
SplitScanTask that = (SplitScanTask) other;
- return new SplitScanTask(offset, len + that.length(), fileScanTask);
+ return new SplitScanTask(offset, len + that.length(), fileScanTask,
deletesSizeBytes);
Review Comment:
When `SplitScanTask` is created from `BaseFileScanTask` during planning, we
are using `deleteSizeBytes()` to trigger the computation once per
`BaseFileScanTask` instead of once per `SplitScanTask` (the number of split
tasks usually matches the number of row groups). This is because each
`SplitScanTask` has the same set of deletes as the parent task it was created
from. After planning, adjacent `SplitScanTask`s in the same bin are merged
together. While merging we are using `deleteSizeBytes` variable (not method),
which should be already populated if the split tasks were created from the
parent task. If the variable is not populated, it means tasks were created via
a separate process (like parsing). In that case, we don't know whether it is
beneficial to compute `deleteSizeBytes` so we skip computing it unless
requested.
##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -123,7 +170,19 @@ public boolean canMerge(ScanTask other) {
@Override
public SplitScanTask merge(ScanTask other) {
SplitScanTask that = (SplitScanTask) other;
- return new SplitScanTask(offset, len + that.length(), fileScanTask);
+ return new SplitScanTask(offset, len + that.length(), fileScanTask,
deletesSizeBytes);
Review Comment:
When `SplitScanTask` is created from `BaseFileScanTask` during planning, we
are using `deleteSizeBytes()` to trigger the computation once per
`BaseFileScanTask` instead of once per `SplitScanTask` (the number of split
tasks usually matches the number of row groups). This is because each
`SplitScanTask` has the same set of deletes as the parent task it was created
from. After planning, adjacent `SplitScanTask`s in the same bin are merged
together. While merging we are using `deleteSizeBytes` variable (not method),
which should be already populated if the split tasks were created from the
parent task. If the variable is not populated, it means tasks were created via
a separate process (like parsing). In that case, we don't know whether it is
beneficial to compute the size of deletes so we skip computing it unless
requested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]