[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #8346: Core: Optimize computing user-facing state in data task

via GitHub Tue, 22 Aug 2023 08:10:47 -0700


aokolnychyi commented on code in PR #8346:
URL: https://github.com/apache/iceberg/pull/8346#discussion_r1301800846



##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -123,7 +170,19 @@ public boolean canMerge(ScanTask other) {
     @Override
     public SplitScanTask merge(ScanTask other) {
       SplitScanTask that = (SplitScanTask) other;
-      return new SplitScanTask(offset, len + that.length(), fileScanTask);
+      return new SplitScanTask(offset, len + that.length(), fileScanTask, 
deletesSizeBytes);

Review Comment:
   When `SplitScanTask` is created from `BaseFileScanTask` during planning, we 
are using `deleteSizeBytes()` to trigger the computation once per 
`BaseFileScanTask` instead of once per `SplitScanTask` (the number of split 
tasks usually matches the number of row groups). This is because each 
`SplitScanTask` has the same set of deletes as the parent task it was created 
from. After planning, adjacent `SplitScanTask`s in the same bin are merged 
together. While merging we are using `deleteSizeBytes` variable (not method), 
which should be already populated if the split tasks were created from the 
parent task. If the variable is not populated, it means tasks were created via 
a separate process (like parsing). In that case, we don't know whether it is 
beneficial to compute `deleteSizeBytes` so we skip computing it unless 
requested. 



##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -123,7 +170,19 @@ public boolean canMerge(ScanTask other) {
     @Override
     public SplitScanTask merge(ScanTask other) {
       SplitScanTask that = (SplitScanTask) other;
-      return new SplitScanTask(offset, len + that.length(), fileScanTask);
+      return new SplitScanTask(offset, len + that.length(), fileScanTask, 
deletesSizeBytes);

Review Comment:
   When `SplitScanTask` is created from `BaseFileScanTask` during planning, we 
are using `deleteSizeBytes()` to trigger the computation once per 
`BaseFileScanTask` instead of once per `SplitScanTask` (the number of split 
tasks usually matches the number of row groups). This is because each 
`SplitScanTask` has the same set of deletes as the parent task it was created 
from. After planning, adjacent `SplitScanTask`s in the same bin are merged 
together. While merging we are using `deleteSizeBytes` variable (not method), 
which should be already populated if the split tasks were created from the 
parent task. If the variable is not populated, it means tasks were created via 
a separate process (like parsing). In that case, we don't know whether it is 
beneficial to compute the size of deletes so we skip computing it unless 
requested. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #8346: Core: Optimize computing user-facing state in data task

Reply via email to