aokolnychyi commented on code in PR #8346:
URL: https://github.com/apache/iceberg/pull/8346#discussion_r1300592374
##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -45,31 +49,67 @@ protected FileScanTask self() {
@Override
protected FileScanTask newSplitTask(FileScanTask parentTask, long offset,
long length) {
- return new SplitScanTask(offset, length, parentTask);
+ return new SplitScanTask(offset, length, deletesSizeBytes(), parentTask);
}
@Override
public List<DeleteFile> deletes() {
- return ImmutableList.copyOf(deletes);
+ if (deletesAsList == null) {
+ this.deletesAsList = ImmutableList.copyOf(deletes);
+ }
+
+ return deletesAsList;
+ }
+
+ @Override
+ public long sizeBytes() {
+ return length() + deletesSizeBytes();
+ }
+
+ @Override
+ public int filesCount() {
+ return 1 + deletes.length;
}
@Override
public Schema schema() {
return super.schema();
}
+ private long deletesSizeBytes() {
+ if (deletesSizeBytes == null) {
Review Comment:
It is not the size but rather the need to serialize extra values. If there
are 1M files and each of them has 4 row groups, that's 4M values to serialize
on the driver. If we read 10M files, that's around 40M extra values.
The new approach should avoid both the serialization and the boxing
overhead.
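The lazy-caching pattern in the diff above can be sketched as follows. This is a simplified illustration, not the actual Iceberg code: `DeleteFile` here is a hypothetical stand-in record, and the `transient` marking reflects the idea in the comment (the cached sum is derivable, so it need not be serialized per task).

```java
import java.util.Arrays;

class ScanTaskSketch {
  // Hypothetical stand-in for Iceberg's DeleteFile; only the size matters here.
  record DeleteFile(long fileSizeInBytes) {}

  private final DeleteFile[] deletes;

  // Boxed Long so "not yet computed" (null) is distinguishable from a real 0.
  // Marked transient so the derived sum is never serialized with the task.
  private transient Long deletesSizeBytes = null;

  ScanTaskSketch(DeleteFile[] deletes) {
    this.deletes = deletes;
  }

  long deletesSizeBytes() {
    if (deletesSizeBytes == null) {
      // Compute once on first access instead of precomputing and
      // serializing one extra value per file on the driver.
      this.deletesSizeBytes = Arrays.stream(deletes)
          .mapToLong(DeleteFile::fileSizeInBytes)
          .sum();
    }
    return deletesSizeBytes;
  }
}
```

The boxing cost is paid only for the single cached field, not for one value per row group, which is the overhead the comment is calling out.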
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]