advancedxy commented on code in PR #8346:
URL: https://github.com/apache/iceberg/pull/8346#discussion_r1300790470


##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -45,31 +49,67 @@ protected FileScanTask self() {
 
   @Override
  protected FileScanTask newSplitTask(FileScanTask parentTask, long offset, long length) {
-    return new SplitScanTask(offset, length, parentTask);
+    return new SplitScanTask(offset, length, deletesSizeBytes(), parentTask);
   }
 
   @Override
   public List<DeleteFile> deletes() {
-    return ImmutableList.copyOf(deletes);
+    if (deletesAsList == null) {
+      this.deletesAsList = ImmutableList.copyOf(deletes);
+    }
+
+    return deletesAsList;
+  }
+
+  @Override
+  public long sizeBytes() {
+    return length() + deletesSizeBytes();
+  }
+
+  @Override
+  public int filesCount() {
+    return 1 + deletes.length;
   }
 
   @Override
   public Schema schema() {
     return super.schema();
   }
 
+  private long deletesSizeBytes() {
+    if (deletesSizeBytes == null) {

Review Comment:
   Just checked the Spark code (Flink probably couldn't handle that scale); it seems I was misunderstanding how tasks are serialized. I used to think only the specific task for one partition is serialized and sent to the executor.
   
   It turns out the RDD is serialized as a whole task binary for each task and sent to the executors, so even one extra field definitely adds a lot of overhead.
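   To illustrate the concern, here's a minimal sketch (hypothetical task classes, not Iceberg's actual `SplitScanTask`) showing how one extra cached field shows up in every serialized copy of a task when the whole object graph is shipped per partition:
   
   ```java
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import java.io.ObjectOutputStream;
   import java.io.Serializable;
   
   public class SerializedSizeDemo {
     // Hypothetical stand-ins for a scan task, before and after adding a cached field.
     static class SlimTask implements Serializable {
       long offset;
       long length;
     }
   
     static class TaskWithCachedField implements Serializable {
       long offset;
       long length;
       long deletesSizeBytes; // the one extra field under discussion
     }
   
     // Measure the Java-serialized size of an object, the way a task binary would grow.
     static int serializedSize(Serializable obj) throws IOException {
       ByteArrayOutputStream bytes = new ByteArrayOutputStream();
       try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
         out.writeObject(obj);
       }
       return bytes.size();
     }
   
     public static void main(String[] args) throws IOException {
       int slim = serializedSize(new SlimTask());
       int fat = serializedSize(new TaskWithCachedField());
       System.out.println("slim=" + slim + " fat=" + fat);
       // The extra long is paid once per serialized task; if the engine ships
       // the whole task binary for each partition, the overhead multiplies.
     }
   }
   ```
   
   This uses plain `java.io` serialization rather than Spark's closure serializer, so the absolute numbers differ, but the per-copy growth is the point.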



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

