[GitHub] [iceberg] advancedxy commented on a diff in pull request #8346: Core: Lazily cache user-facing state in BaseFileScanTask

via GitHub Thu, 17 Aug 2023 23:50:23 -0700


advancedxy commented on code in PR #8346:
URL: https://github.com/apache/iceberg/pull/8346#discussion_r1298054453



##########
core/src/main/java/org/apache/iceberg/BaseContentScanTask.java:
##########
@@ -92,6 +93,15 @@ public Expression residual() {
     return residuals.residualFor(file.partition());
   }
 
+  @Override
+  public long estimatedRowsCount() {

Review Comment:
   I'm not sure about this. `estimateRowsCount` should rarely be called 
multiple times? Do we really need to lazily cache this? 
   
   I used to like doing a lot of micro optimizations, however I do appreciate 
that `premature optimization is the root of all evil` these days more and more. 
   So in my opinion, if you noticed/spotted this piece of code is the 
bottleneck(such as in a flame graph), I think this is a nice optimization, 
otherwise, we can left it as it is.



##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -28,6 +28,10 @@ public class BaseFileScanTask extends 
BaseContentScanTask<FileScanTask, DataFile
     implements FileScanTask {
   private final DeleteFile[] deletes;
 
+  // lazy variables
+  private transient volatile List<DeleteFile> deletesAsList = null;
+  private transient volatile Long deletesSizeBytes = null;

Review Comment:
   Same as the long vs Long in `BaseContentScanTask`, I would prefer long 
instead.
   
   Java's `Long` adds at least 16/24 bytes class header overhead compared to 
`long`.
   There is also boxing overhead when returning as long.



##########
core/src/main/java/org/apache/iceberg/BaseContentScanTask.java:
##########
@@ -92,6 +93,15 @@ public Expression residual() {
     return residuals.residualFor(file.partition());
   }
 
+  @Override
+  public long estimatedRowsCount() {

Review Comment:
   By the way, putting premature optimization aside, If I am micro-optimization 
this. I would probably choose:
   
   ```java
   private transient volatile long estimatedRowsCount = -1; // or 
Long.MIN_VALUE 
   
   public long estimatedRowsCount() {
     if (estimatedRowCount == -1) {
       this.estimatedRowsCount = ContentScanTask.super.estimatedRowsCount();
     }
     return estimatedRowsCount
   }
   ```



##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -45,31 +49,67 @@ protected FileScanTask self() {
 
   @Override
   protected FileScanTask newSplitTask(FileScanTask parentTask, long offset, 
long length) {
-    return new SplitScanTask(offset, length, parentTask);
+    return new SplitScanTask(offset, length, deletesSizeBytes(), parentTask);
   }
 
   @Override
   public List<DeleteFile> deletes() {
-    return ImmutableList.copyOf(deletes);
+    if (deletesAsList == null) {
+      this.deletesAsList = ImmutableList.copyOf(deletes);
+    }
+
+    return deletesAsList;
+  }
+
+  @Override
+  public long sizeBytes() {
+    return length() + deletesSizeBytes();
+  }
+
+  @Override
+  public int filesCount() {
+    return 1 + deletes.length;
   }
 
   @Override
   public Schema schema() {
     return super.schema();
   }
 
+  private long deletesSizeBytes() {
+    if (deletesSizeBytes == null) {

Review Comment:
   did another look at this, how about we calculate this `deleteSizeBytes` 
fields in the constructor directly?
   
   No need to special check a -1 or Long.MIN_VALUE any more...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] advancedxy commented on a diff in pull request #8346: Core: Lazily cache user-facing state in BaseFileScanTask

Reply via email to