advancedxy commented on code in PR #8346:
URL: https://github.com/apache/iceberg/pull/8346#discussion_r1298054453
##########
core/src/main/java/org/apache/iceberg/BaseContentScanTask.java:
##########
@@ -92,6 +93,15 @@ public Expression residual() {
return residuals.residualFor(file.partition());
}
+ @Override
+ public long estimatedRowsCount() {
Review Comment:
I'm not sure about this. `estimateRowsCount` should rarely be called
multiple times? Do we really need to lazily cache this?
I used to like doing a lot of micro optimizations, however I do appreciate
that `premature optimization is the root of all evil` these days more and more.
So in my opinion, if you noticed/spotted this piece of code is the
bottleneck(such as in a flame graph), I think this is a nice optimization,
otherwise, we can left it as it is.
##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -28,6 +28,10 @@ public class BaseFileScanTask extends
BaseContentScanTask<FileScanTask, DataFile
implements FileScanTask {
private final DeleteFile[] deletes;
+ // lazy variables
+ private transient volatile List<DeleteFile> deletesAsList = null;
+ private transient volatile Long deletesSizeBytes = null;
Review Comment:
Same as the long vs Long in `BaseContentScanTask`, I would prefer long
instead.
Java's `Long` adds at least 16/24 bytes class header overhead compared to
`long`.
There is also boxing overhead when returning as long.
##########
core/src/main/java/org/apache/iceberg/BaseContentScanTask.java:
##########
@@ -92,6 +93,15 @@ public Expression residual() {
return residuals.residualFor(file.partition());
}
+ @Override
+ public long estimatedRowsCount() {
Review Comment:
By the way, putting premature optimization aside, If I am micro-optimization
this. I would probably choose:
```java
private transient volatile long estimatedRowsCount = -1; // or
Long.MIN_VALUE
public long estimatedRowsCount() {
if (estimatedRowCount == -1) {
this.estimatedRowsCount = ContentScanTask.super.estimatedRowsCount();
}
return estimatedRowsCount
}
```
##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -45,31 +49,67 @@ protected FileScanTask self() {
@Override
protected FileScanTask newSplitTask(FileScanTask parentTask, long offset,
long length) {
- return new SplitScanTask(offset, length, parentTask);
+ return new SplitScanTask(offset, length, deletesSizeBytes(), parentTask);
}
@Override
public List<DeleteFile> deletes() {
- return ImmutableList.copyOf(deletes);
+ if (deletesAsList == null) {
+ this.deletesAsList = ImmutableList.copyOf(deletes);
+ }
+
+ return deletesAsList;
+ }
+
+ @Override
+ public long sizeBytes() {
+ return length() + deletesSizeBytes();
+ }
+
+ @Override
+ public int filesCount() {
+ return 1 + deletes.length;
}
@Override
public Schema schema() {
return super.schema();
}
+ private long deletesSizeBytes() {
+ if (deletesSizeBytes == null) {
Review Comment:
did another look at this, how about we calculate this `deleteSizeBytes`
fields in the constructor directly?
No need to special check a -1 or Long.MIN_VALUE any more...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]