aokolnychyi commented on code in PR #8346:
URL: https://github.com/apache/iceberg/pull/8346#discussion_r1300592374
##########
core/src/main/java/org/apache/iceberg/BaseFileScanTask.java:
##########
@@ -45,31 +49,67 @@ protected FileScanTask self() {
@Override
protected FileScanTask newSplitTask(FileScanTask parentTask, long offset,
long length) {
- return new SplitScanTask(offset, length, parentTask);
+ return new SplitScanTask(offset, length, deletesSizeBytes(), parentTask);
}
@Override
public List<DeleteFile> deletes() {
- return ImmutableList.copyOf(deletes);
+ if (deletesAsList == null) {
+ this.deletesAsList = ImmutableList.copyOf(deletes);
+ }
+
+ return deletesAsList;
+ }
+
+ @Override
+ public long sizeBytes() {
+ return length() + deletesSizeBytes();
+ }
+
+ @Override
+ public int filesCount() {
+ return 1 + deletes.length;
}
@Override
public Schema schema() {
return super.schema();
}
+ private long deletesSizeBytes() {
+ if (deletesSizeBytes == null) {
Review Comment:
It is not the size but rather the need to serialize extra values. If there
are 1M files and each of them has 4 row groups, that's 4M values to serialize
on the driver. If we read 10M files, that's around 40M extra values.
The new approach should avoid both the serialization and the boxing
overhead.
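The lazy-caching pattern in the diff above can be sketched as follows. This is a simplified illustration, not the actual Iceberg code: `DeleteFile` here is a hypothetical stand-in record, and the `transient` marking reflects the idea in the comment (the cached sum is derivable, so it need not be serialized per task).

```java
import java.util.Arrays;

class ScanTaskSketch {
  // Hypothetical stand-in for Iceberg's DeleteFile; only the size matters here.
  record DeleteFile(long fileSizeInBytes) {}

  private final DeleteFile[] deletes;

  // Boxed Long so "not yet computed" (null) is distinguishable from a real 0.
  // Marked transient so the derived sum is never serialized with the task.
  private transient Long deletesSizeBytes = null;

  ScanTaskSketch(DeleteFile[] deletes) {
    this.deletes = deletes;
  }

  long deletesSizeBytes() {
    if (deletesSizeBytes == null) {
      // Compute once on first access instead of precomputing and
      // serializing one extra value per file on the driver.
      this.deletesSizeBytes = Arrays.stream(deletes)
          .mapToLong(DeleteFile::fileSizeInBytes)
          .sum();
    }
    return deletesSizeBytes;
  }
}
```

The boxing cost is paid only for the single cached field, not for one value per row group, which is the overhead the comment is calling out.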
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]