danielcweeks commented on code in PR #7688:
URL: https://github.com/apache/iceberg/pull/7688#discussion_r1207327657
##########
core/src/main/java/org/apache/iceberg/BaseScan.java:
##########
@@ -256,4 +265,95 @@ private static Schema lazyColumnProjection(TableScanContext context, Schema sche
public ThisT metricsReporter(MetricsReporter reporter) {
return newRefinedScan(table(), schema(), context().reportWith(reporter));
}
+
+ private Optional<Long> adaptiveSplitSize(long tableSplitSize) {
+ if (!PropertyUtil.propertyAsBoolean(
+ table.properties(),
+ TableProperties.ADAPTIVE_SPLIT_PLANNING,
+ TableProperties.ADAPTIVE_SPLIT_PLANNING_DEFAULT)) {
+ return Optional.empty();
+ }
+
+ int minParallelism =
+ PropertyUtil.propertyAsInt(
+ table.properties(),
+ TableProperties.SPLIT_MIN_PARALLELISM,
+ TableProperties.SPLIT_MIN_PARALLELISM_DEFAULT);
+
+ Preconditions.checkArgument(minParallelism > 0, "Minimum parallelism must be a positive value");
+
+ Snapshot snapshot =
+ Stream.of(context.snapshotId(), context.toSnapshotId())
+ .filter(Objects::nonNull)
+ .map(table::snapshot)
+ .findFirst()
+ .orElseGet(table::currentSnapshot);
+
+ if (snapshot == null || snapshot.summary() == null) {
+ return Optional.empty();
+ }
+
+ Map<String, String> summary = snapshot.summary();
+ long totalFiles =
+ PropertyUtil.propertyAsLong(summary, SnapshotSummary.TOTAL_DATA_FILES_PROP, 0);
+ long totalSize = PropertyUtil.propertyAsLong(summary, SnapshotSummary.TOTAL_FILE_SIZE_PROP, 0);
Review Comment:
I think we might be able to combine these two approaches in a reasonable way
that's more generalizable.
> I don't think looking at the total snapshot size or even partition stats
will be that representative. In my view, knowing the amount of data we scan in
a particular query and the number of slots in the cluster is critical. That's
why I thought we would implement this feature at a higher level.
I agree with this. However, the most common places where this is a problem
are really simple cases: unpartitioned tables with very little data. This
approach effectively only takes effect if the table size is less than
`minParallelism * splitSize`, so pretty much anything over a couple of GB
wouldn't be affected.
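A rough sketch of that threshold, not the PR's actual implementation. The class and method names are hypothetical, and it assumes the adjustment simply divides the total size across `minParallelism` splits when the table is below `minParallelism * splitSize`:

```java
final class AdaptiveSplitSketch {
  private AdaptiveSplitSketch() {}

  // Hypothetical helper: returns the split size to plan with.
  // Tables at or above minParallelism * tableSplitSizeBytes keep the configured size.
  static long effectiveSplitSize(
      long totalSizeBytes, long tableSplitSizeBytes, int minParallelism) {
    if (minParallelism <= 0) {
      throw new IllegalArgumentException("Minimum parallelism must be a positive value");
    }

    long threshold = Math.multiplyExact(tableSplitSizeBytes, (long) minParallelism);
    if (totalSizeBytes <= 0 || totalSizeBytes >= threshold) {
      // Unknown size, or the table already yields enough splits: no adjustment.
      return tableSplitSizeBytes;
    }

    // Small table: shrink the split size so planning yields about minParallelism splits.
    long adjusted = (totalSizeBytes + minParallelism - 1) / minParallelism;
    return Math.max(1L, adjusted);
  }
}
```

With a 128 MB split size and a minimum parallelism of 8, the threshold is 1 GB: a 256 MB table would be planned with 32 MB splits, while a 4 GB table keeps 128 MB splits. A real implementation would probably also want a lower bound on the adjusted size.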
> Whenever we scan huge tables, we see a huge difference between 128MB and
let's say 512MB or 1GB split size.
We've seen this in a lot of cases, and you may even want to adjust to larger
split sizes if you're projecting fewer or smaller columns, because the
calculated splits are based on the whole row group size, but processing a few
int columns can be much faster than processing string columns.
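One way to picture the projection effect, purely as an illustration: scale the split size by the inverse of the fraction of row group bytes the projected columns account for, up to a cap. The helper name, the `projectedFraction` input, and the cap are all assumptions; nothing like this exists in the PR.

```java
final class ProjectionAwareSplitSketch {
  private ProjectionAwareSplitSketch() {}

  // Hypothetical helper: projectedFraction is the share of row group bytes
  // that the projected columns are estimated to occupy, in (0, 1].
  static long scaledSplitSize(
      long baseSplitSizeBytes, double projectedFraction, long maxSplitSizeBytes) {
    if (!(projectedFraction > 0.0 && projectedFraction <= 1.0)) {
      throw new IllegalArgumentException("Projected fraction must be in (0, 1]");
    }

    // Reading 25% of each row group means a split covers 4x the file range
    // for the same amount of bytes actually processed.
    double scaled = baseSplitSizeBytes / projectedFraction;
    return (long) Math.min(scaled, (double) maxSplitSizeBytes);
  }
}
```

For example, with a 128 MB base size and columns covering a quarter of the row group bytes, this gives 512 MB splits; a 5% projection would hit a 1 GB cap.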
> Relying on a table property for parallelism seems like shifting the
complexity of tuning the split size. It varies from query to query and from
cluster to cluster.
I agree here as well, but I was hoping for a solution that wouldn't be
Spark-specific. I'm wondering if we can put most of the split-size adjustment
logic here and then pass the relevant information (scan size, parallelism,
etc.) through the scan context. That way we can leverage those properties in
other engines.
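A minimal sketch of what passing that information through could look like, under loud assumptions: `SplitHints` and `planSplitSize` are hypothetical names, not part of `TableScanContext` or any existing Iceberg API, and this version only ever shrinks the split size.

```java
import java.util.OptionalInt;
import java.util.OptionalLong;

final class ScanHintsSketch {
  private ScanHintsSketch() {}

  // Hypothetical engine-supplied hints; any engine (Spark, Flink, Trino, ...) could fill these in.
  static final class SplitHints {
    private final OptionalLong scanSizeBytes;
    private final OptionalInt parallelism;

    private SplitHints(OptionalLong scanSizeBytes, OptionalInt parallelism) {
      this.scanSizeBytes = scanSizeBytes;
      this.parallelism = parallelism;
    }

    static SplitHints none() {
      return new SplitHints(OptionalLong.empty(), OptionalInt.empty());
    }

    static SplitHints of(long scanSizeBytes, int parallelism) {
      return new SplitHints(OptionalLong.of(scanSizeBytes), OptionalInt.of(parallelism));
    }
  }

  // Engine-neutral logic: without usable hints, fall back to the table's split size.
  static long planSplitSize(long tableSplitSizeBytes, SplitHints hints) {
    if (!hints.scanSizeBytes.isPresent() || !hints.parallelism.isPresent()) {
      return tableSplitSizeBytes;
    }

    long scanSize = hints.scanSizeBytes.getAsLong();
    int slots = hints.parallelism.getAsInt();
    if (scanSize <= 0 || slots <= 0) {
      return tableSplitSizeBytes;
    }

    // Aim for at least one split per slot, never exceeding the table's split size.
    long perSlot = (scanSize + slots - 1) / slots;
    return Math.max(1L, Math.min(tableSplitSizeBytes, perSlot));
  }
}
```

An engine that knows it is scanning 64 MB across 16 slots would get 4 MB splits from `planSplitSize(128 MB, SplitHints.of(64 MB, 16))`, while an engine that supplies no hints keeps the table default.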
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]