[GitHub] [iceberg] danielcweeks commented on a diff in pull request #7688: Add adaptive split size

via GitHub Fri, 26 May 2023 13:45:19 -0700


danielcweeks commented on code in PR #7688:
URL: https://github.com/apache/iceberg/pull/7688#discussion_r1207327657



##########
core/src/main/java/org/apache/iceberg/BaseScan.java:
##########
@@ -256,4 +265,95 @@ private static Schema lazyColumnProjection(TableScanContext context, Schema sche
   public ThisT metricsReporter(MetricsReporter reporter) {
     return newRefinedScan(table(), schema(), context().reportWith(reporter));
   }
+
+  private Optional<Long> adaptiveSplitSize(long tableSplitSize) {
+    if (!PropertyUtil.propertyAsBoolean(
+        table.properties(),
+        TableProperties.ADAPTIVE_SPLIT_PLANNING,
+        TableProperties.ADAPTIVE_SPLIT_PLANNING_DEFAULT)) {
+      return Optional.empty();
+    }
+
+    int minParallelism =
+        PropertyUtil.propertyAsInt(
+            table.properties(),
+            TableProperties.SPLIT_MIN_PARALLELISM,
+            TableProperties.SPLIT_MIN_PARALLELISM_DEFAULT);
+
+    Preconditions.checkArgument(minParallelism > 0, "Minimum parallelism must be a positive value");
+
+    Snapshot snapshot =
+        Stream.of(context.snapshotId(), context.toSnapshotId())
+            .filter(Objects::nonNull)
+            .map(table::snapshot)
+            .findFirst()
+            .orElseGet(table::currentSnapshot);
+
+    if (snapshot == null || snapshot.summary() == null) {
+      return Optional.empty();
+    }
+
+    Map<String, String> summary = snapshot.summary();
+    long totalFiles =
+        PropertyUtil.propertyAsLong(summary, SnapshotSummary.TOTAL_DATA_FILES_PROP, 0);
+    long totalSize = PropertyUtil.propertyAsLong(summary, SnapshotSummary.TOTAL_FILE_SIZE_PROP, 0);

Review Comment:
   I think we might be able to combine these two approaches in a reasonable way 
that's more generalizable.
   
   > I don't think looking at the total snapshot size or even partition stats 
will be that representative. In my view, knowing the amount of data we scan in 
a particular query and the number of slots in the cluster is critical. That's 
why I thought we would implement this feature at a higher level.
   
   I agree with this.  However, the most common places where this is a problem 
are really simple cases of unpartitioned tables with very little data.  This 
approach will only take effect if the table size is less than 
`minParallelism * splitSize`.  So pretty much anything over a couple of GB 
wouldn't be affected.
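
To make that threshold concrete, here is a minimal sketch of the arithmetic. It is not the PR's implementation: the class `AdaptiveSplitSketch`, the method `adaptiveSplitSize`, and the 128 MB / parallelism 16 values in `main` are assumptions chosen only for illustration.

```java
import java.util.Optional;

final class AdaptiveSplitSketch {
  private AdaptiveSplitSketch() {}

  /**
   * Hypothetical helper: returns a reduced split size only when the table is too small to
   * yield minParallelism splits at the configured split size; otherwise returns empty.
   */
  static Optional<Long> adaptiveSplitSize(long totalSize, long splitSize, int minParallelism) {
    if (minParallelism <= 0) {
      throw new IllegalArgumentException("Minimum parallelism must be a positive value");
    }

    // Tables at or above minParallelism * splitSize already produce enough splits.
    if (totalSize <= 0 || totalSize >= (long) minParallelism * splitSize) {
      return Optional.empty();
    }

    // Ceiling division so that roughly minParallelism splits cover the table.
    long adjusted = Math.max(1L, (totalSize + minParallelism - 1) / minParallelism);
    return Optional.of(adjusted);
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    // Small table: the split size is reduced.
    System.out.println(adaptiveSplitSize(256 * mb, 128 * mb, 16));
    // Large table: empty, so the configured split size is kept.
    System.out.println(adaptiveSplitSize(4096 * mb, 128 * mb, 16));
  }
}
```

With these assumed values the cutoff is 16 * 128 MB = 2 GB, which matches the "couple of GB" estimate above.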
   
   > Whenever we scan huge tables, we see a huge difference between 128MB and 
let's say 512MB or 1GB split size.
   
   We've seen this in a lot of cases, and you may even want to adjust to larger 
split sizes if you're projecting fewer or smaller columns, because the 
calculated splits are based on the whole row group size, but processing a few 
int columns can be much faster than processing string columns.
   
   > Relying on a table property for parallelism seems like shifting the 
complexity of tuning the split size. It varies from query to query and from 
cluster to cluster.
   
   I agree here as well, but I was hoping for a solution that wouldn't be 
Spark-specific.  I'm wondering if we can put most of the logic for adjusting 
the split size here and then pass the relevant information (scan size, 
parallelism, etc.) through the scan context.  That way we can leverage those 
properties in other engines.
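
As a rough sketch of what that could look like, assuming the engine can supply an estimated scan size and a parallelism figure: `ScanHints` and `chooseSplitSize` below are hypothetical names, not existing Iceberg APIs, and this version only ever lowers the split size.

```java
final class SplitPlanningSketch {
  private SplitPlanningSketch() {}

  /** Hypothetical engine-provided hints; not an existing Iceberg class. */
  static final class ScanHints {
    final long scanSizeBytes; // estimated bytes this query will read
    final int parallelism; // slots or cores the engine can run concurrently

    ScanHints(long scanSizeBytes, int parallelism) {
      this.scanSizeBytes = scanSizeBytes;
      this.parallelism = parallelism;
    }
  }

  /**
   * Engine-agnostic sketch: falls back to the table split size when hints are missing,
   * otherwise shrinks the split size so the scan can fill the available parallelism.
   */
  static long chooseSplitSize(long tableSplitSize, ScanHints hints) {
    if (hints == null || hints.scanSizeBytes <= 0 || hints.parallelism <= 0) {
      return tableSplitSize;
    }

    // Ceiling division: bytes each slot would read if the scan were spread evenly.
    long perSlot = (hints.scanSizeBytes + hints.parallelism - 1) / hints.parallelism;
    return Math.min(tableSplitSize, Math.max(1L, perSlot));
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    System.out.println(chooseSplitSize(128 * mb, new ScanHints(256 * mb, 16)));
  }
}
```

Any engine (Spark, Flink, Trino, and so on) could populate the hints from its own notion of scan size and available slots, while the adjustment logic stays in core.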



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

