[GitHub] [iceberg] rdblue commented on a diff in pull request #7688: Add adaptive split size

via GitHub Thu, 25 May 2023 16:26:43 -0700


rdblue commented on code in PR #7688:
URL: https://github.com/apache/iceberg/pull/7688#discussion_r1206085976



##########
core/src/main/java/org/apache/iceberg/BaseScan.java:
##########
@@ -256,4 +265,95 @@ private static Schema 
lazyColumnProjection(TableScanContext context, Schema sche
   public ThisT metricsReporter(MetricsReporter reporter) {
     return newRefinedScan(table(), schema(), context().reportWith(reporter));
   }
+
+  private Optional<Long> adaptiveSplitSize(long tableSplitSize) {
+    if (!PropertyUtil.propertyAsBoolean(
+        table.properties(),
+        TableProperties.ADAPTIVE_SPLIT_PLANNING,
+        TableProperties.ADAPTIVE_SPLIT_PLANNING_DEFAULT)) {
+      return Optional.empty();
+    }
+
+    int minParallelism =
+        PropertyUtil.propertyAsInt(
+            table.properties(),
+            TableProperties.SPLIT_MIN_PARALLELISM,
+            TableProperties.SPLIT_MIN_PARALLELISM_DEFAULT);
+
+    Preconditions.checkArgument(minParallelism > 0, "Minimum parallelism must 
be a positive value");
+
+    Snapshot snapshot =
+        Stream.of(context.snapshotId(), context.toSnapshotId())
+            .filter(Objects::nonNull)
+            .map(table::snapshot)
+            .findFirst()
+            .orElseGet(table::currentSnapshot);
+
+    if (snapshot == null || snapshot.summary() == null) {
+      return Optional.empty();
+    }
+
+    Map<String, String> summary = snapshot.summary();
+    long totalFiles =
+        PropertyUtil.propertyAsLong(summary, 
SnapshotSummary.TOTAL_DATA_FILES_PROP, 0);
+    long totalSize = PropertyUtil.propertyAsLong(summary, 
SnapshotSummary.TOTAL_FILE_SIZE_PROP, 0);
+
+    if (totalFiles <= 0 || totalSize <= 0) {
+      return Optional.empty();
+    }
+
+    if (totalFiles > minParallelism && totalSize >= tableSplitSize * 
minParallelism) {
+      // If it is possible that splits could normally be calculated to meet the
+      // minimum parallelism, do not adjust the split size
+      return Optional.empty();
+    }
+
+    FileFormat fileFormat =
+        FileFormat.fromString(
+            table
+                .properties()
+                .getOrDefault(
+                    TableProperties.DEFAULT_FILE_FORMAT,
+                    TableProperties.DEFAULT_FILE_FORMAT_DEFAULT));
+
+    if (!fileFormat.isSplittable()) {
+      return Optional.of(totalSize / totalFiles);

Review Comment:
   IIUC, after the check above, we have a case where individual files are not 
enough to get the parallelism we want. So this is trying to prevent any 
combining, right?
   
   Why not do that directly with a "do not combine" flag?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rdblue commented on a diff in pull request #7688: Add adaptive split size

Reply via email to