gaborkaszab commented on code in PR #14640:
URL: https://github.com/apache/iceberg/pull/14640#discussion_r2619768868


##########
core/src/main/java/org/apache/iceberg/BasePartitionStatisticsScan.java:
##########
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.Optional;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.types.Types;
+
+public class BasePartitionStatisticsScan implements PartitionStatisticsScan {
+
+  private final Table table;
+  private Long snapshotId;
+
+  public BasePartitionStatisticsScan(Table table) {
+    this.table = table;
+  }
+
+  @Override
+  public PartitionStatisticsScan useSnapshot(long newSnapshotId) {
+    Preconditions.checkArgument(
+        table.snapshot(newSnapshotId) != null, "Cannot find snapshot with ID 
%s", newSnapshotId);
+
+    this.snapshotId = newSnapshotId;
+    return this;
+  }
+
+  @Override
+  public PartitionStatisticsScan filter(Expression newFilter) {
+    throw new UnsupportedOperationException("Filtering is not supported");
+  }
+
+  @Override
+  public PartitionStatisticsScan project(Schema newSchema) {
+    throw new UnsupportedOperationException("Projection is not supported");
+  }
+
+  @Override
+  public CloseableIterable<PartitionStatistics> scan() {
+    if (snapshotId == null) {
+      if (table.currentSnapshot() == null) {
+        return CloseableIterable.empty();
+      }
+
+      snapshotId = table.currentSnapshot().snapshotId();
+    }
+
+    Optional<PartitionStatisticsFile> statsFile =
+        table.partitionStatisticsFiles().stream()
+            .filter(f -> f.snapshotId() == snapshotId)
+            .findFirst();
+
+    if (statsFile.isEmpty()) {

Review Comment:
   Thanks for taking a look, @findinpath !
   I think in general the "get me the latest available stats" use-case makes 
sense. However, it would introduce much ambiguity into the API because we won't 
know how fresh the returned stats are. I think we have 2 option here:
   1) introduce a way to provide a snapshot ID (other than `useSnapshot()`) 
that will explicitly be used for these latest available searches. I think such 
a functionality should return somehow that how many steps were required to get 
the stats, and maybe some metrics about the stats themselves that doesn't have 
partition stats in the chain. E.g. 3 steps had to make to find partition stats 
and the skipped snapshots added 12 data data files etc. The engine can then 
judge whether using such stats makes sense or not.
   2) Let the engine do the work to find the snapshot with partition stats. 
This way the engine directly can judge if it makes sense to use such stats. 
(e.g. already skipped half of the snapshots, won't continue following the 
chain). I think this is the cleaner approach at the moment. For this 
`PartitionStatsHandler.latestStatsFile()` could help how to find the latest 
snapshot having partition stats.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to