Re: [PR] Spark-3.5: Add spark action to compute partition stats [iceberg]

via GitHub Tue, 01 Jul 2025 21:50:05 -0700


amogh-jahagirdar commented on code in PR #12450:
URL: https://github.com/apache/iceberg/pull/12450#discussion_r2179062769



##########
api/src/main/java/org/apache/iceberg/actions/ComputePartitionStats.java:
##########
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.PartitionStatisticsFile;
+
+/**
+ * An action that computes and writes the partition statistics of an Iceberg 
table. Current snapshot
+ * is used by default.
+ */
+public interface ComputePartitionStats
+    extends Action<ComputePartitionStats, ComputePartitionStats.Result> {

Review Comment:
   @ajantha-bhat 
   
   >Did we think about what happens if one of the compute failed? If table 
stats compute failed but partition stats compute has succeeded and vice versa.
   
   Yeah my take is that stats computation should be best effort; so let's say 
someone configures both table and partition stats to run, and whichever one 
that runs first fails we just move on to the next. Of course at the end if 
there's any kind of failure we should clearly indicate that to the user, but I 
think the partial failure case you're referencing can be handled by just 
treating all stats generation as best effort in the implementation.
   
   > We have to add lot of validations like what of user configures columns and 
calls withOnlyPartitionStats
   I guess I'm not understanding why the only additional validation would not 
just be `Preconditions.checkArgument(columns.isEmpty())` in the 
`withOnlyPartitionStats`? Are there any validations beyond this?  
   
   @RussellSpitzer 
   My breakdown of the pros/cons, @ajantha-bhat feel free to add stuff in case 
I'm missing something
   
   Combined action with different options:
   
   Pros:
   1. I think this better aligns with what exists in other engines like 
   
[Hive](https://cwiki.apache.org/confluence/display/hive/statsdev#StatsDev-Examples)
   
[Spark](https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html)
   [Trino](https://trino.io/docs/current/sql/analyze.html#examples)
   where column stats and partition stats are effectively collected through the 
same SQL.
   
   I don't really see a distinction here between the procedure/action. I feel 
like we should generally try
   and keep the action API surface area reduced, just as we do with the 
procedure, it's just that the procedure is SQL oriented instead of programatic.
   
   2.  I _think_ In the long run many of the options that apply for partition 
stats will also apply for normal table stats. e.g. I think we will also 
probably want a force refresh; we will also probably want incrementally 
computing. So if we know much of the same options will apply at some point it 
seems better to just coalesce the API here.
   
   Cons:
   
   1. There's a bit of complexity for validation exposing multiple options on 
the API that won't work together (like partitionStatsOnly and specifying 
columns). I'm doubtful how complex this really is, this just seems like a 
simple preconditions check but I could be missing something.
   
   
   The main argument I can think for separate procedures is for people familiar 
with the details of the statistics and their granularities, then having 
separate procedures is clearer since at least as of today they are different. 
   
   The main arguments that I have against separate procedures is that if we 
consider that in the long run the options will converge it seems compelling to 
keep it combined (just going back to my feeling to keep as minimal of an API 
surface area as possible). I think it'd be better to do this to begin with 
rather than separate and then undo it later.
   
   In the end, if folks feel strongly about keeping it separate I think I'm OK, 
I just want to make sure we're being thoughtful about the implications of it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Spark-3.5: Add spark action to compute partition stats [iceberg]

Reply via email to