amogh-jahagirdar commented on code in PR #12450: URL: https://github.com/apache/iceberg/pull/12450#discussion_r2179062769
########## api/src/main/java/org/apache/iceberg/actions/ComputePartitionStats.java: ########## @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.actions; + +import org.apache.iceberg.PartitionStatisticsFile; + +/** + * An action that computes and writes the partition statistics of an Iceberg table. Current snapshot + * is used by default. + */ +public interface ComputePartitionStats + extends Action<ComputePartitionStats, ComputePartitionStats.Result> { Review Comment: @ajantha-bhat >Did we think about what happens if one of the compute failed? If table stats compute failed but partition stats compute has succeeded and vice versa. Yeah my take is that stats computation should be best effort; so let's say someone configures both table and partition stats to run, and whichever one that runs first fails we just move on to the next. Of course at the end if there's any kind of failure we should clearly indicate that to the user, but I think the partial failure case you're referencing can be handled by just treating all stats generation as best effort in the implementation. > We have to add lot of validations like what of user configures columns and calls withOnlyPartitionStats I guess I'm not understanding why the only additional validation would not just be `Preconditions.checkArgument(columns.isEmpty())` in the `withOnlyPartitionStats`? Are there any validations beyond this? @RussellSpitzer My breakdown of the pros/cons, @ajantha-bhat feel free to add stuff in case I'm missing something Combined action with different options: Pros: 1. I think this better aligns with what exists in other engines like [Hive](https://cwiki.apache.org/confluence/display/hive/statsdev#StatsDev-Examples) [Spark](https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html) [Trino](https://trino.io/docs/current/sql/analyze.html#examples) where column stats and partition stats are effectively collected through the same SQL. I don't really see a distinction here between the procedure/action. I feel like we should generally try and keep the action API surface area reduced, just as we do with the procedure, it's just that the procedure is SQL oriented instead of programatic. 2. I _think_ In the long run many of the options that apply for partition stats will also apply for normal table stats. e.g. I think we will also probably want a force refresh; we will also probably want incrementally computing. So if we know much of the same options will apply at some point it seems better to just coalesce the API here. Cons: 1. There's a bit of complexity for validation exposing multiple options on the API that won't work together (like partitionStatsOnly and specifying columns). I'm doubtful how complex this really is, this just seems like a simple preconditions check but I could be missing something. The main argument I can think for separate procedures is for people familiar with the details of the statistics and their granularities, then having separate procedures is clearer since at least as of today they are different. The main arguments that I have against separate procedures is that if we consider that in the long run the options will converge it seems compelling to keep it combined (just going back to my feeling to keep as minimal of an API surface area as possible). I think it'd be better to do this to begin with rather than separate and then undo it later. In the end, if folks feel strongly about keeping it separate I think I'm OK, I just want to make sure we're being thoughtful about the implications of it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
