rdblue commented on code in PR #5794: URL: https://github.com/apache/iceberg/pull/5794#discussion_r979009386
########## core/src/main/java/org/apache/iceberg/SetStatistics.java: ########## @@ -0,0 +1,77 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg; + +import java.util.List; +import java.util.Map; +import java.util.Optional; +import java.util.stream.Stream; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; + +public class SetStatistics implements UpdateStatistics { + private final TableOperations ops; + private final Map<Long, Optional<StatisticsFile>> statisticsToSet = Maps.newHashMap(); + + public SetStatistics(TableOperations ops) { + this.ops = ops; + } + + @Override + public UpdateStatistics setStatistics(long snapshotId, StatisticsFile statisticsFile) { + Preconditions.checkArgument(snapshotId == statisticsFile.snapshotId()); + statisticsToSet.put(snapshotId, Optional.of(statisticsFile)); + return this; + } + + @Override + public UpdateStatistics removeStatistics(long snapshotId) { + statisticsToSet.put(snapshotId, Optional.empty()); + return this; + } + + @Override + public List<StatisticsFile> apply() { + return Stream.concat( + // Retained statistics + ops.current().statisticsFiles().stream() + .filter( + statisticsFile -> !statisticsToSet.containsKey(statisticsFile.snapshotId())), + // New statistics + statisticsToSet.values().stream().filter(Optional::isPresent).map(Optional::get)) + .collect(ImmutableList.toImmutableList()); + } + + @Override + public void commit() { + TableMetadata base = ops.current(); // or ops.refresh() ? Review Comment: There are a couple questions: 1. Should a commit call `refresh` or attempt with the current metadata? 2. Should refresh be called in `apply` or in `commit`? For 1, what matters is whether the operation can retry and how expensive that retry is. `SchemaUpdate` can't retry because it tracks changes based on the current schema. For example, `updateColumn("id", LongType.get())` will track a replacement `id` field with its comment unchanged. If a concurrent update modifies the comment, then committing the changes would inadvertently revert the concurrent change. That's why `SchemaUpdate` doesn't retry. As a result, `current` must be used to swap for the state when the update was started. Operations that support retry typically use `refresh` instead of `current` so that changes are based on the latest state of the table. That is basically to prevent wasted work when Iceberg can detect that a retry would be necessary. This is why `SnapshotProducer` calls `refresh`. Many changes that can be retried don't have expensive retries and it's better to avoid calling `refresh`. `PropertiesUpdate`, for example, is a metadata only change so it's better to attempt with the current metadata to avoid an extra call to the catalog service. For 2, it depends on how `apply` is used. If `apply` is called by `commit` then you'll probably see the `refresh` call there. But if `apply` and `commit` depend on common logic, `apply` will generally not call `refresh` (because it is a preview) and `commit` will call it. Hopefully that clarifies why there are differences! Here, the changes are idempotent and only affect metadata so I think we should use `current` instead of `refresh`, like the `PropertiesUpdate` operation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
