nastra commented on code in PR #13933: URL: https://github.com/apache/iceberg/pull/13933#discussion_r2409887365
########## core/src/main/java/org/apache/iceberg/stats/BaseContentStats.java: ########## @@ -0,0 +1,205 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.stats; + +import java.io.Serializable; +import java.util.List; +import java.util.Objects; +import java.util.Set; +import java.util.stream.Collectors; +import org.apache.iceberg.StructLike; +import org.apache.iceberg.data.GenericRecord; +import org.apache.iceberg.relocated.com.google.common.base.MoreObjects; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.types.Type; +import org.apache.iceberg.types.Types; + +public class BaseContentStats implements ContentStats, StructLike, Serializable { + + private final List<FieldStats<?>> fieldStats; + + /** Used by Avro reflection to instantiate this class when reading manifest files. */ + public BaseContentStats(Types.StructType projection) { + this.fieldStats = Lists.newArrayListWithCapacity(projection.fields().size()); + for (int i = 0; i < projection.fields().size(); i++) { + Types.NestedField field = projection.fields().get(i); + Preconditions.checkArgument( + field.type().isStructType(), "ColumnStats must contain structs: %s", field.type()); + Types.StructType structType = field.type().asStructType(); + Type type = + null != structType.field("lower_bound") + ? structType.field("lower_bound").type() + : null != structType.field("upper_bound") + ? structType.field("upper_bound").type() + : null; + fieldStats.add( + BaseFieldStats.builder() + .fieldId(StatsUtil.fieldIdForStatsField(field.fieldId())) + .type(type) + .build()); + } + } + + private BaseContentStats(List<FieldStats<?>> fieldStats) { + this.fieldStats = Lists.newArrayList(fieldStats); + } + + @Override + public List<FieldStats<?>> fieldStats() { + return fieldStats; + } + + @Override + public int size() { + return fieldStats.size(); + } + + @Override + public <T> T get(int pos, Class<T> javaClass) { + if (pos < 0 || pos > fieldStats().size() - 1) { + return null; Review Comment: no, we need to return null here. This has to do with the fact of how Avro reads the stats schema and how it writes stats entries. For example if you have a table with two columns, you will end up with a stats schema that holds two nested structs as shown below (one for each table column): ``` 10200: 1: optional struct<10201: column_size: optional long (Total size on disk), 10202: value_count: optional long (Total value count, including null and NaN), 10203: null_value_count: optional long (Total null value count), 10204: nan_value_count: optional long (Total NaN value count), 10205: lower_bound: optional int (Lower bound), 10206: upper_bound: optional int (Upper bound)> ``` ``` 10400: 2: optional struct<10401: column_size: optional long (Total size on disk), 10402: value_count: optional long (Total value count, including null and NaN), 10403: null_value_count: optional long (Total null value count), 10404: nan_value_count: optional long (Total NaN value count), 10405: lower_bound: optional string (Lower bound), 10406: upper_bound: optional string (Upper bound)> ``` Now if Avro only writes stats for the first column (pos = 0), it will later still evaluate the stats schema and look at pos = 1, which would result in an exception, because we don't hold any stats here for this particular column and that's why we return null here -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
