I've drafted a pull request here:
https://github.com/apache/parquet-format/pull/449

Best,
Xuwei Fu

wish maple <maplewish...@gmail.com> wrote on Fri, Aug 16, 2024 at 13:46:

> Currently in our Parquet format, we have multiple null_count and
> distinct_count fields:
>
> 1. Statistics::null_count, which is an optional null count
> 2. ColumnIndex::null_counts, which is similar to Statistics::null_count
>    but stored in the page index
> 3. DataPageHeaderV2::num_nulls, which is the number of null values in a
>    data page
> 4. Statistics::distinct_count, which is an optional distinct count
>
> I've checked the implementations in Parquet-C++, Parquet-Java and
> parquet-rs. For null-count:
>
> On the writer side:
> * Parquet-Java and Parquet-C++ always write null_count, even if the
>   null_count is 0 or the column is non-nullable
> * Parquet-rs previously did not write null_count when the null count
>   was 0. This is likely to be fixed in [1]
>
> The column index is handled similarly.
>
> On the reader side:
> * Parquet-Java requires `null_count` to be set; otherwise it treats the
>   statistics as "might or might not contain nulls" [2]
> * Parquet-rs treats `num_nulls > 0` as has_nulls and does not check
>   whether the null count exists [3]. The corresponding property in
>   Parquet-Java is `num_nulls >= 0` [4]
>
> For num_nulls, I suggest:
> 1. Writers should write num_nulls / null_count even when num_nulls is 0
>    or the column is not nullable
> 2. Readers should distinguish whether the null count is set or not. When
>    reading a file written by parquet-rs, could we treat num_nulls as 0
>    when it is not set?
>
> distinct_count is stranger. I've checked and found that hardly any
> implementations use it. So I wonder:
> 1. Is it meant to be exact?
> 2. Are there any use cases for it?
>
> [1] https://github.com/apache/arrow-rs/issues/6256
> [2] https://github.com/apache/parquet-java/blob/d4384d3f2e7703fab6363fd9cd80a001e9561db2/parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java#L93
> [3] https://github.com/apache/arrow-rs/blob/042d725888358c73cd2a0d58868ea5c4bad778f7/parquet/src/file/statistics.rs#L401
> [4] https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L532
>
> Best,
> Xuwei Fu
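
To illustrate the reader-side suggestion in the quoted message (distinguish a null count that is set from one that is not, and optionally treat "not set" as 0 for files from a writer known to omit zero counts), here is a minimal sketch in Python. It is not taken from Parquet-Java, Parquet-C++ or parquet-rs; `ColumnStats`, `has_nulls` and the `unset_means_zero` flag are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    # Hypothetical stand-in for Statistics::null_count.
    # None means the writer did not set the field at all.
    null_count: Optional[int] = None


def has_nulls(stats: ColumnStats, unset_means_zero: bool = False) -> Optional[bool]:
    """Return True or False when known, or None when it cannot be determined."""
    if stats.null_count is None:
        # Field not set: unknown by default. The flag is an assumption that the
        # file came from a writer that omitted null_count when it was 0.
        return False if unset_means_zero else None
    return stats.null_count > 0
```

With this shape, a reader can only skip null handling when the result is `False`; a result of `None` has to be treated as "may contain nulls".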