[DISCUSS] Clarify num_nulls(null_counts) and distinct_counts in Parquet statistics

wish maple Thu, 15 Aug 2024 22:47:53 -0700

The Parquet format currently defines several null-count and
distinct-count fields:


1. Statistics::null_count, an optional null count
2. ColumnIndex::null_counts, similar to Statistics::null_count but
    stored in the page index
3. DataPageHeaderV2::num_nulls, the number of null values in a data page
4. Statistics::distinct_count, an optional distinct count

I've checked how Parquet-C++, Parquet-Java and parquet-rs handle
null_count.

On the writer side:
* Parquet-Java and Parquet-C++ always write null_count, even if it is 0
   or the column is non-nullable
* parquet-rs has so far not written null_count when the null count is 0.
   This is likely to be fixed in [1]

The column index behaves similarly.

On the reader side:
* Parquet-Java requires `null_count` to be set; otherwise it treats the
   statistics as "may or may not contain nulls" [2]
* parquet-rs treats `num_nulls > 0` as has_nulls and doesn't check
   whether the null count is set [3]. The corresponding property in
   Parquet-Java is `num_nulls >= 0` [4]
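
The difference between the two reader behaviours can be sketched as
follows. This is a minimal Python model, not code from either project:
the function names are hypothetical, and using `None` to represent an
unset null_count is an assumption made for illustration.

```python
from typing import Optional


def may_contain_nulls_conservative(null_count: Optional[int]) -> bool:
    """Hypothetical model of a reader that treats an unset count as unknown."""
    if null_count is None:
        # Unknown: the reader cannot rule out nulls, so it must assume
        # they may be present.
        return True
    return null_count > 0


def may_contain_nulls_unset_as_zero(null_count: Optional[int]) -> bool:
    """Hypothetical model of a reader that ignores whether the count is set."""
    # An unset count collapses to 0, so it is indistinguishable from
    # an explicitly written "no nulls".
    return (null_count or 0) > 0
```

With an unset count the two models disagree: the first reports that
nulls may be present, the second reports that there are none.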

For num_nulls, I suggest:
1. Writers should write num_nulls / null_count even when it is 0 or the
    column is not nullable
2. Readers should distinguish whether the null count is set or not. When
    reading a file written by parquet-rs, could we treat an unset
    num_nulls as 0?
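
A rough sketch of what suggestion 2 could look like on the reader side,
again in Python. The function name and the `writer_omits_zero` flag are
hypothetical; how a reader would actually detect such a writer is not
decided here and is left to the caller.

```python
from typing import Optional


def resolve_null_count(null_count: Optional[int],
                       writer_omits_zero: bool) -> Optional[int]:
    """Return the effective null count, or None when it is unknown.

    `writer_omits_zero` is a hypothetical flag meaning "this file came
    from a writer known to skip null_count when it is 0".
    """
    if null_count is not None:
        # The count was written explicitly, so use it as-is.
        return null_count
    if writer_omits_zero:
        # For such writers, an unset count is assumed to mean 0.
        return 0
    # Otherwise the count is unknown and must not be treated as 0.
    return None
```

The point of returning `None` instead of 0 in the last case is that
callers can still tell "no nulls" apart from "no information".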

distinct_count is stranger. I've checked and found that few
implementations use it. So I wonder:
1. Is it expected to be exact?
2. Are there any use cases for it?

[1] https://github.com/apache/arrow-rs/issues/6256
[2] https://github.com/apache/parquet-java/blob/d4384d3f2e7703fab6363fd9cd80a001e9561db2/parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java#L93
[3] https://github.com/apache/arrow-rs/blob/042d725888358c73cd2a0d58868ea5c4bad778f7/parquet/src/file/statistics.rs#L401
[4] https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L532

Best,
Xuwei Fu
