+1 to suggestion from Xuwei. That is a common practice to work around a bug if a specific writer version can be detected, though the fix might not look elegant.
Best, Gang On Sat, Aug 17, 2024 at 6:24 PM Andrew Lamb <andrewlam...@gmail.com> wrote: > Got it -- makes sense -- thank you > > On Sat, Aug 17, 2024 at 6:11 AM wish maple <maplewish...@gmail.com> wrote: > > > > If it becomes a practical concern, we could look at the writer > metadata, > > perhaps, to detect files written by older versions of parquet-rs > > > > Yes, as the code I listed currently, parquet-java will check the writer > > version > > and do some checking when the previous implementation has a "bug" or is > > inconsistent > > in some cases. > > > > If the user thinks it's important it's also a way for the parquet-rs. > > > > Best, > > Xuwei Fu > > > > Andrew Lamb <andrewlam...@gmail.com> 于2024年8月17日周六 18:06写道: > > > > > > From my perspective, if arrow-rs can guarantee that "when statistics > > > exist > > > but the num_nulls not exist, it means null_count == 0", we can detect > > > the writer like [1]. > > > > > > I agree with this conclusion > > > > > > However, I don't know how we could make that guarantee, given the > parquet > > > file could come from other writers which could (in theory) produce > > > statistics exist but null_count not exist > > > > > > If it becomes a practical concern, we could look at the writer > metadata, > > > perhaps, to detect files written by older versions of parquet-rs > > > > > > Andrew > > > > > > > > > On Fri, Aug 16, 2024 at 5:57 AM wish maple <maplewish...@gmail.com> > > wrote: > > > > > > > > The (current) behavior has the potential for "wrong results" for > > > systems > > > > > that used parquet-rs to read parquet if for files where there are > > > > actually > > > > > nulls but the null_count field is not set (parquet-rs will report > > there > > > > are > > > > > 0 nulls). > > > > > > > > From my perspective, if arrow-rs can guarantee that "when statistics > > > exist > > > > but the num_nulls not exist, it means null_count == 0", we can detect > > > > the writer like [1]. It's a bit hacking but it works > > > > > > > > > > > > [1] > > > > > > > > > > > > > > https://github.com/apache/parquet-java/blob/d4384d3f2e7703fab6363fd9cd80a001e9561db2/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L66 > > > > > > > > Best, > > > > Xuwei > > > > > > > > Andrew Lamb <andrewlam...@gmail.com> 于2024年8月16日周五 17:42写道: > > > > > > > > > Thank you Xuwei, > > > > > > > > > > Regarding parquet-rs and writing null_counts, I believe the current > > > > > behavior (not writing null_count where there are exactly 0 nulls) > is > > a > > > > bug > > > > > and I made a PR to align the behavior with what the C/C++ and Java > > > > writers > > > > > do in [1]. > > > > > > > > > > The current behavior I think means that parquet files written by > > > > parquet-rs > > > > > won't have null_count set and thus reading them from other systems > > may > > > > not > > > > > be as fast as if those systems knew there were no nulls. > > > > > > > > > > Applications that use parquet-rs to read parquet_files and > interpret > > > the > > > > > null_count will need to be changed after the upgrade to explicitly > > > > continue > > > > > the old behavior of "treat no null_count as 0" which is also > > documented > > > > > now. > > > > > > > > > > The (current) behavior has the potential for "wrong results" for > > > systems > > > > > that used parquet-rs to read parquet if for files where there are > > > > actually > > > > > nulls but the null_count field is not set (parquet-rs will report > > there > > > > are > > > > > 0 nulls). However we haven't had any bug reports and none of the > open > > > > > source writers write the statistics struct without also writing > > > > > null_counts. > > > > > > > > > > Hope that help, > > > > > andrew > > > > > > > > > > > > > > > > > > > > > > > > > [1]: https://github.com/apache/arrow-rs/pull/6257 > > > > > > > > > > On Fri, Aug 16, 2024 at 1:47 AM wish maple <maplewish...@gmail.com > > > > > > wrote: > > > > > > > > > > > Currently in our Parquet format, we have multiple null_count and > > > > > > distinct_count: > > > > > > > > > > > > 1. Statistics::null_count, which is an optional null-count > > > > > > 2. ColumnIndex::null_counts, which is similar to > > > > Statistics::null_count, > > > > > > but storing > > > > > > in page index > > > > > > 3. DataPageHeaderV2::num_nulls, which means "null values count" > in > > a > > > > data > > > > > > page > > > > > > 4. Statistics::distinct_count, which is an optional > distinct-count > > > > > > > > > > > > I've checked the implementation in Parquet-C++, Parquet-Java and > > > > > > parquet-rs, for > > > > > > null-count: > > > > > > On writer side: > > > > > > * Parquet-Java and Parquet-C++ would always write null_count, > even > > > > > > if the null_count is 0 or the column is a non-nullable column > > > > > > * Parquet-rs would not write null_count if null count is 0 > > > previously. > > > > > This > > > > > > is likely to be > > > > > > fixed in [1] > > > > > > > > > > > > The column-index would be similar. > > > > > > > > > > > > On reader side: > > > > > > * Parquet-java requires `null_count` to be set, otherwise it > would > > > > regard > > > > > > the statistics as > > > > > > "might contains null or not" [2] > > > > > > * Parquet-rs regard `num_nulls > 0` as has_nulls, and don't check > > the > > > > > > existence of null > > > > > > [3]. The same properties is `num_nulls >= 0` in parquet-java > [4] > > > > > > > > > > > > For num_nulls, I suggest: > > > > > > 1. Writer side should better write num_nulls / null_count even > when > > > > > > num_nulls is > > > > > > 0 or column is not nullable > > > > > > 2. Reader should distinguish whether the null-count is set or > not. > > > When > > > > > > reading a > > > > > > file from parquet-rs. We can convert num-nulls = 0 when it's > > not > > > > set? > > > > > > > > > > > > distinct_count is more weird in this. I‘ve checked this and find > > > > there're > > > > > > merely > > > > > > implementations that use this. So I wonder > > > > > > 1. Would this be exact? > > > > > > 2. Is there any use-cases for this? > > > > > > > > > > > > [1] https://github.com/apache/arrow-rs/issues/6256 > > > > > > [2] > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/parquet-java/blob/d4384d3f2e7703fab6363fd9cd80a001e9561db2/parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java#L93 > > > > > > [3] > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/arrow-rs/blob/042d725888358c73cd2a0d58868ea5c4bad778f7/parquet/src/file/statistics.rs#L401 > > > > > > [4] > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L532 > > > > > > > > > > > > Best, > > > > > > Xuwei Fu > > > > > > > > > > > > > > > > > > > > >