Just to clarify: by "correct statistics", do you mean the null count? Generally that attribute is computed lazily. I commented on the JIRA. My guess is that this is an artifact of not looking at the observed values when writing dictionary-encoded data to Parquet. There is another bug, opened a while ago, about this not giving tight bounds for the values in a given page/row group.
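
To make "lazily computed" concrete, here is a toy model in plain Python. It is not the Arrow API: `ArrayData`, `get_null_count`, `dictionary_encode`, and `UNKNOWN_NULL_COUNT` are hypothetical names for illustration, and the -1 sentinel is my assumption about how "not computed yet" is represented. It also assumes nulls are recorded in the indices rather than in the dictionary values, which would be one way to end up with a zero null count on the dictionary and a nonzero one on the parent.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed sentinel meaning "null count has not been computed yet".
UNKNOWN_NULL_COUNT = -1


@dataclass
class ArrayData:
    """Toy stand-in for an array's data; None marks a null slot."""
    values: List[Optional[object]]
    null_count: int = UNKNOWN_NULL_COUNT
    dictionary: Optional["ArrayData"] = None

    def get_null_count(self) -> int:
        # Compute on first request, then cache the result.
        if self.null_count == UNKNOWN_NULL_COUNT:
            self.null_count = sum(v is None for v in self.values)
        return self.null_count


def dictionary_encode(values: List[Optional[str]]) -> ArrayData:
    """Encode values as indices into a dictionary of distinct non-null values."""
    dictionary: List[Optional[object]] = []
    positions = {}
    indices: List[Optional[object]] = []
    for v in values:
        if v is None:
            # Assumption: a null is stored as a null index, not as a dictionary entry.
            indices.append(None)
            continue
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return ArrayData(values=indices, dictionary=ArrayData(values=dictionary))


encoded = dictionary_encode(["a", None, "b", "a", None])
```

In this model, reading the `null_count` field directly before anyone has asked for it gives the sentinel rather than a real count, and the dictionary's count is tracked separately from the count on the indices. Whether that matches what you are seeing at array_dict.cc#L111 is something the JIRA discussion would need to confirm.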

On Thu, Jul 8, 2021 at 8:31 AM Kirill Lykov <[email protected]> wrote:

> Hi,
>
> I'm investigating https://issues.apache.org/jira/browse/ARROW-12513.
> While debugging, I've found that when we create dictionary_
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_dict.cc#L111
> we lose information about null_count.
> So data_->null_count != 0 but data_->dictionary->null_count == 0.
> Later we return an array without correct statistics.
> My question is this seems to be correct behaviour? Or do we need to return
> an array with statistics? Or these statistics should have been added
> to data_->dictionary somewhere else?
>
> I wrote a more detailed explanation in the jira issue.
>
> --
> Best regards,
> Kirill Lykov
