I commented in the Jira. Computing the statistics solely from the dictionary values is definitely a bug: a dictionary may contain no nulls, but the dictionary indices certainly may.
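
To make the point concrete, here is a minimal sketch in plain Python. It is a toy model of a dictionary-encoded column (a list of values plus a list of indices where None stands for a null slot), not the Arrow C++ or pyarrow API; all function and variable names below are made up for illustration.

```python
# Toy model of a dictionary-encoded column. This is NOT Arrow's API;
# every name here is hypothetical and only illustrates the idea.

def null_count_from_dictionary(dictionary):
    """Null count looking only at the dictionary values (the buggy view)."""
    return sum(value is None for value in dictionary)

def null_count_from_indices(indices, dictionary):
    """Null count over the logical column: a slot is null when its index
    is null or when it points at a null dictionary entry."""
    return sum(i is None or dictionary[i] is None for i in indices)

def observed_min_max(indices, dictionary):
    """Min/max over the values actually referenced by non-null indices."""
    observed = [
        dictionary[i]
        for i in indices
        if i is not None and dictionary[i] is not None
    ]
    return (min(observed), max(observed)) if observed else None

dictionary = ["a", "b", "z"]      # no nulls in the dictionary itself
indices = [0, None, 1, None, 0]   # two null slots, "z" is never referenced

dict_nulls = null_count_from_dictionary(dictionary)
column_nulls = null_count_from_indices(indices, dictionary)
dict_bounds = (min(dictionary), max(dictionary))
column_bounds = observed_min_max(indices, dictionary)

print(dict_nulls, column_nulls)    # dictionary-only count vs. actual count
print(dict_bounds, column_bounds)  # dictionary-only bounds vs. observed bounds
```

In this toy example the dictionary-only view reports zero nulls while the column has two, and it also reports "z" as the maximum although no slot refers to it, which is the same family of problem as the loose min/max bounds mentioned in the quoted message.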

On Thu, Jul 8, 2021 at 6:18 PM Micah Kornfield <[email protected]> wrote:
>
> Just to clarify by correct statistics you mean null count? Generally that
> attribute is lazily computed. I commented on the JIRA, I would guess this
> is an artifact of not looking at observed values when writing dictionary
> encoded data to parquet. There is another bug opened a little while ago
> now about this not giving tight bounds for values in a given page/row group.
>
>
> On Thu, Jul 8, 2021 at 8:31 AM Kirill Lykov <[email protected]> wrote:
>
> > Hi,
> >
> > I'm investigating https://issues.apache.org/jira/browse/ARROW-12513.
> > While debugging, I've found that when we create dictionary_
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_dict.cc#L111
> > we lose information about null_count.
> > So data_->null_count != 0 but data_->dictionary->null_count == 0.
> > Later we return an array without correct statistics.
> > My question is this seems to be correct behaviour? Or do we need to return
> > an array with statistics? Or these statistics should have been added
> > to data_->dictionary somewhere else?
> >
> > I wrote a more detailed explanation in the jira issue.
> >
> > --
> > Best regards,
> > Kirill Lykov
> >
