I commented in the Jira. It is definitely a bug to use solely the
dictionary values for computing the statistics, because while a
dictionary may not have nulls, the dictionary indices certainly may.


On Thu, Jul 8, 2021 at 6:18 PM Micah Kornfield <[email protected]> wrote:
>
> Just to clarify, by correct statistics do you mean null count?  Generally that
> attribute is lazily computed.  I commented on the JIRA; I would guess this
> is an artifact of not looking at observed values when writing dictionary
> encoded data to parquet.  There is another bug, opened a little while ago,
> about this not giving tight bounds for values in a given page/row group.
>
>
> On Thu, Jul 8, 2021 at 8:31 AM Kirill Lykov <[email protected]> wrote:
>
> > Hi,
> >
> > I'm investigating https://issues.apache.org/jira/browse/ARROW-12513.
> > While debugging, I've found that when we create dictionary_
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_dict.cc#L111
> > we lose information about null_count.
> > So data_->null_count != 0 but data_->dictionary->null_count == 0.
> > Later we return an array without correct statistics.
> > My question is: is this the correct behaviour? Or do we need to return
> > an array with statistics? Or should these statistics have been added
> > to data_->dictionary somewhere else?
> >
> > I wrote a more detailed explanation in the jira issue.
> >
> > --
> > Best regards,
> > Kirill Lykov
> >
