Just to clarify: by "correct statistics", do you mean the null count?
Generally that attribute is computed lazily.  I commented on the JIRA; my
guess is that this is an artifact of not looking at the observed values when
writing dictionary-encoded data to Parquet.  There is also another bug,
opened a little while ago, about this not giving tight bounds for the values
in a given page/row group.
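
To make the two points above concrete, here is a toy sketch in plain C++17.
It is not Arrow's code: ToyArray, ToyDictionaryArray, observed_min_max and
MakeExample are made-up names for illustration, and only the idea of a -1
"unknown" sentinel for the null count is borrowed from Arrow.  It shows a
null count that is computed on first request, and min/max bounds taken only
from dictionary values that the indices actually reference.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Sentinel meaning "null count has not been computed yet" (toy model).
constexpr int64_t kUnknownNullCount = -1;

// Hypothetical array type: a slot without a value is a null.
template <typename T>
struct ToyArray {
  std::vector<std::optional<T>> slots;
  mutable int64_t cached_null_count = kUnknownNullCount;

  // Count nulls on first use, then return the cached result.
  int64_t null_count() const {
    if (cached_null_count == kUnknownNullCount) {
      cached_null_count = std::count_if(
          slots.begin(), slots.end(),
          [](const std::optional<T>& s) { return !s.has_value(); });
    }
    return cached_null_count;
  }
};

// Hypothetical dictionary-encoded array: nulls live in the indices,
// while the dictionary only stores the distinct values.
struct ToyDictionaryArray {
  ToyArray<int32_t> indices;
  ToyArray<std::string> dictionary;

  int64_t null_count() const { return indices.null_count(); }

  // Min/max over the values referenced by non-null indices only,
  // so unused dictionary entries cannot loosen the bounds.
  std::optional<std::pair<std::string, std::string>> observed_min_max() const {
    std::optional<std::pair<std::string, std::string>> result;
    for (const auto& idx : indices.slots) {
      if (!idx) continue;  // null slot, nothing observed
      assert(*idx >= 0 &&
             static_cast<std::size_t>(*idx) < dictionary.slots.size());
      const auto& value = dictionary.slots[static_cast<std::size_t>(*idx)];
      if (!value) continue;
      if (!result) {
        result = std::make_pair(*value, *value);
      } else {
        if (*value < result->first) result->first = *value;
        if (*value > result->second) result->second = *value;
      }
    }
    return result;
  }
};

// Five slots, two of them null; dictionary entry "z" is never referenced.
inline ToyDictionaryArray MakeExample() {
  ToyDictionaryArray arr;
  arr.indices.slots = {0, std::nullopt, 1, std::nullopt, 0};
  arr.dictionary.slots = {std::string("b"), std::string("c"), std::string("z")};
  return arr;
}
```

In this toy model the outer array reports two nulls while its dictionary
reports zero, which mirrors data_->null_count != 0 alongside
data_->dictionary->null_count == 0.  The observed bounds are "b" to "c",
whereas bounds taken over the whole dictionary would stretch to "z".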


On Thu, Jul 8, 2021 at 8:31 AM Kirill Lykov <[email protected]> wrote:

> Hi,
>
> I'm investigating https://issues.apache.org/jira/browse/ARROW-12513.
> While debugging, I've found that when we create dictionary_
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_dict.cc#L111
> we lose information about null_count.
> So data_->null_count != 0 but data_->dictionary->null_count == 0.
> Later we return an array without correct statistics.
> My question is: is this the correct behaviour? Or do we need to return
> an array with statistics? Or should these statistics have been added
> to data_->dictionary somewhere else?
>
> I wrote a more detailed explanation in the jira issue.
>
> --
> Best regards,
> Kirill Lykov
>
