Re: statistics null count in nested types

Gabor Szadovszky Fri, 16 Jul 2021 07:02:55 -0700

Hi Jorge,

Spark, like other JVM-based implementations, is most probably using
parquet-mr. parquet-mr counts null values regardless of the level in the
structure at which they occur. An additional twist here is that an empty
list cannot be stored as a value: like a null list (when the list itself is
null, which is possible if it is optional), it is written as an entry with
no value. That's why Spark reports 4 here: the two null lists, the empty
list, and the one null element.


Cheers,
Gabor

On Fri, Jul 16, 2021 at 3:29 PM Jorge Cardoso Leitão <
[email protected]> wrote:

> (Branching from the previous discussion, as Micah pointed out another
> interesting aspect)
>
> Consider the list
>
> [[0, 1], None, [2, None, 3], [4, 5, 6], [], [7, 8, 9], None, [10]]
>
> for the schema
>
> optional group column1 (LIST) {
>   repeated group list {
>     optional int32 element;
>   }
> }
>
> When looking at the row group statistics, pyarrow 4 seems to report a null
> count of 1 while Spark 3 reports a null count of 4 (see attached script for
> the writing and reading of the statistics).
>
> I am a bit lost on which should be the intended result. Isn't Spark using
> the official Java implementation? A null count of 4 seems a bit odd in the
> example above.
>
> Best,
> Jorge
>
>
>
