Hi Jorge,

Spark (like other JVM-based implementations) is most probably using parquet-mr. parquet-mr counts null values regardless of the level in the structure at which they occur. An additional twist here is that we cannot store empty lists, only null lists (when the list itself is null), if the list is optional. That's why Spark reports 4 here: the two null lists, the empty list, and the one null element are all counted.
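To make the arithmetic concrete, here is a rough sketch of the two counting rules applied to your example. It assumes the usual Dremel-style definition levels for this schema (0 = null list, 1 = empty list, 2 = null element, 3 = present element). The function name and the mapping of each rule to pyarrow or parquet-mr are illustrative assumptions, not code taken from either implementation.

```python
# Example data from the original message.
data = [[0, 1], None, [2, None, 3], [4, 5, 6], [], [7, 8, 9], None, [10]]

# Assumed maximum definition level for:
#   optional group column1 (LIST) { repeated group list { optional int32 element; } }
MAX_DEF_LEVEL = 3


def definition_levels(rows):
    """Return one definition level per entry written to the leaf column."""
    levels = []
    for row in rows:
        if row is None:
            levels.append(0)  # the list itself is null
        elif len(row) == 0:
            levels.append(1)  # the list is present but empty
        else:
            for element in row:
                # 2 = null element, 3 = element is present
                levels.append(2 if element is None else 3)
    return levels


levels = definition_levels(data)

# Rule A (assumed to match what pyarrow reports): only null leaf elements.
leaf_only_nulls = sum(1 for d in levels if d == MAX_DEF_LEVEL - 1)

# Rule B (assumed to match what parquet-mr reports): every entry that is
# not fully defined, at whatever level the value is missing.
any_level_nulls = sum(1 for d in levels if d < MAX_DEF_LEVEL)

print(leaf_only_nulls, any_level_nulls)  # 1 4
```

Under rule B the 4 breaks down as two null lists, one empty list and one null element.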
Cheers,
Gabor

On Fri, Jul 16, 2021 at 3:29 PM Jorge Cardoso Leitão <[email protected]> wrote:

> (Branching from the previous discussion, as Micah pointed out another
> interesting aspect)
>
> Consider the list
>
> [[0, 1], None, [2, None, 3], [4, 5, 6], [], [7, 8, 9], None, [10]]
>
> for the schema
>
> optional group column1 (LIST) {
>   repeated group list {
>     optional int32 element;
>   }
> }
>
> When looking at the row group statistics, pyarrow 4 seems to report a null
> count of 1 while spark 3 reports a null count of 4 (see attached script for
> the writing and reading of the statistics).
>
> I am a bit lost on which should be the intended result. Isn't spark using
> the official Java implementation? a null count of 4 seems a bit odd in the
> example above.
>
> Best,
> Jorge
