Thanks, that was exactly what I was looking for.

I do think we could offer this or other examples in the spec to make it
clear what they represent (including the null count).
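
As a rough sketch of what such an example could look like, here is how the
counts discussed in this thread fall out for a column of optional lists of
optional int32. The function name and the returned keys are my own
illustration, not taken from the spec or any library. It reports both
readings of the null count (nulls at any level vs. nulls at the leaf only),
and it leaves empty lists out of both null tallies, which is an assumption
this thread does not settle:

```python
def page_counts(rows):
    """Count page-header fields for a column of optional lists of optional ints.

    Hypothetical helper for illustration; not part of any Parquet library.
    """
    num_values = 0        # one entry per (repetition, definition) level pair
    nulls_any_level = 0   # null lists plus null elements
    nulls_leaf_only = 0   # null elements only
    for row in rows:
        if row is None:
            # A null list still occupies one slot in the levels.
            num_values += 1
            nulls_any_level += 1
        elif len(row) == 0:
            # An empty list occupies one slot but stores no value.
            # Assumption: it is not counted as a null in either tally.
            num_values += 1
        else:
            for item in row:
                num_values += 1
                if item is None:
                    nulls_any_level += 1
                    nulls_leaf_only += 1
    return {
        "num_values": num_values,
        "num_rows": len(rows),  # one row per top-level entry
        "num_nulls_any_level": nulls_any_level,
        "num_nulls_leaf_only": nulls_leaf_only,
    }


counts = page_counts([[0, 1], None, [2, None, 3]])
print(counts)
```

For [[0, 1], None, [2, None, 3]] this gives num_values = 6 and num_rows = 3,
with a null count of 2 under the any-level reading and 1 under the leaf-only
reading.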

I filed ARROW-13349 to track the pyarrow discrepancy.

Best,
Jorge


On Thu, Jul 15, 2021 at 7:28 PM Micah Kornfield <[email protected]>
wrote:

> >
> > I don't have any experience in pyarrow but either it writes wrong values
> > into these fields or the schema is not the same as the one in your
> > example.
>
>
> The number of rows from pyarrow is clearly a bug (the code passes
> num_values for both).
>
> I think it might be worth discussing the null count some more. I think
> pyarrow is considering only null values at the leaf of the schema which is
> why the value is 1. The full comment from the specification says "Number
> of non-null = num_values - num_nulls which is also the number of values in
> the data section".
>
> "number of values in the data section" seems to be at odds with counting
> nulls at every level, since we only store values when they are non-null at
> leaf (empty lists are only stored in repetition/definition level).  But I
> might be misinterpreting this.  If null_count is intended to capture nulls
> at any level of the schema it seems we should update the documentation to
> be clearer on this point.  We should also make the same clarification on
> "null_count" for page statistics.
>
> Thanks,
> Micah
>
> On Thu, Jul 15, 2021 at 8:44 AM Gabor Szadovszky <[email protected]> wrote:
>
> > Hi Jorge,
> >
> > Please correct me if I'm wrong but it seems the schema of your column is
> > similar to the following:
> > optional group column1 (LIST) {
> >   repeated group list {
> >     optional int32 element;
> >   }
> > }
> >
> > Based on the specs in the thrift file
> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L559-L565>:
> >
> >    - num_values: the number of all values (including nulls) in the page.
> >    This is 6 in your example.
> >    - num_nulls: the number of null values in the page. The spec says
> >    "non-null = num_values - num_nulls" so we do not care about the level
> >    of the null value, only that it is null. So, the correct value for your
> >    example is 2.
> >    - num_rows: the number of "first level objects" in the page. In other
> >    words, the number of rows for the column in the current page. If the
> >    column is a primitive (not a nested type), this value is equal to
> >    num_values. In your example the correct value is 3.
> >
> > I don't have any experience in pyarrow but either it writes wrong values
> > into these fields or the schema is not the same as the one in your
> > example.
> >
> > Since compressed_size and num_values are enough for reading a V1 page
> > they shall be enough to read a V2 page as well. The problem is num_nulls
> > and num_rows are also required fields of the V2 page header so you must
> > fill them with the correct values.
> >
> > Regards,
> > Gabor
> >
> > On Thu, Jul 15, 2021 at 10:00 AM Jorge Cardoso Leitão <
> > [email protected]> wrote:
> >
> > > In the V2 data page header, we have:
> > >
> > > * num_values
> > > * num_rows
> > > * num_nulls
> > >
> > > While on the V1 data page header, we only have "num_values".
> > >
> > > On a page representing a list, e.g. [[0, 1], None, [2, None, 3]], how
> > > should each of these numbers be written in v1 and v2?
> > >
> > > My current understanding from the docs is that for the example above,
> > > we should write:
> > >
> > > v2:
> > > * num_values: 6
> > > * num_rows: 3
> > > * num_nulls: 2
> > >
> > > v1:
> > > * num_values: 6
> > >
> > > But I am not sure this is correct. For example, pyarrow==4.0.0 writes
> > >
> > > v2:
> > > * num_values: 6
> > > * num_nulls: 1
> > > * num_rows: 6
> > > v1:
> > > * num_values: 6
> > >
> > > Is there any reference for this?
> > >
> > > Are the extra numbers in v2 necessary to read a page? My understanding
> > > is that the (compressed_size, uncompressed_size, num_values) is enough
> > > for reading everything.
> > >
> > > Best,
> > > Jorge
> > >
> >
>
