Yeah, I guess we only ever write 4 values for the example, so even though the
wording is strange, with num_values = 6 (which I don't think anyone is
debating) num_nulls must be 2.  Still a little confusing.
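
To make the arithmetic concrete, here is a rough sketch of how the counts fall
out of the repetition/definition levels for the example, assuming the schema
Gabor guessed (optional list of optional int32, so the maximum definition level
is 3). The helper names are made up for illustration and are not from any
Parquet library; how empty lists should be counted is also an assumption of
this sketch.

```python
# Hypothetical helpers, not part of pyarrow or parquet-mr.
# Assumed schema: optional group (LIST) { repeated group list { optional int32 element } }
MAX_DEF = 3  # optional list (1) + repeated group (1) + optional element (1)


def levels(rows):
    """Return (repetition, definition) level pairs for a column of optional lists."""
    out = []
    for row in rows:
        if row is None:
            out.append((0, 0))  # null list
        elif len(row) == 0:
            out.append((0, 1))  # empty list: defined, but has no elements
        else:
            for i, item in enumerate(row):
                rep = 0 if i == 0 else 1  # 0 starts a new row, 1 continues the list
                out.append((rep, 2 if item is None else MAX_DEF))
    return out


def page_counts(rows):
    lv = levels(rows)
    num_values = len(lv)
    # Entries at the maximum definition level are the ones stored in the data section.
    stored = sum(1 for _, d in lv if d == MAX_DEF)
    return {
        "num_values": num_values,
        "num_rows": sum(1 for rep, _ in lv if rep == 0),
        # Reading "non-null = num_values - num_nulls" literally. Note this would
        # also count an empty list as a null, which is an assumption here.
        "num_nulls": num_values - stored,
        # Nulls at the leaf only, which seems to be what pyarrow 4.0.0 reports.
        "leaf_nulls": sum(1 for _, d in lv if d == MAX_DEF - 1),
        "stored_values": stored,
    }


counts = page_counts([[0, 1], None, [2, None, 3]])
print(counts)
```

For the example this gives num_values = 6, num_rows = 3 and num_nulls = 2, with
4 values in the data section and 1 null at the leaf.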

On Thu, Jul 15, 2021 at 11:43 PM Jorge Cardoso Leitão <
[email protected]> wrote:

> Thanks, that was exactly what I was looking for.
>
> I do think we could offer this or other examples in the spec to make it
> clear what they represent (including the null count).
>
> I filed ARROW-13349 to track the pyarrow discrepancy.
>
> Best,
> Jorge
>
>
> On Thu, Jul 15, 2021 at 7:28 PM Micah Kornfield <[email protected]>
> wrote:
>
>> >
>> > I don't have any experience in pyarrow but either it writes wrong values
>> > into these fields or the schema is not the same as the one in your
>> > example.
>>
>>
>>  The number of rows from pyarrow is clearly a bug (the code passes
>> num_values for both).
>>
>> I think it might be worth discussing the null count some more. I think
>> pyarrow is considering only null values at the leaf of the schema which is
>> why the value is 1.   The full comment from the specification says "Number
>> of non-null = num_values - num_nulls which is also the number of values in
>> the data section".
>>
>> "number of values in the data section" seems to be at odds with counting
>> nulls at every level, since we only store values when they are non-null at
>> leaf (empty lists are only stored in repetition/definition level).  But I
>> might be misinterpreting this.  If null_count is intended to capture nulls
>> at any level of the schema it seems we should update the documentation to
>> be clearer on this point.  We should also make the same clarification on
>> "null_count" for page statistics.
>>
>> Thanks,
>> Micah
>>
>> On Thu, Jul 15, 2021 at 8:44 AM Gabor Szadovszky <[email protected]>
>> wrote:
>>
>> > Hi Jorge,
>> >
>> > Please correct me if I'm wrong but it seems the schema of your column is
>> > similar to the following:
>> > optional group column1 (LIST) {
>> >   repeated group list {
>> >     optional int32 element;
>> >   }
>> > }
>> >
>> > Based on the specs in the thrift file
>> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L559-L565>:
>> >
>> >    - num_values: the number of all values (including nulls) in the page.
>> >    This is 6 in your example.
>> >    - num_nulls: the number of null values in the page. The spec says
>> >    "non-null = num_values - num_nulls", so we do not care about the level
>> >    of the null value, only that it is null. So, the correct value for
>> >    your example is 2.
>> >    - num_rows: the number of "first level objects" in the page. In other
>> >    words, the number of rows for the column in the current page. If the
>> >    column is a primitive (not a nested type), this value equals
>> >    num_values. In your example the correct value is 3.
>> >
>> > I don't have any experience in pyarrow but either it writes wrong values
>> > into these fields or the schema is not the same as the one in your
>> > example.
>> >
>> > Since compressed_size and num_values are enough for reading a V1 page,
>> > they should be enough to read a V2 page as well. The problem is that
>> > num_nulls and num_rows are also required fields of the V2 page header,
>> > so you must fill them with the correct values.
>> >
>> > Regards,
>> > Gabor
>> >
>> > On Thu, Jul 15, 2021 at 10:00 AM Jorge Cardoso Leitão <
>> > [email protected]> wrote:
>> >
>> > > In the V2 data page header, we have:
>> > >
>> > > * num_values
>> > > * num_rows
>> > > * num_nulls
>> > >
>> > > While on the V1 data page header, we only have "num_values".
>> > >
>> > > On a page representing a list, e.g. [[0, 1], None, [2, None, 3]], how
>> > > should each of these numbers be written in v1 and v2?
>> > >
>> > > My current understanding from the docs is that for the example above,
>> > > we should write:
>> > >
>> > > v2:
>> > > * num_values: 6
>> > > * num_rows: 3
>> > > * num_nulls: 2
>> > >
>> > > v1:
>> > > * num_values: 6
>> > >
>> > > But I am not sure this is correct. For example, pyarrow==4.0.0 writes
>> > >
>> > > v2:
>> > > * num_values: 6
>> > > * num_nulls: 1
>> > > * num_rows: 6
>> > > v1:
>> > > * num_values: 6
>> > >
>> > > Is there any reference for this?
>> > >
>> > > Are the extra numbers in v2 necessary to read a page? My understanding
>> > > is that (compressed_size, uncompressed_size, num_values) is enough for
>> > > reading everything.
>> > >
>> > > Best,
>> > > Jorge
>> > >
>> >
>>
>
