Re: num_values vs num_rows vs num_nulls

Micah Kornfield Fri, 16 Jul 2021 00:00:20 -0700

Note I moved the Arrow JIRA under parquet since I think this only affects
the core-parquet part of the implementation.  I also created PARQUET-2067
to track the incorrect null counts (this might actually touch some arrow
code but I did this for consistency).


Thanks,
Micah

On Thu, Jul 15, 2021 at 11:51 PM Micah Kornfield <[email protected]>
wrote:

> Yeah, I guess we only ever write 4 values for the example, so even though
> the wording is strange, with num_values = 6 (which I don't think anyone is
> debating) num_nulls must be 2.  Still a little confusing.
>
> On Thu, Jul 15, 2021 at 11:43 PM Jorge Cardoso Leitão <
> [email protected]> wrote:
>
>> Thanks, that was exactly what I was looking for.
>>
>> I do think we could offer this or other examples in the spec to make it
>> clear what they represent (including the null count).
>>
>> I filed ARROW-13349 to track the pyarrow discrepancy.
>>
>> Best,
>> Jorge
>>
>>
>> On Thu, Jul 15, 2021 at 7:28 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> >
>>> > I don't have any experience in pyarrow but either it writes wrong values
>>> > into these fields or the schema is not the same as the one in your example.
>>>
>>>
>>> The num_rows value written by pyarrow is clearly a bug (the code passes
>>> num_values for both fields).
>>>
>>> I think it might be worth discussing the null count some more. I think
>>> pyarrow is considering only null values at the leaf of the schema, which
>>> is why the value is 1.  The full comment from the specification says
>>> "Number of non-null = num_values - num_nulls which is also the number of
>>> values in the data section".
>>>
>>> "number of values in the data section" seems to be at odds with counting
>>> nulls at every level, since we only store values when they are non-null
>>> at the leaf (empty lists are only stored in repetition/definition levels).
>>> But I might be misinterpreting this.  If null_count is intended to capture
>>> nulls at any level of the schema, it seems we should update the
>>> documentation to be clearer on this point.  We should also make the same
>>> clarification on "null_count" for page statistics.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Thu, Jul 15, 2021 at 8:44 AM Gabor Szadovszky <[email protected]>
>>> wrote:
>>>
>>> > Hi Jorge,
>>> >
>>> > Please correct me if I'm wrong but it seems the schema of your column is
>>> > similar to the following:
>>> > optional group column1 (LIST) {
>>> >   repeated group list {
>>> >     optional int32 element;
>>> >   }
>>> > }
>>> >
>>> > Based on the specs in the thrift file
>>> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L559-L565>:
>>> >
>>> >    - num_values: the number of all values (including nulls) in the page.
>>> >    This is 6 in your example.
>>> >    - num_nulls: the number of null values in the page. The spec says
>>> >    "non-null = num_values - num_nulls" so we do not care about the level
>>> >    of the null value, only that it is null. So, the correct value for
>>> >    your example is 2.
>>> >    - num_rows: the number of "first level objects" in the page. In other
>>> >    words, the number of rows for the column in the current page. If the
>>> >    column is a primitive (not a nested type) this value equals num_values.
>>> >    In your example the correct value is 3.
>>> >
>>> > I don't have any experience in pyarrow but either it writes wrong values
>>> > into these fields or the schema is not the same as the one in your example.
>>> >
>>> > Since compressed_size and num_values are enough for reading a V1 page,
>>> > they should be enough to read a V2 page as well. The problem is that
>>> > num_nulls and num_rows are also required fields of the V2 page header, so
>>> > you must fill them with the correct values.
>>> >
>>> > Regards,
>>> > Gabor
>>> >
>>> > On Thu, Jul 15, 2021 at 10:00 AM Jorge Cardoso Leitão <
>>> > [email protected]> wrote:
>>> >
>>> > > In the V2 data page header, we have:
>>> > >
>>> > > * num_values
>>> > > * num_rows
>>> > > * num_nulls
>>> > >
>>> > > While on the V1 data page header, we only have "num_values".
>>> > >
>>> > > On a page representing a list, e.g. [[0, 1], None, [2, None, 3]], how
>>> > > should each of these numbers be written in v1 and v2?
>>> > >
>>> > > My current understanding from the docs is that for the example above, we
>>> > > should write:
>>> > >
>>> > > v2:
>>> > > * num_values: 6
>>> > > * num_rows: 3
>>> > > * num_nulls: 2
>>> > >
>>> > > v1:
>>> > > * num_values: 6
>>> > >
>>> > > But I am not sure this is correct. For example, pyarrow==4.0.0 writes
>>> > >
>>> > > v2:
>>> > > * num_values: 6
>>> > > * num_nulls: 1
>>> > > * num_rows: 6
>>> > > v1:
>>> > > * num_values: 6
>>> > >
>>> > > Is there any reference for this?
>>> > >
>>> > > Are the extra numbers in v2 necessary to read a page? My understanding
>>> > > is that (compressed_size, uncompressed_size, num_values) is enough for
>>> > > reading everything.
>>> > >
>>> > > Best,
>>> > > Jorge
>>> > >
>>> >
>>>
>>
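
The counts discussed in this thread for the example page
[[0, 1], None, [2, None, 3]] can be reproduced with a small sketch. It is
not taken from parquet-mr or pyarrow; the helper names (shred, page_counts)
are hypothetical, and the level assignment assumes the three-level LIST
schema Gabor quoted (max definition level 3, max repetition level 1). It
also assumes that an empty list, which does not occur in the example, would
count toward num_nulls because it stores no value in the data section.

```python
# Illustrative only: Dremel-style levels for
#   optional group column1 (LIST) { repeated group list { optional int32 element; } }
MAX_DEF = 3  # 1: list present, 2: list has an entry, 3: element is non-null


def shred(rows):
    """Return (repetition levels, definition levels, stored leaf values)."""
    rep, dfn, values = [], [], []
    for row in rows:
        if row is None:            # null list
            rep.append(0)
            dfn.append(0)
        elif len(row) == 0:        # empty list (assumption: one level entry, no value)
            rep.append(0)
            dfn.append(1)
        else:
            for i, item in enumerate(row):
                rep.append(0 if i == 0 else 1)  # 0 starts a new row
                if item is None:   # null element inside a list
                    dfn.append(2)
                else:
                    dfn.append(MAX_DEF)
                    values.append(item)
    return rep, dfn, values


def page_counts(rows):
    rep, dfn, values = shred(rows)
    return {
        "num_values": len(dfn),                          # one per level entry
        "num_rows": sum(1 for r in rep if r == 0),       # first-level objects
        "num_nulls": sum(1 for d in dfn if d < MAX_DEF), # null at any level
        "leaf_nulls": sum(1 for d in dfn if d == MAX_DEF - 1),
        "stored_values": len(values),                    # data section size
    }


counts = page_counts([[0, 1], None, [2, None, 3]])
print(counts)
```

Under these assumptions num_values - num_nulls equals the number of values
in the data section (6 - 2 = 4), which matches the interpretation that
num_nulls counts nulls at any level. Counting only leaf-level nulls gives 1,
the value pyarrow 4.0.0 was reported to write.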
