Hi Jorge,

Please correct me if I'm wrong but it seems the schema of your column is
similar to the following:
optional group column1 (LIST) {
  repeated group list {
    optional int32 element;
  }
}

Based on the specs in the thrift file
<https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L559-L565>
:

   - num_values: the number of all values (including nulls) in the page.
   This is 6 in your example.
   - num_nulls: the number of null values in the page. The spec says
   "non-null = num_values - num_nulls", so the level at which a value is
   null does not matter, only that it is null. The correct value for your
   example is therefore 2 (the null list and the null element).
   - num_rows: the number of "first level objects" in the page. In other
   words, the number of rows for the column in the current page. If the
   column is a primitive (not a nested type), this value is equal to
   num_values. In your example the correct value is 3.
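
As a rough sketch of how these three numbers fall out of the repetition and
definition levels, assuming the three-level LIST schema above (maximum
definition level 3, maximum repetition level 1). The helper names `levels`
and `page_counts` are made up for illustration and are not taken from pyarrow
or parquet-mr. Counting every entry below the maximum definition level as a
null (which would include an empty list) is an assumption of this sketch.

```python
# Assumed schema: optional group column1 (LIST) > repeated group list > optional int32 element
MAX_DEF = 3  # optional column1 (1) + repeated list (2) + optional element (3)


def levels(rows):
    """Yield (repetition_level, definition_level) for each entry of the column."""
    for row in rows:
        if row is None:
            yield (0, 0)  # the list itself is null
        elif len(row) == 0:
            yield (0, 1)  # the list is present but empty
        else:
            for i, item in enumerate(row):
                rep = 0 if i == 0 else 1  # 0 starts a new row, 1 continues the list
                yield (rep, 2 if item is None else MAX_DEF)


def page_counts(rows):
    """Derive the V2 data page header counts from the levels."""
    lv = list(levels(rows))
    return {
        "num_values": len(lv),  # every level entry, nulls included
        "num_nulls": sum(1 for _, d in lv if d < MAX_DEF),
        "num_rows": sum(1 for r, _ in lv if r == 0),
    }


counts = page_counts([[0, 1], None, [2, None, 3]])
print(counts)  # {'num_values': 6, 'num_nulls': 2, 'num_rows': 3}
```
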

I don't have any experience with pyarrow, but either it writes wrong values
into these fields or the schema is not the same as the one in your example.

Since compressed_size and num_values are enough for reading a V1 page, they
should be enough to read a V2 page as well. The problem is that num_nulls and
num_rows are also required fields of the V2 page header, so you must fill
them with the correct values.

Regards,
Gabor

On Thu, Jul 15, 2021 at 10:00 AM Jorge Cardoso Leitão <
[email protected]> wrote:

> In the V2 data page header, we have:
>
> * num_values
> * num_rows
> * num_nulls
>
> While on the V1 data page header, we only have "num_values".
>
> On a page representing a list, e.g. [[0, 1], None, [2, None, 3]], how
> should each of these numbers be written in v1 and v2?
>
> My current understanding from the docs is that for the example above, we
> should write:
>
> v2:
> * num_values: 6
> * num_rows: 3
> * num_nulls: 2
>
> v1:
> * num_values: 6
>
> But I am not sure this is correct. For example, pyarrow==4.0.0 writes
>
> v2:
> * num_values: 6
> * num_nulls: 1
> * num_rows: 6
> v1:
> * num_values: 6
>
> Is there any reference for this?
>
> Are the extra numbers in v2 necessary to read a page? My understanding is
> that the (compressed_size, uncompressed_size, num_values) is enough for
> reading everything.
>
> Best,
> Jorge
>
