> I don't have any experience in pyarrow but either it writes wrong values
> into these fields or the schema is not the same as the one in your example.

The number of rows from pyarrow is clearly a bug (the code passes num_values
for both).

I think it might be worth discussing the null count some more. I think
pyarrow is considering only null values at the leaf of the schema which is
why the value is 1. The full comment from the specification says "Number of
non-null = num_values - num_nulls which is also the number of values in the
data section". "number of values in the data section" seems to be at odds
with counting nulls at every level, since we only store values when they are
non-null at leaf (empty lists are only stored in repetition/definition
level). But I might be misinterpreting this.

If null_count is intended to capture nulls at any level of the schema it
seems we should update the documentation to be clearer on this point. We
should also make the same clarification on "null_count" for page statistics.

Thanks,
Micah

On Thu, Jul 15, 2021 at 8:44 AM Gabor Szadovszky <[email protected]> wrote:

> Hi Jorge,
>
> Please correct me if I'm wrong but it seems the schema of your column is
> similar to the following:
> optional group column1 (LIST) {
>   repeated group list {
>     optional int32 element;
>   }
> }
>
> Based on the specs in the thrift file
> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L559-L565>:
>
>    - num_values: the number of all values (including nulls) in the page.
>    This is 6 in your example.
>    - num_nulls: the number of null values in the page. The spec says
>    "non-null = num_values - num_nulls" so we do not care about the level of
>    the null value only that it is null. So, the correct value for your
>    example is 2.
>    - num_rows: the number of "first level objects" in the page. In other
>    words the number of rows for the column in the current page. If the
>    column is a primitive (not a nested type) this value equals to
>    num_values. In your example the correct value is 3.
>
> I don't have any experience in pyarrow but either it writes wrong values
> into these fields or the schema is not the same as the one in your example.
>
> Since compressed_size and num_values are enough for reading a V1 page they
> shall be enough to read a V2 page as well. The problem is num_nulls and
> num_rows are also required fields of the V2 page header so you must fill
> them with the correct values.
>
> Regards,
> Gabor
>
> On Thu, Jul 15, 2021 at 10:00 AM Jorge Cardoso Leitão <[email protected]> wrote:
>
> > In the V2 data page header, we have:
> >
> > * num_values
> > * num_rows
> > * num_nulls
> >
> > While on the V1 data page header, we only have "num_values".
> >
> > On a page representing a list, e.g. [[0, 1], None, [2, None, 3]], how
> > should each of these numbers be written in v1 and v2?
> >
> > My current understanding from the docs is that for the example above, we
> > should write:
> >
> > v2:
> > * num_values: 6
> > * num_rows: 3
> > * num_nulls: 2
> >
> > v1:
> > * num_values: 6
> >
> > But I am not sure this is correct. For example, pyarrow==4.0.0 writes
> >
> > v2:
> > * num_values: 6
> > * num_nulls: 1
> > * num_rows: 6
> > v1:
> > * num_values: 6
> >
> > Is there any reference for this?
> >
> > Are the extra numbers in v2 necessary to read a page? My understanding is
> > that the (compressed_size, uncompressed_size, num_values) is enough for
> > reading everything.
> >
> > Best,
> > Jorge
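
For reference, the counts discussed in this thread can be reproduced with a small sketch. It assumes the three-level LIST schema quoted above (optional group, repeated group, optional int32 element), so the maximum definition level is 3 and the maximum repetition level is 1. The names shred and page_counts are hypothetical helpers written for this illustration; they are not pyarrow or parquet-mr APIs, and the code does not read or write any Parquet file.

```python
from typing import Dict, List, Optional, Tuple

Row = Optional[List[Optional[int]]]


def shred(rows: List[Row]) -> Tuple[List[int], List[int], List[int]]:
    """Return (repetition levels, definition levels, stored leaf values).

    Definition levels under the assumed schema:
      0 = the list itself is null
      1 = the list is present but empty
      2 = the element is null
      3 = the element is present (the only case that stores a value)
    """
    rep: List[int] = []
    dfn: List[int] = []
    values: List[int] = []
    for row in rows:
        if row is None:
            rep.append(0)
            dfn.append(0)
        elif len(row) == 0:
            rep.append(0)
            dfn.append(1)
        else:
            for i, item in enumerate(row):
                # Repetition level 0 marks the start of a new row.
                rep.append(0 if i == 0 else 1)
                if item is None:
                    dfn.append(2)
                else:
                    dfn.append(3)
                    values.append(item)
    return rep, dfn, values


def page_counts(rows: List[Row]) -> Dict[str, int]:
    """Compute the page-level counts under both null interpretations."""
    rep, dfn, values = shred(rows)
    return {
        "num_values": len(dfn),                                  # one entry per level pair
        "num_rows": sum(1 for r in rep if r == 0),               # rows start at rep level 0
        "nulls_any_level": sum(1 for d in dfn if d in (0, 2)),   # null list or null element
        "nulls_leaf_only": sum(1 for d in dfn if d == 2),        # null element only
        "stored_values": len(values),                            # values in the data section
    }


counts = page_counts([[0, 1], None, [2, None, 3]])
print(counts)
```

For the example page this gives num_values = 6 and num_rows = 3, with 2 nulls when counted at any level and 1 when counted only at the leaf; 4 values are actually stored. An empty list (definition level 1) is neither a null nor a stored value in this sketch, which is the case where the two readings of the spec comment are hardest to reconcile.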
