Thanks, that was exactly what I was looking for. I do think we could offer this or other examples in the spec to make it clear what they represent (including the null count).
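As a rough illustration of the kind of example the spec could carry, here is a minimal sketch of the three counts for the page in this thread. It assumes Gabor's reading (a null at any level counts toward num_nulls) and the schema he describes, an optional list of optional int32. The function names are hypothetical and are not part of parquet or pyarrow. How an empty list should count toward num_nulls is not settled in this thread, so the sketch leaves it out of num_nulls and says so in a comment.

```python
from typing import List, Optional, Tuple

Row = Optional[List[Optional[int]]]


def page_counts(rows: List[Row]) -> Tuple[int, int, int]:
    """Return (num_values, num_rows, num_nulls) for a page of optional lists.

    Assumption: nulls are counted at any level of the schema.
    """
    num_values = 0
    num_nulls = 0
    for row in rows:
        if row is None:
            # A null list takes one slot in the rep/def levels and is a null.
            num_values += 1
            num_nulls += 1
        elif len(row) == 0:
            # An empty list takes one slot in the levels and stores no value.
            # Assumption: not counted as a null here; the thread leaves this open.
            num_values += 1
        else:
            num_values += len(row)
            num_nulls += sum(1 for item in row if item is None)
    return num_values, len(rows), num_nulls


def leaf_null_count(rows: List[Row]) -> int:
    """Count only nulls at the leaf, the reading Micah suspects pyarrow uses."""
    return sum(1 for row in rows if row for item in row if item is None)


page = [[0, 1], None, [2, None, 3]]
print(page_counts(page))      # (num_values, num_rows, num_nulls)
print(leaf_null_count(page))  # nulls at the leaf only
```

For the page above, the first function gives 6 values, 3 rows and 2 nulls, and the leaf-only count gives 1, which matches the num_nulls that pyarrow wrote.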
I filed ARROW-13349 to track the pyarrow discrepancy.

Best,
Jorge

On Thu, Jul 15, 2021 at 7:28 PM Micah Kornfield <[email protected]> wrote:

> > I don't have any experience in pyarrow but either it writes wrong values
> > into these fields or the schema is not the same as the one in your
> > example.
>
> The number of rows from pyarrow is clearly a bug (the code passes
> num_values for both).
>
> I think it might be worth discussing the null count some more. I think
> pyarrow is considering only null values at the leaf of the schema which is
> why the value is 1. The full comment from the specification says "Number
> of non-null = num_values - num_nulls which is also the number of values in
> the data section".
>
> "number of values in the data section" seems to be at odds with counting
> nulls at every level, since we only store values when they are non-null at
> leaf (empty lists are only stored in repetition/definition level). But I
> might be misinterpreting this. If null_count is intended to capture nulls
> at any level of the schema it seems we should update the documentation to
> be clearer on this point. We should also make the same clarification on
> "null_count" for page statistics.
>
> Thanks,
> Micah
>
> On Thu, Jul 15, 2021 at 8:44 AM Gabor Szadovszky <[email protected]> wrote:
>
> > Hi Jorge,
> >
> > Please correct me if I'm wrong but it seems the schema of your column is
> > similar to the following:
> > optional group column1 (LIST) {
> >   repeated group list {
> >     optional int32 element;
> >   }
> > }
> >
> > Based on the specs in the thrift file
> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L559-L565>:
> >
> >    - num_values: the number of all values (including nulls) in the page.
> >    This is 6 in your example.
> >    - num_nulls: the number of null values in the page. The spec says
> >    "non-null = num_values - num_nulls" so we do not care about the level
> >    of the null value only that it is null. So, the correct value for your
> >    example is 2.
> >    - num_rows: the number of "first level objects" in the page. In other
> >    words the number of rows for the column in the current page. If the
> >    column is a primitive (not a nested type) this value equals to
> >    num_values. In your example the correct value is 3.
> >
> > I don't have any experience in pyarrow but either it writes wrong values
> > into these fields or the schema is not the same as the one in your
> > example.
> >
> > Since compressed_size and num_values are enough for reading a V1 page
> > they shall be enough to read a V2 page as well. The problem is num_nulls
> > and num_rows are also required fields of the V2 page header so you must
> > fill them with the correct values.
> >
> > Regards,
> > Gabor
> >
> > On Thu, Jul 15, 2021 at 10:00 AM Jorge Cardoso Leitão <
> > [email protected]> wrote:
> >
> > > In the V2 data page header, we have:
> > >
> > > * num_values
> > > * num_rows
> > > * num_nulls
> > >
> > > While on the V1 data page header, we only have "num_values".
> > >
> > > On a page representing a list, e.g. [[0, 1], None, [2, None, 3]], how
> > > should each of these numbers be written in v1 and v2?
> > >
> > > My current understanding from the docs is that for the example above,
> > > we should write:
> > >
> > > v2:
> > > * num_values: 6
> > > * num_rows: 3
> > > * num_nulls: 2
> > >
> > > v1:
> > > * num_values: 6
> > >
> > > But I am not sure this is correct. For example, pyarrow==4.0.0 writes
> > >
> > > v2:
> > > * num_values: 6
> > > * num_nulls: 1
> > > * num_rows: 6
> > > v1:
> > > * num_values: 6
> > >
> > > Is there any reference for this?
> > >
> > > Are the extra numbers in v2 necessary to read a page? My understanding
> > > is that the (compressed_size, uncompressed_size, num_values) is enough
> > > for reading everything.
> > >
> > > Best,
> > > Jorge
