Yeah, I guess we only ever write 4 values for the example, so even though the wording is strange, with num_values = 6 the null count must be 2 (which I don't think anyone is debating). Still a little confusing.

On Thu, Jul 15, 2021 at 11:43 PM Jorge Cardoso Leitão <[email protected]> wrote:

> Thanks, that was exactly what I was looking for.
>
> I do think we could offer this or other examples in the spec to make it
> clear what they represent (including the null count).
>
> I filed ARROW-13349 to track the pyarrow discrepancy.
>
> Best,
> Jorge
>
> On Thu, Jul 15, 2021 at 7:28 PM Micah Kornfield <[email protected]> wrote:
>
>> > I don't have any experience in pyarrow but either it writes wrong values
>> > into these fields or the schema is not the same as the one in your example.
>>
>> The number of rows from pyarrow is clearly a bug (the code passes
>> num_values for both).
>>
>> I think it might be worth discussing the null count some more. I think
>> pyarrow is considering only null values at the leaf of the schema which is
>> why the value is 1. The full comment from the specification says "Number
>> of non-null = num_values - num_nulls which is also the number of values in
>> the data section".
>>
>> "number of values in the data section" seems to be at odds with counting
>> nulls at every level, since we only store values when they are non-null at
>> leaf (empty lists are only stored in repetition/definition level). But I
>> might be misinterpreting this. If null_count is intended to capture nulls
>> at any level of the schema it seems we should update the documentation to
>> be clearer on this point. We should also make the same clarification on
>> "null_count" for page statistics.
>>
>> Thanks,
>> Micah
>>
>> On Thu, Jul 15, 2021 at 8:44 AM Gabor Szadovszky <[email protected]> wrote:
>>
>> > Hi Jorge,
>> >
>> > Please correct me if I'm wrong but it seems the schema of your column is
>> > similar to the following:
>> > optional group column1 (LIST) {
>> >   repeated group list {
>> >     optional int32 element;
>> >   }
>> > }
>> >
>> > Based on the specs in the thrift file
>> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L559-L565>:
>> >
>> > - num_values: the number of all values (including nulls) in the page.
>> >   This is 6 in your example.
>> > - num_nulls: the number of null values in the page. The spec says
>> >   "non-null = num_values - num_nulls" so we do not care about the level of
>> >   the null value only that it is null. So, the correct value for your
>> >   example is 2.
>> > - num_rows: the number of "first level objects" in the page. In other
>> >   words the number of rows for the column in the current page. If the
>> >   column is a primitive (not a nested type) this value equals to
>> >   num_values. In your example the correct value is 3.
>> >
>> > I don't have any experience in pyarrow but either it writes wrong values
>> > into these fields or the schema is not the same as the one in your example.
>> >
>> > Since compressed_size and num_values are enough for reading a V1 page they
>> > shall be enough to read a V2 page as well. The problem is num_nulls and
>> > num_rows are also required fields of the V2 page header so you must fill
>> > them with the correct values.
>> >
>> > Regards,
>> > Gabor
>> >
>> > On Thu, Jul 15, 2021 at 10:00 AM Jorge Cardoso Leitão <[email protected]> wrote:
>> >
>> > > In the V2 data page header, we have:
>> > >
>> > > * num_values
>> > > * num_rows
>> > > * num_nulls
>> > >
>> > > While on the V1 data page header, we only have "num_values".
>> > >
>> > > On a page representing a list, e.g. [[0, 1], None, [2, None, 3]], how
>> > > should each of these numbers be written in v1 and v2?
>> > >
>> > > My current understanding from the docs is that for the example above, we
>> > > should write:
>> > >
>> > > v2:
>> > > * num_values: 6
>> > > * num_rows: 3
>> > > * num_nulls: 2
>> > >
>> > > v1:
>> > > * num_values: 6
>> > >
>> > > But I am not sure this is correct. For example, pyarrow==4.0.0 writes
>> > >
>> > > v2:
>> > > * num_values: 6
>> > > * num_nulls: 1
>> > > * num_rows: 6
>> > > v1:
>> > > * num_values: 6
>> > >
>> > > Is there any reference for this?
>> > >
>> > > Are the extra numbers in v2 necessary to read a page? My understanding is
>> > > that the (compressed_size, uncompressed_size, num_values) is enough for
>> > > reading everything.
>> > >
>> > > Best,
>> > > Jorge
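
For reference, the counts discussed in this thread can be reproduced with a small Python sketch. It is not taken from pyarrow or parquet-mr: the helper names (levels, page_counts) are hypothetical, and it assumes the three-level LIST schema Gabor wrote out, so the maximum definition level is 3 and the maximum repetition level is 1. It reports both candidate null counts, since which of them num_nulls should hold is the open question here.

```python
from typing import Dict, List, Optional, Tuple

# Assumed definition levels for:
#   optional group column1 (LIST) { repeated group list { optional int32 element; } }
# 0 = list is null, 1 = list is empty, 2 = element is null, 3 = element is present
MAX_DEF = 3

Row = Optional[List[Optional[int]]]


def levels(rows: List[Row]) -> List[Tuple[int, int]]:
    """Return one (repetition_level, definition_level) pair per level entry."""
    out: List[Tuple[int, int]] = []
    for row in rows:
        if row is None:
            out.append((0, 0))  # null list: a single entry, nothing in the data section
        elif not row:
            out.append((0, 1))  # empty list: a single entry, nothing in the data section
        else:
            for i, item in enumerate(row):
                rep = 0 if i == 0 else 1  # rep 0 marks the start of a new row
                out.append((rep, 2 if item is None else MAX_DEF))
    return out


def page_counts(rows: List[Row]) -> Dict[str, int]:
    """Compute the header counts plus both interpretations of the null count."""
    lv = levels(rows)
    num_values = len(lv)
    stored = sum(1 for _, d in lv if d == MAX_DEF)
    return {
        "num_values": num_values,                         # all level entries
        "num_rows": sum(1 for r, _ in lv if r == 0),      # first-level objects
        "stored_values": stored,                          # values in the data section
        "not_stored": num_values - stored,                # num_values - non-null
        "leaf_nulls": sum(1 for _, d in lv if d == MAX_DEF - 1),  # null elements only
    }


print(page_counts([[0, 1], None, [2, None, 3]]))
```

For the example page this gives num_values = 6, num_rows = 3 and 4 stored values, so "num_values - num_nulls" only matches the data section when num_nulls is 2; counting leaf nulls alone gives the 1 that pyarrow writes. The example contains no empty list, so the sketch does not settle whether an empty list should count toward num_nulls; here it simply falls under "not_stored".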
