Note that I moved the Arrow JIRA under Parquet, since I think this only affects the core Parquet part of the implementation. I also created PARQUET-2067 to track the incorrect null counts (this might actually touch some Arrow code, but I did this for consistency).

Thanks,
Micah

On Thu, Jul 15, 2021 at 11:51 PM Micah Kornfield wrote:

> Yeah, I guess we only ever write 4 values for the example, so even though
> the wording is strange, given num_values = 6 (which I don't think anyone is
> debating), num_nulls must be 2. Still a little confusing.
>
> On Thu, Jul 15, 2021 at 11:43 PM Jorge Cardoso Leitão wrote:
>
>> Thanks, that was exactly what I was looking for.
>>
>> I do think we could offer this or other examples in the spec to make it
>> clear what they represent (including the null count).
>>
>> I filed ARROW-13349 to track the pyarrow discrepancy.
>>
>> Best,
>> Jorge
>>
>>
>> On Thu, Jul 15, 2021 at 7:28 PM Micah Kornfield wrote:
>>
>>> >
>>> > I don't have any experience in pyarrow but either it writes wrong values
>>> > into these fields or the schema is not the same as the one in your example.
>>>
>>> The number of rows from pyarrow is clearly a bug (the code passes
>>> num_values for both).
>>>
>>> I think it might be worth discussing the null count some more. I think
>>> pyarrow is considering only null values at the leaf of the schema, which is
>>> why the value is 1. The full comment from the specification says "Number
>>> of non-null = num_values - num_nulls which is also the number of values in
>>> the data section".
>>>
>>> "Number of values in the data section" seems to be at odds with counting
>>> nulls at every level, since we only store values when they are non-null at
>>> the leaf (empty lists are only stored in repetition/definition levels). But I
>>> might be misinterpreting this. If null_count is intended to capture nulls
>>> at any level of the schema, it seems we should update the documentation to
>>> be clearer on this point. We should also make the same clarification on
>>> "null_count" for page statistics.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Thu, Jul 15, 2021 at 8:44 AM Gabor Szadovszky wrote:
>>>
>>> > Hi Jorge,
>>> >
>>> > Please correct me if I'm wrong but it seems the schema of your column is
>>> > similar to the following:
>>> > optional group column1 (LIST) {
>>> >   repeated group list {
>>> >     optional int32 element;
>>> >   }
>>> > }
>>> >
>>> > Based on the specs in the thrift file
>>> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L559-L565>:
>>> >
>>> > - num_values: the number of all values (including nulls) in the page.
>>> >   This is 6 in your example.
>>> > - num_nulls: the number of null values in the page. The spec says
>>> >   "non-null = num_values - num_nulls" so we do not care about the level of
>>> >   the null value, only that it is null. So, the correct value for your
>>> >   example is 2.
>>> > - num_rows: the number of "first level objects" in the page. In other
>>> >   words, the number of rows for the column in the current page. If the
>>> >   column is a primitive (not a nested type) this value is equal to
>>> >   num_values. In your example the correct value is 3.
>>> >
>>> > I don't have any experience in pyarrow but either it writes wrong values
>>> > into these fields or the schema is not the same as the one in your example.
>>> >
>>> > Since compressed_size and num_values are enough for reading a V1 page, they
>>> > should be enough to read a V2 page as well. The problem is that num_nulls and
>>> > num_rows are also required fields of the V2 page header, so you must fill
>>> > them with the correct values.
>>> >
>>> > Regards,
>>> > Gabor
>>> >
>>> > On Thu, Jul 15, 2021 at 10:00 AM Jorge Cardoso Leitão wrote:
>>> >
>>> > > In the V2 data page header, we have:
>>> > >
>>> > > * num_values
>>> > > * num_rows
>>> > > * num_nulls
>>> > >
>>> > > While on the V1 data page header, we only have "num_values".
>>> > >
>>> > > On a page representing a list, e.g. [[0, 1], None, [2, None, 3]], how
>>> > > should each of these numbers be written in v1 and v2?
>>> > >
>>> > > My current understanding from the docs is that for the example above, we
>>> > > should write:
>>> > >
>>> > > v2:
>>> > > * num_values: 6
>>> > > * num_rows: 3
>>> > > * num_nulls: 2
>>> > >
>>> > > v1:
>>> > > * num_values: 6
>>> > >
>>> > > But I am not sure this is correct. For example, pyarrow==4.0.0 writes
>>> > >
>>> > > v2:
>>> > > * num_values: 6
>>> > > * num_nulls: 1
>>> > > * num_rows: 6
>>> > > v1:
>>> > > * num_values: 6
>>> > >
>>> > > Is there any reference for this?
>>> > >
>>> > > Are the extra numbers in v2 necessary to read a page? My understanding is
>>> > > that the (compressed_size, uncompressed_size, num_values) is enough for
>>> > > reading everything.
>>> > >
>>> > > Best,
>>> > > Jorge
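
For anyone following along, here is a rough sketch in plain Python of how the counts discussed in this thread fall out of the repetition/definition levels for the example page. It assumes the three-level LIST schema Gabor guessed at (optional list, repeated entry, optional int32 element), so the maximum definition level is taken to be 3. The function and key names (levels, page_counts, num_nulls_by_formula, num_nulls_leaf_only) are made up for illustration and are not part of pyarrow or parquet-mr.

```python
# Assumed schema: optional group (LIST) { repeated group list { optional int32 element } }
# Definition levels under that assumption:
#   0 = the list itself is null, 1 = empty list, 2 = null element, 3 = present element
MAX_DEF = 3


def levels(rows):
    """Return (repetition_level, definition_level) pairs for one page of list rows."""
    out = []
    for row in rows:
        if row is None:
            out.append((0, 0))  # null list: one level entry, nothing in the data section
        elif len(row) == 0:
            out.append((0, 1))  # empty list: one level entry, nothing in the data section
        else:
            for i, item in enumerate(row):
                rep = 0 if i == 0 else 1  # rep 0 marks the start of a new row
                out.append((rep, 2 if item is None else MAX_DEF))
    return out


def page_counts(rows):
    """Compute the header counts under both readings of num_nulls (hypothetical helper)."""
    lv = levels(rows)
    stored = sum(1 for _, d in lv if d == MAX_DEF)
    return {
        "num_values": len(lv),
        "num_rows": sum(1 for rep, _ in lv if rep == 0),
        "stored_values": stored,
        # reading 1: num_nulls = num_values - number of values in the data section
        "num_nulls_by_formula": len(lv) - stored,
        # reading 2: only nulls at the leaf of the schema
        "num_nulls_leaf_only": sum(1 for _, d in lv if d == MAX_DEF - 1),
    }


counts = page_counts([[0, 1], None, [2, None, 3]])
print(counts)
```

For the example page this gives num_values 6, num_rows 3 and 4 stored values, with num_nulls equal to 2 under the formula reading and 1 under the leaf-only reading, which matches the two numbers being compared above. Note that under the formula reading an empty list would also be counted, even though it is not a null, which is part of why the spec wording seems worth clarifying.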
