Thanks Ryan for the clarification! I think now I'm confident in my interpretation and updated the PR [1]. Could you please check it out when you have time?
[1] https://github.com/apache/parquet-format/pull/466 Best, Gang On Fri, Nov 1, 2024 at 6:36 AM rdb...@gmail.com <rdb...@gmail.com> wrote: > Hopefully I can help because I wrote those rules. > > I think that the correct type is List<List<Int32>>. Because none of the > first 3 rules apply, the element type is the repeated type, which is a > repeated int32. > > The rules are primarily trying to account for cases where known structures > were used. If the repeated type is not a group then it can't be an > "element" group. If it is a group with more than one field, then it can't > be an "element" group. Rule 3 accounts for some known conventions. > Everything else (rule 4) assumes that the repeated type is the element type > of the list. > > My interpretation is that while the repeated type is a group of one > element, we know that the 3-level list structure is not used because the > inner type is repeated. Above this section, the spec stats that "Element > repetition must be `required` or `optional`, so we know it is not an > element group. Maybe that's a new rule we should document. > > On Thu, Oct 31, 2024 at 2:36 AM Gang Wu <ust...@gmail.com> wrote: > > > > What does "the repeated field's type is the element type with the > > repeated > > > field's repetition" mean? Is "repeated field's type" equivalent to "the > > > type of the only field of the repeated group"? So, is the expected > > outcome > > > of rule 4 an actual List<element> without a Tuple/Struct? > > > > Yes, that's exactly my interpretation. In rule 4, the repeated field > > MUST be a group with a single field, which is as below: > > ``` > > optional group foo (LIST) { > > repeated group bar (ANNOTATION) { > > required|optional|repeated TYPE baz; > > }; > > } > > ``` > > > > There are following cases based on ANNOTATION and TYPE: > > > > *(a) Three-level list encoding with different names: List<TYPE > ?nullable>* > > ``` > > optional group foo (LIST) { > > repeated group bar { > > required|optional TYPE baz; > > }; > > } > > ``` > > - ANNOTATION: not specified > > - TYPE: either primitive or group > > > > *(b) Two-level list encoding: List<List<TYPE notnull> notnull>* > > ``` > > optional group foo (LIST) { > > repeated group bar (LIST) { > > repeated TYPE baz; > > }; > > } > > ``` > > - ANNOTATION: MUST be LIST. Three-level LIST and MAP cannot have > > repeated repetition according to the spec. > > - TYPE: either primitive or group (which MUST be two-level list) > > > > *(c) Two-level list encoding: List<OneTuple<List<TYPE notnull>> notnull>* > > ``` > > optional group foo (LIST) { > > repeated group bar { > > repeated TYPE baz; > > }; > > } > > ``` > > - ANNOTATION: not specified > > - TYPE: either primitive or group (which MUST be two-level list) > > > > Please note that (c) assumes that `repeated TYPE baz` produces > > `List<TYPE>` without LIST annotation. > > > > > > > We should list the examples at the actual rule or at least reference > the > > > related rule to be more clear. > > > > Good suggestion! I'll try to add them. > > > > Best, > > Gang > > > > > > On Thu, Oct 31, 2024 at 4:41 PM Gábor Szádovszky <ga...@apache.org> > wrote: > > > > > Thanks, Gang. > > > > > > Maybe, I don't get the actual expected outcome of rule 4. What does > "the > > > repeated field's type is the element type with the repeated field's > > > repetition" mean? Is "repeated field's type" equivalent to "the type of > > the > > > only field of the repeated group"? So, is the expected outcome of rule > 4 > > an > > > actual List<element> without a Tuple/Struct? > > > We should list the examples at the actual rule or at least reference > the > > > related rule to be more clear. > > > > > > Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 31., Cs, > 3:03): > > > > > > > I think the main differences between point 3 and 4 are as below: > > > > - Name: point 3 is the only exception to the suggestion that names > > should > > > > not be used. Since point 3 has higher priority than point 4, we > don't > > > > care > > > > about names when applying point 4. > > > > - Structure: point 3 is a special case to produce > > List<OneTuple<element>> > > > > and it is still a three-level list. However, point 4 might also > > produce > > > > List<List<element>> when the only field in the repeated field is > also > > > > repeated and this is a nested two-level list. > > > > > > > > On Wed, Oct 30, 2024 at 11:35 PM Gábor Szádovszky <ga...@apache.org> > > > > wrote: > > > > > > > > > The problem is not only whether 3-4 rules are clearly defined, but > > > their > > > > > expected outcome, especially in the light of "...these names may > not > > be > > > > > used in existing data and should not be enforced as errors when > > > > reading.". > > > > > It means that a reader should not rely on the naming of the "nodes" > > in > > > a > > > > > 3-level list but the structure itself. A 3-level list structure > shall > > > be > > > > > interpreted as a List<element>. > > > > > Meanwhile the 3-4 rules are just about the naming of a 3-level > > > structure, > > > > > both to be interpreted as a List<Tuple<element>> (or with your > > > suggestion > > > > > List<Struct<element>>). > > > > > > > > > > Regards, > > > > > GAbor > > > > > > > > > > Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 30., Sze, > > > > 13:57): > > > > > > > > > > > Thanks Gabor for the clarification. Please see my inline comments > > > > below. > > > > > > > > > > > > > 2. If the repeated field is a group with multiple fields, then > > its > > > > type > > > > > > > is the element type and elements are required. > > > > > > > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: this > is > > > > > > actually > > > > > > > a Struct instead of a Tuple) > > > > > > > > > > > > Yes, I agree that this is clear. I think we need to use Struct in > > the > > > > > spec. > > > > > > > > > > > > > 3. If the repeated field is a group with one field and is named > > > > either > > > > > > > array or uses the LIST-annotated group's name with _tuple > > appended > > > > then > > > > > > the > > > > > > > repeated type is the element type and elements are required. > > > > > > > What does it actually mean? With all these very specific naming > > > > > > constraints > > > > > > > we still say "...the repeated type is the element type...", > > hence: > > > > > > > `@Nullable List<@Nonnull OneTuple<...>>`. Even examples state > the > > > > same. > > > > > > Why > > > > > > > is it different from point 4? > > > > > > > > > > > > I think this is unclear about two things: > > > > > > - the repeated group field SHOULD NOT be LIST-annotated. > > > > > > - the only field within the the repeated group SHOULD NOT be > > repeated > > > > > > > > > > > > IMO, point 4 is pretty vague and it applies to all unhandled > cases > > > from > > > > > > point 1 to 3. > > > > > > > > > > > > > Instead of having such rules it would be much better to > actually > > > > > specify > > > > > > > steps to identify a structure from the point of facing a > LIST/MAP > > > > > > > logical types and do recursion at the element level so it is > > clear > > > > how > > > > > to > > > > > > > specify deeply nested structures. > > > > > > > > > > > > I agree. However this is more code-oriented and description of > all > > > > other > > > > > > types should be changed together to achieve this. > > > > > > > > > > > > > We may even extend the current ones. For example I've seen > > Parquet > > > > > > schemas > > > > > > > with repeated primitives without any LIST logical types. We > > should > > > > > accept > > > > > > > these as well as a `@Nonnull List<@Nonnull primitive>`. > > > > > > > > > > > > Yes, I just saw exactly the same issue today: > > > > > > https://github.com/apache/arrow-rs/issues/6648 > > > > > > > > > > > > Best, > > > > > > Gang > > > > > > > > > > > > On Wed, Oct 30, 2024 at 6:36 PM Gábor Szádovszky < > ga...@apache.org > > > > > > > > wrote: > > > > > > > > > > > > > One more thing, however this is not about lists but maps. In > the > > > > > backward > > > > > > > compatibility rules of maps [4] it says "It is required that > the > > > > > repeated > > > > > > > group of key-value pairs is named key_value and that its fields > > are > > > > > named > > > > > > > key and value. However, these names may not be used in existing > > > data > > > > > and > > > > > > > should not be enforced as errors when reading." So the MAP > schema > > > > might > > > > > > not > > > > > > > contain either a `key` or a `value`. How to find them then? > > > > > Parquet-java > > > > > > in > > > > > > > the Avro binding constantly chooses the 0th element as key and > > the > > > > 1st > > > > > > one > > > > > > > for value [5]. But it does not seem to be correct since the > spec > > > does > > > > > not > > > > > > > say anything about the order. > > > > > > > > > > > > > > [4] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1 > > > > > > > [5] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/parquet-java/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L446 > > > > > > > > > > > > > > > > > > > > > Gábor Szádovszky <ga...@apache.org> ezt írta (időpont: 2024. > > okt. > > > > 30., > > > > > > > Sze, > > > > > > > 10:28): > > > > > > > > > > > > > > > Hi Gang, > > > > > > > > > > > > > > > > I've recently started working on a similar topic so I'm glad > > > you've > > > > > > > > brought this up. > > > > > > > > > > > > > > > > I agree, [2] is not a big help here. TBH I am not sure that > the > > > > > current > > > > > > > > compatibility rules [3] are saying what they originally > wanted, > > > and > > > > > the > > > > > > > > related examples increase the confusion. > > > > > > > > > > > > > > > > (I'm using `@Nullable` where the nullability actually depends > > on > > > > the > > > > > > > > repetition of the related field.) > > > > > > > > > 1. If the repeated field is not a group, then its type is > the > > > > > element > > > > > > > > type and elements are required. > > > > > > > > This one is clear: `@Nullable List<@Nonnull primitive>` > > > > > > > > > 2. If the repeated field is a group with multiple fields, > > then > > > > its > > > > > > type > > > > > > > > is the element type and elements are required. > > > > > > > > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: > this > > is > > > > > > > actually > > > > > > > > a Struct instead of a Tuple) > > > > > > > > > 3. If the repeated field is a group with one field and is > > named > > > > > > either > > > > > > > > array or uses the LIST-annotated group's name with _tuple > > > appended > > > > > then > > > > > > > the > > > > > > > > repeated type is the element type and elements are required. > > > > > > > > What does it actually mean? With all these very specific > naming > > > > > > > > constraints we still say "...the repeated type is the element > > > > > type...", > > > > > > > > hence: `@Nullable List<@Nonnull OneTuple<...>>`. Even > examples > > > > state > > > > > > the > > > > > > > > same. Why is it different from point 4? > > > > > > > > > 4. Otherwise, the repeated field's type is the element type > > > with > > > > > the > > > > > > > > repeated field's repetition. > > > > > > > > Kind of clear: `@Nullable List<@Nonnull OneTuple<...>>`. But > > > > > otherwise > > > > > > > > what? It actually includes the officially expected 3-level > list > > > > > without > > > > > > > the > > > > > > > > naming convention that is suggested to be accepted. So why do > > we > > > > add > > > > > > the > > > > > > > > OneTuple? > > > > > > > > > > > > > > > > Instead of having such rules it would be much better to > > actually > > > > > > specify > > > > > > > > steps to identify a structure from the point of facing a > > LIST/MAP > > > > > > > > logical types and do recursion at the element level so it is > > > clear > > > > > how > > > > > > to > > > > > > > > specify deeply nested structures. > > > > > > > > We may even extend the current ones. For example I've seen > > > Parquet > > > > > > > schemas > > > > > > > > with repeated primitives without any LIST logical types. We > > > should > > > > > > accept > > > > > > > > these as well as a `@Nonnull List<@Nonnull primitive>`. > > > > > > > > > > > > > > > > WDYT? > > > > > > > > > > > > > > > > Cheers, > > > > > > > > Gabor > > > > > > > > > > > > > > > > Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. > 30., > > > Sze, > > > > > > > 5:11): > > > > > > > > > > > > > > > >> Hi, > > > > > > > >> > > > > > > > >> Recently I tried to fix a bug [1] on parquet-cpp whom is > > having > > > a > > > > > hard > > > > > > > >> time > > > > > > > >> reading Parquet file written by parquet-java with > > > > > > > >> *parquet.avro.write-old-list-structure=true* and with schema > > > > below: > > > > > > > >> ``` > > > > > > > >> optional group a (LIST) { > > > > > > > >> repeated group array (LIST) { > > > > > > > >> repeated int32 array; > > > > > > > >> } > > > > > > > >> } > > > > > > > >> ``` > > > > > > > >> > > > > > > > >> The question is whether it should be resolved as > > > > List<List<Integer>> > > > > > > or > > > > > > > >> List<OneTuple<List<Integer>>>. I think it should be the > former > > > but > > > > > the > > > > > > > >> answer from parquet-cpp is currently the latter. > > > > > > > >> > > > > > > > >> It has been explained in [2] but it is not clear on this > > > specific > > > > > > case. > > > > > > > I > > > > > > > >> have opened a PR to try to clarify it on the spec: [3]. > > > > > > > >> > > > > > > > >> Any feedback is appreciated! > > > > > > > >> > > > > > > > >> [1] https://github.com/apache/arrow/pull/43995 > > > > > > > >> [2] > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules > > > > > > > >> [3] https://github.com/apache/parquet-format/pull/466 > > > > > > > >> > > > > > > > >> Best, > > > > > > > >> Gang > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >