Thanks, Gang. Maybe, I don't get the actual expected outcome of rule 4. What does "the repeated field's type is the element type with the repeated field's repetition" mean? Is "repeated field's type" equivalent to "the type of the only field of the repeated group"? So, is the expected outcome of rule 4 an actual List<element> without a Tuple/Struct? We should list the examples at the actual rule or at least reference the related rule to be more clear.
Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 31., Cs, 3:03): > I think the main differences between point 3 and 4 are as below: > - Name: point 3 is the only exception to the suggestion that names should > not be used. Since point 3 has higher priority than point 4, we don't > care > about names when applying point 4. > - Structure: point 3 is a special case to produce List<OneTuple<element>> > and it is still a three-level list. However, point 4 might also produce > List<List<element>> when the only field in the repeated field is also > repeated and this is a nested two-level list. > > On Wed, Oct 30, 2024 at 11:35 PM Gábor Szádovszky <ga...@apache.org> > wrote: > > > The problem is not only whether 3-4 rules are clearly defined, but their > > expected outcome, especially in the light of "...these names may not be > > used in existing data and should not be enforced as errors when > reading.". > > It means that a reader should not rely on the naming of the "nodes" in a > > 3-level list but the structure itself. A 3-level list structure shall be > > interpreted as a List<element>. > > Meanwhile the 3-4 rules are just about the naming of a 3-level structure, > > both to be interpreted as a List<Tuple<element>> (or with your suggestion > > List<Struct<element>>). > > > > Regards, > > GAbor > > > > Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 30., Sze, > 13:57): > > > > > Thanks Gabor for the clarification. Please see my inline comments > below. > > > > > > > 2. If the repeated field is a group with multiple fields, then its > type > > > > is the element type and elements are required. > > > > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: this is > > > actually > > > > a Struct instead of a Tuple) > > > > > > Yes, I agree that this is clear. I think we need to use Struct in the > > spec. > > > > > > > 3. If the repeated field is a group with one field and is named > either > > > > array or uses the LIST-annotated group's name with _tuple appended > then > > > the > > > > repeated type is the element type and elements are required. > > > > What does it actually mean? With all these very specific naming > > > constraints > > > > we still say "...the repeated type is the element type...", hence: > > > > `@Nullable List<@Nonnull OneTuple<...>>`. Even examples state the > same. > > > Why > > > > is it different from point 4? > > > > > > I think this is unclear about two things: > > > - the repeated group field SHOULD NOT be LIST-annotated. > > > - the only field within the the repeated group SHOULD NOT be repeated > > > > > > IMO, point 4 is pretty vague and it applies to all unhandled cases from > > > point 1 to 3. > > > > > > > Instead of having such rules it would be much better to actually > > specify > > > > steps to identify a structure from the point of facing a LIST/MAP > > > > logical types and do recursion at the element level so it is clear > how > > to > > > > specify deeply nested structures. > > > > > > I agree. However this is more code-oriented and description of all > other > > > types should be changed together to achieve this. > > > > > > > We may even extend the current ones. For example I've seen Parquet > > > schemas > > > > with repeated primitives without any LIST logical types. We should > > accept > > > > these as well as a `@Nonnull List<@Nonnull primitive>`. > > > > > > Yes, I just saw exactly the same issue today: > > > https://github.com/apache/arrow-rs/issues/6648 > > > > > > Best, > > > Gang > > > > > > On Wed, Oct 30, 2024 at 6:36 PM Gábor Szádovszky <ga...@apache.org> > > wrote: > > > > > > > One more thing, however this is not about lists but maps. In the > > backward > > > > compatibility rules of maps [4] it says "It is required that the > > repeated > > > > group of key-value pairs is named key_value and that its fields are > > named > > > > key and value. However, these names may not be used in existing data > > and > > > > should not be enforced as errors when reading." So the MAP schema > might > > > not > > > > contain either a `key` or a `value`. How to find them then? > > Parquet-java > > > in > > > > the Avro binding constantly chooses the 0th element as key and the > 1st > > > one > > > > for value [5]. But it does not seem to be correct since the spec does > > not > > > > say anything about the order. > > > > > > > > [4] > > > > > > > > > > > > > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1 > > > > [5] > > > > > > > > > > > > > > https://github.com/apache/parquet-java/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L446 > > > > > > > > > > > > Gábor Szádovszky <ga...@apache.org> ezt írta (időpont: 2024. okt. > 30., > > > > Sze, > > > > 10:28): > > > > > > > > > Hi Gang, > > > > > > > > > > I've recently started working on a similar topic so I'm glad you've > > > > > brought this up. > > > > > > > > > > I agree, [2] is not a big help here. TBH I am not sure that the > > current > > > > > compatibility rules [3] are saying what they originally wanted, and > > the > > > > > related examples increase the confusion. > > > > > > > > > > (I'm using `@Nullable` where the nullability actually depends on > the > > > > > repetition of the related field.) > > > > > > 1. If the repeated field is not a group, then its type is the > > element > > > > > type and elements are required. > > > > > This one is clear: `@Nullable List<@Nonnull primitive>` > > > > > > 2. If the repeated field is a group with multiple fields, then > its > > > type > > > > > is the element type and elements are required. > > > > > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: this is > > > > actually > > > > > a Struct instead of a Tuple) > > > > > > 3. If the repeated field is a group with one field and is named > > > either > > > > > array or uses the LIST-annotated group's name with _tuple appended > > then > > > > the > > > > > repeated type is the element type and elements are required. > > > > > What does it actually mean? With all these very specific naming > > > > > constraints we still say "...the repeated type is the element > > type...", > > > > > hence: `@Nullable List<@Nonnull OneTuple<...>>`. Even examples > state > > > the > > > > > same. Why is it different from point 4? > > > > > > 4. Otherwise, the repeated field's type is the element type with > > the > > > > > repeated field's repetition. > > > > > Kind of clear: `@Nullable List<@Nonnull OneTuple<...>>`. But > > otherwise > > > > > what? It actually includes the officially expected 3-level list > > without > > > > the > > > > > naming convention that is suggested to be accepted. So why do we > add > > > the > > > > > OneTuple? > > > > > > > > > > Instead of having such rules it would be much better to actually > > > specify > > > > > steps to identify a structure from the point of facing a LIST/MAP > > > > > logical types and do recursion at the element level so it is clear > > how > > > to > > > > > specify deeply nested structures. > > > > > We may even extend the current ones. For example I've seen Parquet > > > > schemas > > > > > with repeated primitives without any LIST logical types. We should > > > accept > > > > > these as well as a `@Nonnull List<@Nonnull primitive>`. > > > > > > > > > > WDYT? > > > > > > > > > > Cheers, > > > > > Gabor > > > > > > > > > > Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 30., Sze, > > > > 5:11): > > > > > > > > > >> Hi, > > > > >> > > > > >> Recently I tried to fix a bug [1] on parquet-cpp whom is having a > > hard > > > > >> time > > > > >> reading Parquet file written by parquet-java with > > > > >> *parquet.avro.write-old-list-structure=true* and with schema > below: > > > > >> ``` > > > > >> optional group a (LIST) { > > > > >> repeated group array (LIST) { > > > > >> repeated int32 array; > > > > >> } > > > > >> } > > > > >> ``` > > > > >> > > > > >> The question is whether it should be resolved as > List<List<Integer>> > > > or > > > > >> List<OneTuple<List<Integer>>>. I think it should be the former but > > the > > > > >> answer from parquet-cpp is currently the latter. > > > > >> > > > > >> It has been explained in [2] but it is not clear on this specific > > > case. > > > > I > > > > >> have opened a PR to try to clarify it on the spec: [3]. > > > > >> > > > > >> Any feedback is appreciated! > > > > >> > > > > >> [1] https://github.com/apache/arrow/pull/43995 > > > > >> [2] > > > > >> > > > > >> > > > > > > > > > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules > > > > >> [3] https://github.com/apache/parquet-format/pull/466 > > > > >> > > > > >> Best, > > > > >> Gang > > > > >> > > > > > > > > > > > > > > >