The problem is not only whether 3-4 rules are clearly defined, but their expected outcome, especially in the light of "...these names may not be used in existing data and should not be enforced as errors when reading.". It means that a reader should not rely on the naming of the "nodes" in a 3-level list but the structure itself. A 3-level list structure shall be interpreted as a List<element>. Meanwhile the 3-4 rules are just about the naming of a 3-level structure, both to be interpreted as a List<Tuple<element>> (or with your suggestion List<Struct<element>>).
Regards, GAbor Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 30., Sze, 13:57): > Thanks Gabor for the clarification. Please see my inline comments below. > > > 2. If the repeated field is a group with multiple fields, then its type > > is the element type and elements are required. > > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: this is > actually > > a Struct instead of a Tuple) > > Yes, I agree that this is clear. I think we need to use Struct in the spec. > > > 3. If the repeated field is a group with one field and is named either > > array or uses the LIST-annotated group's name with _tuple appended then > the > > repeated type is the element type and elements are required. > > What does it actually mean? With all these very specific naming > constraints > > we still say "...the repeated type is the element type...", hence: > > `@Nullable List<@Nonnull OneTuple<...>>`. Even examples state the same. > Why > > is it different from point 4? > > I think this is unclear about two things: > - the repeated group field SHOULD NOT be LIST-annotated. > - the only field within the the repeated group SHOULD NOT be repeated > > IMO, point 4 is pretty vague and it applies to all unhandled cases from > point 1 to 3. > > > Instead of having such rules it would be much better to actually specify > > steps to identify a structure from the point of facing a LIST/MAP > > logical types and do recursion at the element level so it is clear how to > > specify deeply nested structures. > > I agree. However this is more code-oriented and description of all other > types should be changed together to achieve this. > > > We may even extend the current ones. For example I've seen Parquet > schemas > > with repeated primitives without any LIST logical types. We should accept > > these as well as a `@Nonnull List<@Nonnull primitive>`. > > Yes, I just saw exactly the same issue today: > https://github.com/apache/arrow-rs/issues/6648 > > Best, > Gang > > On Wed, Oct 30, 2024 at 6:36 PM Gábor Szádovszky <ga...@apache.org> wrote: > > > One more thing, however this is not about lists but maps. In the backward > > compatibility rules of maps [4] it says "It is required that the repeated > > group of key-value pairs is named key_value and that its fields are named > > key and value. However, these names may not be used in existing data and > > should not be enforced as errors when reading." So the MAP schema might > not > > contain either a `key` or a `value`. How to find them then? Parquet-java > in > > the Avro binding constantly chooses the 0th element as key and the 1st > one > > for value [5]. But it does not seem to be correct since the spec does not > > say anything about the order. > > > > [4] > > > > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1 > > [5] > > > > > https://github.com/apache/parquet-java/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L446 > > > > > > Gábor Szádovszky <ga...@apache.org> ezt írta (időpont: 2024. okt. 30., > > Sze, > > 10:28): > > > > > Hi Gang, > > > > > > I've recently started working on a similar topic so I'm glad you've > > > brought this up. > > > > > > I agree, [2] is not a big help here. TBH I am not sure that the current > > > compatibility rules [3] are saying what they originally wanted, and the > > > related examples increase the confusion. > > > > > > (I'm using `@Nullable` where the nullability actually depends on the > > > repetition of the related field.) > > > > 1. If the repeated field is not a group, then its type is the element > > > type and elements are required. > > > This one is clear: `@Nullable List<@Nonnull primitive>` > > > > 2. If the repeated field is a group with multiple fields, then its > type > > > is the element type and elements are required. > > > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: this is > > actually > > > a Struct instead of a Tuple) > > > > 3. If the repeated field is a group with one field and is named > either > > > array or uses the LIST-annotated group's name with _tuple appended then > > the > > > repeated type is the element type and elements are required. > > > What does it actually mean? With all these very specific naming > > > constraints we still say "...the repeated type is the element type...", > > > hence: `@Nullable List<@Nonnull OneTuple<...>>`. Even examples state > the > > > same. Why is it different from point 4? > > > > 4. Otherwise, the repeated field's type is the element type with the > > > repeated field's repetition. > > > Kind of clear: `@Nullable List<@Nonnull OneTuple<...>>`. But otherwise > > > what? It actually includes the officially expected 3-level list without > > the > > > naming convention that is suggested to be accepted. So why do we add > the > > > OneTuple? > > > > > > Instead of having such rules it would be much better to actually > specify > > > steps to identify a structure from the point of facing a LIST/MAP > > > logical types and do recursion at the element level so it is clear how > to > > > specify deeply nested structures. > > > We may even extend the current ones. For example I've seen Parquet > > schemas > > > with repeated primitives without any LIST logical types. We should > accept > > > these as well as a `@Nonnull List<@Nonnull primitive>`. > > > > > > WDYT? > > > > > > Cheers, > > > Gabor > > > > > > Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 30., Sze, > > 5:11): > > > > > >> Hi, > > >> > > >> Recently I tried to fix a bug [1] on parquet-cpp whom is having a hard > > >> time > > >> reading Parquet file written by parquet-java with > > >> *parquet.avro.write-old-list-structure=true* and with schema below: > > >> ``` > > >> optional group a (LIST) { > > >> repeated group array (LIST) { > > >> repeated int32 array; > > >> } > > >> } > > >> ``` > > >> > > >> The question is whether it should be resolved as List<List<Integer>> > or > > >> List<OneTuple<List<Integer>>>. I think it should be the former but the > > >> answer from parquet-cpp is currently the latter. > > >> > > >> It has been explained in [2] but it is not clear on this specific > case. > > I > > >> have opened a PR to try to clarify it on the spec: [3]. > > >> > > >> Any feedback is appreciated! > > >> > > >> [1] https://github.com/apache/arrow/pull/43995 > > >> [2] > > >> > > >> > > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules > > >> [3] https://github.com/apache/parquet-format/pull/466 > > >> > > >> Best, > > >> Gang > > >> > > > > > >