Thanks Gabor for the clarification. Please see my inline comments below. > 2. If the repeated field is a group with multiple fields, then its type > is the element type and elements are required. > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: this is actually > a Struct instead of a Tuple)
Yes, I agree that this is clear. I think we need to use Struct in the spec. > 3. If the repeated field is a group with one field and is named either > array or uses the LIST-annotated group's name with _tuple appended then the > repeated type is the element type and elements are required. > What does it actually mean? With all these very specific naming constraints > we still say "...the repeated type is the element type...", hence: > `@Nullable List<@Nonnull OneTuple<...>>`. Even examples state the same. Why > is it different from point 4? I think this is unclear about two things: - the repeated group field SHOULD NOT be LIST-annotated. - the only field within the the repeated group SHOULD NOT be repeated IMO, point 4 is pretty vague and it applies to all unhandled cases from point 1 to 3. > Instead of having such rules it would be much better to actually specify > steps to identify a structure from the point of facing a LIST/MAP > logical types and do recursion at the element level so it is clear how to > specify deeply nested structures. I agree. However this is more code-oriented and description of all other types should be changed together to achieve this. > We may even extend the current ones. For example I've seen Parquet schemas > with repeated primitives without any LIST logical types. We should accept > these as well as a `@Nonnull List<@Nonnull primitive>`. Yes, I just saw exactly the same issue today: https://github.com/apache/arrow-rs/issues/6648 Best, Gang On Wed, Oct 30, 2024 at 6:36 PM Gábor Szádovszky <ga...@apache.org> wrote: > One more thing, however this is not about lists but maps. In the backward > compatibility rules of maps [4] it says "It is required that the repeated > group of key-value pairs is named key_value and that its fields are named > key and value. However, these names may not be used in existing data and > should not be enforced as errors when reading." So the MAP schema might not > contain either a `key` or a `value`. How to find them then? Parquet-java in > the Avro binding constantly chooses the 0th element as key and the 1st one > for value [5]. But it does not seem to be correct since the spec does not > say anything about the order. > > [4] > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1 > [5] > > https://github.com/apache/parquet-java/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L446 > > > Gábor Szádovszky <ga...@apache.org> ezt írta (időpont: 2024. okt. 30., > Sze, > 10:28): > > > Hi Gang, > > > > I've recently started working on a similar topic so I'm glad you've > > brought this up. > > > > I agree, [2] is not a big help here. TBH I am not sure that the current > > compatibility rules [3] are saying what they originally wanted, and the > > related examples increase the confusion. > > > > (I'm using `@Nullable` where the nullability actually depends on the > > repetition of the related field.) > > > 1. If the repeated field is not a group, then its type is the element > > type and elements are required. > > This one is clear: `@Nullable List<@Nonnull primitive>` > > > 2. If the repeated field is a group with multiple fields, then its type > > is the element type and elements are required. > > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: this is > actually > > a Struct instead of a Tuple) > > > 3. If the repeated field is a group with one field and is named either > > array or uses the LIST-annotated group's name with _tuple appended then > the > > repeated type is the element type and elements are required. > > What does it actually mean? With all these very specific naming > > constraints we still say "...the repeated type is the element type...", > > hence: `@Nullable List<@Nonnull OneTuple<...>>`. Even examples state the > > same. Why is it different from point 4? > > > 4. Otherwise, the repeated field's type is the element type with the > > repeated field's repetition. > > Kind of clear: `@Nullable List<@Nonnull OneTuple<...>>`. But otherwise > > what? It actually includes the officially expected 3-level list without > the > > naming convention that is suggested to be accepted. So why do we add the > > OneTuple? > > > > Instead of having such rules it would be much better to actually specify > > steps to identify a structure from the point of facing a LIST/MAP > > logical types and do recursion at the element level so it is clear how to > > specify deeply nested structures. > > We may even extend the current ones. For example I've seen Parquet > schemas > > with repeated primitives without any LIST logical types. We should accept > > these as well as a `@Nonnull List<@Nonnull primitive>`. > > > > WDYT? > > > > Cheers, > > Gabor > > > > Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 30., Sze, > 5:11): > > > >> Hi, > >> > >> Recently I tried to fix a bug [1] on parquet-cpp whom is having a hard > >> time > >> reading Parquet file written by parquet-java with > >> *parquet.avro.write-old-list-structure=true* and with schema below: > >> ``` > >> optional group a (LIST) { > >> repeated group array (LIST) { > >> repeated int32 array; > >> } > >> } > >> ``` > >> > >> The question is whether it should be resolved as List<List<Integer>> or > >> List<OneTuple<List<Integer>>>. I think it should be the former but the > >> answer from parquet-cpp is currently the latter. > >> > >> It has been explained in [2] but it is not clear on this specific case. > I > >> have opened a PR to try to clarify it on the spec: [3]. > >> > >> Any feedback is appreciated! > >> > >> [1] https://github.com/apache/arrow/pull/43995 > >> [2] > >> > >> > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules > >> [3] https://github.com/apache/parquet-format/pull/466 > >> > >> Best, > >> Gang > >> > > >