One more thing, however this is not about lists but maps. In the backward compatibility rules of maps [4] it says "It is required that the repeated group of key-value pairs is named key_value and that its fields are named key and value. However, these names may not be used in existing data and should not be enforced as errors when reading." So the MAP schema might not contain either a `key` or a `value`. How to find them then? Parquet-java in the Avro binding constantly chooses the 0th element as key and the 1st one for value [5]. But it does not seem to be correct since the spec does not say anything about the order.
[4] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1 [5] https://github.com/apache/parquet-java/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L446 Gábor Szádovszky <ga...@apache.org> ezt írta (időpont: 2024. okt. 30., Sze, 10:28): > Hi Gang, > > I've recently started working on a similar topic so I'm glad you've > brought this up. > > I agree, [2] is not a big help here. TBH I am not sure that the current > compatibility rules [3] are saying what they originally wanted, and the > related examples increase the confusion. > > (I'm using `@Nullable` where the nullability actually depends on the > repetition of the related field.) > > 1. If the repeated field is not a group, then its type is the element > type and elements are required. > This one is clear: `@Nullable List<@Nonnull primitive>` > > 2. If the repeated field is a group with multiple fields, then its type > is the element type and elements are required. > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: this is actually > a Struct instead of a Tuple) > > 3. If the repeated field is a group with one field and is named either > array or uses the LIST-annotated group's name with _tuple appended then the > repeated type is the element type and elements are required. > What does it actually mean? With all these very specific naming > constraints we still say "...the repeated type is the element type...", > hence: `@Nullable List<@Nonnull OneTuple<...>>`. Even examples state the > same. Why is it different from point 4? > > 4. Otherwise, the repeated field's type is the element type with the > repeated field's repetition. > Kind of clear: `@Nullable List<@Nonnull OneTuple<...>>`. But otherwise > what? It actually includes the officially expected 3-level list without the > naming convention that is suggested to be accepted. So why do we add the > OneTuple? > > Instead of having such rules it would be much better to actually specify > steps to identify a structure from the point of facing a LIST/MAP > logical types and do recursion at the element level so it is clear how to > specify deeply nested structures. > We may even extend the current ones. For example I've seen Parquet schemas > with repeated primitives without any LIST logical types. We should accept > these as well as a `@Nonnull List<@Nonnull primitive>`. > > WDYT? > > Cheers, > Gabor > > Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 30., Sze, 5:11): > >> Hi, >> >> Recently I tried to fix a bug [1] on parquet-cpp whom is having a hard >> time >> reading Parquet file written by parquet-java with >> *parquet.avro.write-old-list-structure=true* and with schema below: >> ``` >> optional group a (LIST) { >> repeated group array (LIST) { >> repeated int32 array; >> } >> } >> ``` >> >> The question is whether it should be resolved as List<List<Integer>> or >> List<OneTuple<List<Integer>>>. I think it should be the former but the >> answer from parquet-cpp is currently the latter. >> >> It has been explained in [2] but it is not clear on this specific case. I >> have opened a PR to try to clarify it on the spec: [3]. >> >> Any feedback is appreciated! >> >> [1] https://github.com/apache/arrow/pull/43995 >> [2] >> >> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules >> [3] https://github.com/apache/parquet-format/pull/466 >> >> Best, >> Gang >> >