Thanks Gabor for the clarification. Please see my inline comments below.

> 2. If the repeated field is a group with multiple fields, then its type
> is the element type and elements are required.
> Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: this is actually
> a Struct instead of a Tuple)

Yes, I agree that this is clear. I think we need to use Struct in the spec.

> 3. If the repeated field is a group with one field and is named either
> array or uses the LIST-annotated group's name with _tuple appended then
the
> repeated type is the element type and elements are required.
> What does it actually mean? With all these very specific naming
constraints
> we still say "...the repeated type is the element type...", hence:
> `@Nullable List<@Nonnull OneTuple<...>>`. Even examples state the same.
Why
> is it different from point 4?

I think this is unclear about two things:
- the repeated group field SHOULD NOT be LIST-annotated.
- the only field within the the repeated group SHOULD NOT be repeated

IMO, point 4 is pretty vague and it applies to all unhandled cases from
point 1 to 3.

> Instead of having such rules it would be much better to actually specify
> steps to identify a structure from the point of facing a LIST/MAP
> logical types and do recursion at the element level so it is clear how to
> specify deeply nested structures.

I agree. However this is more code-oriented and description of all other
types should be changed together to achieve this.

> We may even extend the current ones. For example I've seen Parquet schemas
> with repeated primitives without any LIST logical types. We should accept
> these as well as a `@Nonnull List<@Nonnull primitive>`.

Yes, I just saw exactly the same issue today:
https://github.com/apache/arrow-rs/issues/6648

Best,
Gang

On Wed, Oct 30, 2024 at 6:36 PM Gábor Szádovszky <ga...@apache.org> wrote:

> One more thing, however this is not about lists but maps. In the backward
> compatibility rules of maps [4] it says "It is required that the repeated
> group of key-value pairs is named key_value and that its fields are named
> key and value. However, these names may not be used in existing data and
> should not be enforced as errors when reading." So the MAP schema might not
> contain either a `key` or a `value`. How to find them then? Parquet-java in
> the Avro binding constantly chooses the 0th element as key and the 1st one
> for value [5]. But it does not seem to be correct since the spec does not
> say anything about the order.
>
> [4]
>
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
> [5]
>
> https://github.com/apache/parquet-java/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L446
>
>
> Gábor Szádovszky <ga...@apache.org> ezt írta (időpont: 2024. okt. 30.,
> Sze,
> 10:28):
>
> > Hi Gang,
> >
> > I've recently started working on a similar topic so I'm glad you've
> > brought this up.
> >
> > I agree, [2] is not a big help here. TBH I am not sure that the current
> > compatibility rules [3] are saying what they originally wanted, and the
> > related examples increase the confusion.
> >
> > (I'm using `@Nullable` where the nullability actually depends on the
> > repetition of the related field.)
> > > 1. If the repeated field is not a group, then its type is the element
> > type and elements are required.
> > This one is clear: `@Nullable List<@Nonnull primitive>`
> > > 2. If the repeated field is a group with multiple fields, then its type
> > is the element type and elements are required.
> > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: this is
> actually
> > a Struct instead of a Tuple)
> > > 3. If the repeated field is a group with one field and is named either
> > array or uses the LIST-annotated group's name with _tuple appended then
> the
> > repeated type is the element type and elements are required.
> > What does it actually mean? With all these very specific naming
> > constraints we still say "...the repeated type is the element type...",
> > hence: `@Nullable List<@Nonnull OneTuple<...>>`. Even examples state the
> > same. Why is it different from point 4?
> > > 4. Otherwise, the repeated field's type is the element type with the
> > repeated field's repetition.
> > Kind of clear: `@Nullable List<@Nonnull OneTuple<...>>`. But otherwise
> > what? It actually includes the officially expected 3-level list without
> the
> > naming convention that is suggested to be accepted. So why do we add the
> > OneTuple?
> >
> > Instead of having such rules it would be much better to actually specify
> > steps to identify a structure from the point of facing a LIST/MAP
> > logical types and do recursion at the element level so it is clear how to
> > specify deeply nested structures.
> > We may even extend the current ones. For example I've seen Parquet
> schemas
> > with repeated primitives without any LIST logical types. We should accept
> > these as well as a `@Nonnull List<@Nonnull primitive>`.
> >
> > WDYT?
> >
> > Cheers,
> > Gabor
> >
> > Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 30., Sze,
> 5:11):
> >
> >> Hi,
> >>
> >> Recently I tried to fix a bug [1] on parquet-cpp whom is having a hard
> >> time
> >> reading Parquet file written by parquet-java with
> >> *parquet.avro.write-old-list-structure=true* and with schema below:
> >> ```
> >> optional group a (LIST) {
> >>   repeated group array (LIST) {
> >>     repeated int32 array;
> >>   }
> >> }
> >> ```
> >>
> >> The question is whether it should be resolved as List<List<Integer>> or
> >> List<OneTuple<List<Integer>>>. I think it should be the former but the
> >> answer from parquet-cpp is currently the latter.
> >>
> >> It has been explained in [2] but it is not clear on this specific case.
> I
> >> have opened a PR to try to clarify it on the spec: [3].
> >>
> >> Any feedback is appreciated!
> >>
> >> [1] https://github.com/apache/arrow/pull/43995
> >> [2]
> >>
> >>
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
> >> [3] https://github.com/apache/parquet-format/pull/466
> >>
> >> Best,
> >> Gang
> >>
> >
>

Reply via email to