I think the main differences between point 3 and 4 are as below:
- Name: point 3 is the only exception to the suggestion that names should
  not be used. Since point 3 has higher priority than point 4, we don't care
  about names when applying point 4.
- Structure: point 3 is a special case to produce List<OneTuple<element>>
  and it is still a three-level list. However, point 4 might also produce
  List<List<element>> when the only field in the repeated field is also
  repeated and this is a nested two-level list.

On Wed, Oct 30, 2024 at 11:35 PM Gábor Szádovszky <ga...@apache.org> wrote:

> The problem is not only whether 3-4 rules are clearly defined, but their
> expected outcome, especially in the light of "...these names may not be
> used in existing data and should not be enforced as errors when reading.".
> It means that a reader should not rely on the naming of the "nodes" in a
> 3-level list but the structure itself. A 3-level list structure shall be
> interpreted as a List<element>.
> Meanwhile the 3-4 rules are just about the naming of a 3-level structure,
> both to be interpreted as a List<Tuple<element>> (or with your suggestion
> List<Struct<element>>).
>
> Regards,
> GAbor
>
> Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 30., Sze, 13:57):
>
> > Thanks Gabor for the clarification. Please see my inline comments below.
> >
> > > 2. If the repeated field is a group with multiple fields, then its type
> > > is the element type and elements are required.
> > > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: this is
> > actually
> > > a Struct instead of a Tuple)
> >
> > Yes, I agree that this is clear. I think we need to use Struct in the
> spec.
> >
> > > 3. If the repeated field is a group with one field and is named either
> > > array or uses the LIST-annotated group's name with _tuple appended then
> > the
> > > repeated type is the element type and elements are required.
> > > What does it actually mean? With all these very specific naming
> > constraints
> > > we still say "...the repeated type is the element type...", hence:
> > > `@Nullable List<@Nonnull OneTuple<...>>`. Even examples state the same.
> > Why
> > > is it different from point 4?
> >
> > I think this is unclear about two things:
> > - the repeated group field SHOULD NOT be LIST-annotated.
> > - the only field within the the repeated group SHOULD NOT be repeated
> >
> > IMO, point 4 is pretty vague and it applies to all unhandled cases from
> > point 1 to 3.
> >
> > > Instead of having such rules it would be much better to actually
> specify
> > > steps to identify a structure from the point of facing a LIST/MAP
> > > logical types and do recursion at the element level so it is clear how
> to
> > > specify deeply nested structures.
> >
> > I agree. However this is more code-oriented and description of all other
> > types should be changed together to achieve this.
> >
> > > We may even extend the current ones. For example I've seen Parquet
> > schemas
> > > with repeated primitives without any LIST logical types. We should
> accept
> > > these as well as a `@Nonnull List<@Nonnull primitive>`.
> >
> > Yes, I just saw exactly the same issue today:
> > https://github.com/apache/arrow-rs/issues/6648
> >
> > Best,
> > Gang
> >
> > On Wed, Oct 30, 2024 at 6:36 PM Gábor Szádovszky <ga...@apache.org>
> wrote:
> >
> > > One more thing, however this is not about lists but maps. In the
> backward
> > > compatibility rules of maps [4] it says "It is required that the
> repeated
> > > group of key-value pairs is named key_value and that its fields are
> named
> > > key and value. However, these names may not be used in existing data
> and
> > > should not be enforced as errors when reading." So the MAP schema might
> > not
> > > contain either a `key` or a `value`. How to find them then?
> Parquet-java
> > in
> > > the Avro binding constantly chooses the 0th element as key and the 1st
> > one
> > > for value [5]. But it does not seem to be correct since the spec does
> not
> > > say anything about the order.
> > >
> > > [4]
> > >
> > >
> >
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
> > > [5]
> > >
> > >
> >
> https://github.com/apache/parquet-java/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L446
> > >
> > >
> > > Gábor Szádovszky <ga...@apache.org> ezt írta (időpont: 2024. okt. 30.,
> > > Sze,
> > > 10:28):
> > >
> > > > Hi Gang,
> > > >
> > > > I've recently started working on a similar topic so I'm glad you've
> > > > brought this up.
> > > >
> > > > I agree, [2] is not a big help here. TBH I am not sure that the
> current
> > > > compatibility rules [3] are saying what they originally wanted, and
> the
> > > > related examples increase the confusion.
> > > >
> > > > (I'm using `@Nullable` where the nullability actually depends on the
> > > > repetition of the related field.)
> > > > > 1. If the repeated field is not a group, then its type is the
> element
> > > > type and elements are required.
> > > > This one is clear: `@Nullable List<@Nonnull primitive>`
> > > > > 2. If the repeated field is a group with multiple fields, then its
> > type
> > > > is the element type and elements are required.
> > > > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: this is
> > > actually
> > > > a Struct instead of a Tuple)
> > > > > 3. If the repeated field is a group with one field and is named
> > either
> > > > array or uses the LIST-annotated group's name with _tuple appended
> then
> > > the
> > > > repeated type is the element type and elements are required.
> > > > What does it actually mean? With all these very specific naming
> > > > constraints we still say "...the repeated type is the element
> type...",
> > > > hence: `@Nullable List<@Nonnull OneTuple<...>>`. Even examples state
> > the
> > > > same. Why is it different from point 4?
> > > > > 4. Otherwise, the repeated field's type is the element type with
> the
> > > > repeated field's repetition.
> > > > Kind of clear: `@Nullable List<@Nonnull OneTuple<...>>`. But
> otherwise
> > > > what? It actually includes the officially expected 3-level list
> without
> > > the
> > > > naming convention that is suggested to be accepted. So why do we add
> > the
> > > > OneTuple?
> > > >
> > > > Instead of having such rules it would be much better to actually
> > specify
> > > > steps to identify a structure from the point of facing a LIST/MAP
> > > > logical types and do recursion at the element level so it is clear
> how
> > to
> > > > specify deeply nested structures.
> > > > We may even extend the current ones. For example I've seen Parquet
> > > schemas
> > > > with repeated primitives without any LIST logical types. We should
> > accept
> > > > these as well as a `@Nonnull List<@Nonnull primitive>`.
> > > >
> > > > WDYT?
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > > Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 30., Sze,
> > > 5:11):
> > > >
> > > >> Hi,
> > > >>
> > > >> Recently I tried to fix a bug [1] on parquet-cpp whom is having a
> hard
> > > >> time
> > > >> reading Parquet file written by parquet-java with
> > > >> *parquet.avro.write-old-list-structure=true* and with schema below:
> > > >> ```
> > > >> optional group a (LIST) {
> > > >>   repeated group array (LIST) {
> > > >>     repeated int32 array;
> > > >>   }
> > > >> }
> > > >> ```
> > > >>
> > > >> The question is whether it should be resolved as List<List<Integer>>
> > or
> > > >> List<OneTuple<List<Integer>>>. I think it should be the former but
> the
> > > >> answer from parquet-cpp is currently the latter.
> > > >>
> > > >> It has been explained in [2] but it is not clear on this specific
> > case.
> > > I
> > > >> have opened a PR to try to clarify it on the spec: [3].
> > > >>
> > > >> Any feedback is appreciated!
> > > >>
> > > >> [1] https://github.com/apache/arrow/pull/43995
> > > >> [2]
> > > >>
> > > >>
> > >
> >
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
> > > >> [3] https://github.com/apache/parquet-format/pull/466
> > > >>
> > > >> Best,
> > > >> Gang
> > > >>
> > > >
> > >
> >
>

Reply via email to