Thanks Ryan for the clarification!

I think now I'm confident in my interpretation and updated the PR [1].
Could you please check it out when you have time?

[1] https://github.com/apache/parquet-format/pull/466

Best,
Gang


On Fri, Nov 1, 2024 at 6:36 AM rdb...@gmail.com <rdb...@gmail.com> wrote:

> Hopefully I can help because I wrote those rules.
>
> I think that the correct type is List<List<Int32>>. Because none of the
> first 3 rules apply, the element type is the repeated type, which is a
> repeated int32.
>
> The rules are primarily trying to account for cases where known structures
> were used. If the repeated type is not a group then it can't be an
> "element" group. If it is a group with more than one field, then it can't
> be an "element" group. Rule 3 accounts for some known conventions.
> Everything else (rule 4) assumes that the repeated type is the element type
> of the list.
>
> My interpretation is that while the repeated type is a group of one
> element, we know that the 3-level list structure is not used because the
> inner type is repeated. Above this section, the spec stats that "Element
> repetition must be `required` or `optional`, so we know it is not an
> element group. Maybe that's a new rule we should document.
>
> On Thu, Oct 31, 2024 at 2:36 AM Gang Wu <ust...@gmail.com> wrote:
>
> > > What does "the repeated field's type is the element type with the
> > repeated
> > > field's repetition" mean? Is "repeated field's type" equivalent to "the
> > > type of the only field of the repeated group"? So, is the expected
> > outcome
> > > of rule 4 an actual List<element> without a Tuple/Struct?
> >
> > Yes, that's exactly my interpretation. In rule 4, the repeated field
> > MUST be a group with a single field, which is as below:
> > ```
> > optional group foo (LIST) {
> >   repeated group bar (ANNOTATION) {
> >     required|optional|repeated TYPE baz;
> >   };
> > }
> > ```
> >
> > There are following cases based on ANNOTATION and TYPE:
> >
> > *(a) Three-level list encoding with different names: List<TYPE
> ?nullable>*
> > ```
> > optional group foo (LIST) {
> >   repeated group bar {
> >     required|optional TYPE baz;
> >   };
> > }
> > ```
> > - ANNOTATION: not specified
> > - TYPE: either primitive or group
> >
> > *(b) Two-level list encoding: List<List<TYPE notnull> notnull>*
> > ```
> > optional group foo (LIST) {
> >   repeated group bar (LIST) {
> >     repeated TYPE baz;
> >   };
> > }
> > ```
> > - ANNOTATION: MUST be LIST. Three-level LIST and MAP cannot have
> >               repeated repetition according to the spec.
> > - TYPE: either primitive or group (which MUST be two-level list)
> >
> > *(c) Two-level list encoding: List<OneTuple<List<TYPE notnull>> notnull>*
> > ```
> > optional group foo (LIST) {
> >   repeated group bar {
> >     repeated TYPE baz;
> >   };
> > }
> > ```
> > - ANNOTATION: not specified
> > - TYPE: either primitive or group (which MUST be two-level list)
> >
> > Please note that (c) assumes that `repeated TYPE baz` produces
> > `List<TYPE>` without LIST annotation.
> >
> >
> > > We should list the examples at the actual rule or at least reference
> the
> > > related rule to be more clear.
> >
> > Good suggestion! I'll try to add them.
> >
> > Best,
> > Gang
> >
> >
> > On Thu, Oct 31, 2024 at 4:41 PM Gábor Szádovszky <ga...@apache.org>
> wrote:
> >
> > > Thanks, Gang.
> > >
> > > Maybe, I don't get the actual expected outcome of rule 4. What does
> "the
> > > repeated field's type is the element type with the repeated field's
> > > repetition" mean? Is "repeated field's type" equivalent to "the type of
> > the
> > > only field of the repeated group"? So, is the expected outcome of rule
> 4
> > an
> > > actual List<element> without a Tuple/Struct?
> > > We should list the examples at the actual rule or at least reference
> the
> > > related rule to be more clear.
> > >
> > > Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 31., Cs,
> 3:03):
> > >
> > > > I think the main differences between point 3 and 4 are as below:
> > > > - Name: point 3 is the only exception to the suggestion that names
> > should
> > > >   not be used. Since point 3 has higher priority than point 4, we
> don't
> > > > care
> > > >   about names when applying point 4.
> > > > - Structure: point 3 is a special case to produce
> > List<OneTuple<element>>
> > > >   and it is still a three-level list. However, point 4 might also
> > produce
> > > >   List<List<element>> when the only field in the repeated field is
> also
> > > >   repeated and this is a nested two-level list.
> > > >
> > > > On Wed, Oct 30, 2024 at 11:35 PM Gábor Szádovszky <ga...@apache.org>
> > > > wrote:
> > > >
> > > > > The problem is not only whether 3-4 rules are clearly defined, but
> > > their
> > > > > expected outcome, especially in the light of "...these names may
> not
> > be
> > > > > used in existing data and should not be enforced as errors when
> > > > reading.".
> > > > > It means that a reader should not rely on the naming of the "nodes"
> > in
> > > a
> > > > > 3-level list but the structure itself. A 3-level list structure
> shall
> > > be
> > > > > interpreted as a List<element>.
> > > > > Meanwhile the 3-4 rules are just about the naming of a 3-level
> > > structure,
> > > > > both to be interpreted as a List<Tuple<element>> (or with your
> > > suggestion
> > > > > List<Struct<element>>).
> > > > >
> > > > > Regards,
> > > > > GAbor
> > > > >
> > > > > Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt. 30., Sze,
> > > > 13:57):
> > > > >
> > > > > > Thanks Gabor for the clarification. Please see my inline comments
> > > > below.
> > > > > >
> > > > > > > 2. If the repeated field is a group with multiple fields, then
> > its
> > > > type
> > > > > > > is the element type and elements are required.
> > > > > > > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note: this
> is
> > > > > > actually
> > > > > > > a Struct instead of a Tuple)
> > > > > >
> > > > > > Yes, I agree that this is clear. I think we need to use Struct in
> > the
> > > > > spec.
> > > > > >
> > > > > > > 3. If the repeated field is a group with one field and is named
> > > > either
> > > > > > > array or uses the LIST-annotated group's name with _tuple
> > appended
> > > > then
> > > > > > the
> > > > > > > repeated type is the element type and elements are required.
> > > > > > > What does it actually mean? With all these very specific naming
> > > > > > constraints
> > > > > > > we still say "...the repeated type is the element type...",
> > hence:
> > > > > > > `@Nullable List<@Nonnull OneTuple<...>>`. Even examples state
> the
> > > > same.
> > > > > > Why
> > > > > > > is it different from point 4?
> > > > > >
> > > > > > I think this is unclear about two things:
> > > > > > - the repeated group field SHOULD NOT be LIST-annotated.
> > > > > > - the only field within the the repeated group SHOULD NOT be
> > repeated
> > > > > >
> > > > > > IMO, point 4 is pretty vague and it applies to all unhandled
> cases
> > > from
> > > > > > point 1 to 3.
> > > > > >
> > > > > > > Instead of having such rules it would be much better to
> actually
> > > > > specify
> > > > > > > steps to identify a structure from the point of facing a
> LIST/MAP
> > > > > > > logical types and do recursion at the element level so it is
> > clear
> > > > how
> > > > > to
> > > > > > > specify deeply nested structures.
> > > > > >
> > > > > > I agree. However this is more code-oriented and description of
> all
> > > > other
> > > > > > types should be changed together to achieve this.
> > > > > >
> > > > > > > We may even extend the current ones. For example I've seen
> > Parquet
> > > > > > schemas
> > > > > > > with repeated primitives without any LIST logical types. We
> > should
> > > > > accept
> > > > > > > these as well as a `@Nonnull List<@Nonnull primitive>`.
> > > > > >
> > > > > > Yes, I just saw exactly the same issue today:
> > > > > > https://github.com/apache/arrow-rs/issues/6648
> > > > > >
> > > > > > Best,
> > > > > > Gang
> > > > > >
> > > > > > On Wed, Oct 30, 2024 at 6:36 PM Gábor Szádovszky <
> ga...@apache.org
> > >
> > > > > wrote:
> > > > > >
> > > > > > > One more thing, however this is not about lists but maps. In
> the
> > > > > backward
> > > > > > > compatibility rules of maps [4] it says "It is required that
> the
> > > > > repeated
> > > > > > > group of key-value pairs is named key_value and that its fields
> > are
> > > > > named
> > > > > > > key and value. However, these names may not be used in existing
> > > data
> > > > > and
> > > > > > > should not be enforced as errors when reading." So the MAP
> schema
> > > > might
> > > > > > not
> > > > > > > contain either a `key` or a `value`. How to find them then?
> > > > > Parquet-java
> > > > > > in
> > > > > > > the Avro binding constantly chooses the 0th element as key and
> > the
> > > > 1st
> > > > > > one
> > > > > > > for value [5]. But it does not seem to be correct since the
> spec
> > > does
> > > > > not
> > > > > > > say anything about the order.
> > > > > > >
> > > > > > > [4]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
> > > > > > > [5]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-java/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L446
> > > > > > >
> > > > > > >
> > > > > > > Gábor Szádovszky <ga...@apache.org> ezt írta (időpont: 2024.
> > okt.
> > > > 30.,
> > > > > > > Sze,
> > > > > > > 10:28):
> > > > > > >
> > > > > > > > Hi Gang,
> > > > > > > >
> > > > > > > > I've recently started working on a similar topic so I'm glad
> > > you've
> > > > > > > > brought this up.
> > > > > > > >
> > > > > > > > I agree, [2] is not a big help here. TBH I am not sure that
> the
> > > > > current
> > > > > > > > compatibility rules [3] are saying what they originally
> wanted,
> > > and
> > > > > the
> > > > > > > > related examples increase the confusion.
> > > > > > > >
> > > > > > > > (I'm using `@Nullable` where the nullability actually depends
> > on
> > > > the
> > > > > > > > repetition of the related field.)
> > > > > > > > > 1. If the repeated field is not a group, then its type is
> the
> > > > > element
> > > > > > > > type and elements are required.
> > > > > > > > This one is clear: `@Nullable List<@Nonnull primitive>`
> > > > > > > > > 2. If the repeated field is a group with multiple fields,
> > then
> > > > its
> > > > > > type
> > > > > > > > is the element type and elements are required.
> > > > > > > > Quite clear: `@Nullable List<@Nonnull Tuple<...>>` (Note:
> this
> > is
> > > > > > > actually
> > > > > > > > a Struct instead of a Tuple)
> > > > > > > > > 3. If the repeated field is a group with one field and is
> > named
> > > > > > either
> > > > > > > > array or uses the LIST-annotated group's name with _tuple
> > > appended
> > > > > then
> > > > > > > the
> > > > > > > > repeated type is the element type and elements are required.
> > > > > > > > What does it actually mean? With all these very specific
> naming
> > > > > > > > constraints we still say "...the repeated type is the element
> > > > > type...",
> > > > > > > > hence: `@Nullable List<@Nonnull OneTuple<...>>`. Even
> examples
> > > > state
> > > > > > the
> > > > > > > > same. Why is it different from point 4?
> > > > > > > > > 4. Otherwise, the repeated field's type is the element type
> > > with
> > > > > the
> > > > > > > > repeated field's repetition.
> > > > > > > > Kind of clear: `@Nullable List<@Nonnull OneTuple<...>>`. But
> > > > > otherwise
> > > > > > > > what? It actually includes the officially expected 3-level
> list
> > > > > without
> > > > > > > the
> > > > > > > > naming convention that is suggested to be accepted. So why do
> > we
> > > > add
> > > > > > the
> > > > > > > > OneTuple?
> > > > > > > >
> > > > > > > > Instead of having such rules it would be much better to
> > actually
> > > > > > specify
> > > > > > > > steps to identify a structure from the point of facing a
> > LIST/MAP
> > > > > > > > logical types and do recursion at the element level so it is
> > > clear
> > > > > how
> > > > > > to
> > > > > > > > specify deeply nested structures.
> > > > > > > > We may even extend the current ones. For example I've seen
> > > Parquet
> > > > > > > schemas
> > > > > > > > with repeated primitives without any LIST logical types. We
> > > should
> > > > > > accept
> > > > > > > > these as well as a `@Nonnull List<@Nonnull primitive>`.
> > > > > > > >
> > > > > > > > WDYT?
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gabor
> > > > > > > >
> > > > > > > > Gang Wu <ust...@gmail.com> ezt írta (időpont: 2024. okt.
> 30.,
> > > Sze,
> > > > > > > 5:11):
> > > > > > > >
> > > > > > > >> Hi,
> > > > > > > >>
> > > > > > > >> Recently I tried to fix a bug [1] on parquet-cpp whom is
> > having
> > > a
> > > > > hard
> > > > > > > >> time
> > > > > > > >> reading Parquet file written by parquet-java with
> > > > > > > >> *parquet.avro.write-old-list-structure=true* and with schema
> > > > below:
> > > > > > > >> ```
> > > > > > > >> optional group a (LIST) {
> > > > > > > >>   repeated group array (LIST) {
> > > > > > > >>     repeated int32 array;
> > > > > > > >>   }
> > > > > > > >> }
> > > > > > > >> ```
> > > > > > > >>
> > > > > > > >> The question is whether it should be resolved as
> > > > List<List<Integer>>
> > > > > > or
> > > > > > > >> List<OneTuple<List<Integer>>>. I think it should be the
> former
> > > but
> > > > > the
> > > > > > > >> answer from parquet-cpp is currently the latter.
> > > > > > > >>
> > > > > > > >> It has been explained in [2] but it is not clear on this
> > > specific
> > > > > > case.
> > > > > > > I
> > > > > > > >> have opened a PR to try to clarify it on the spec: [3].
> > > > > > > >>
> > > > > > > >> Any feedback is appreciated!
> > > > > > > >>
> > > > > > > >> [1] https://github.com/apache/arrow/pull/43995
> > > > > > > >> [2]
> > > > > > > >>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
> > > > > > > >> [3] https://github.com/apache/parquet-format/pull/466
> > > > > > > >>
> > > > > > > >> Best,
> > > > > > > >> Gang
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to