etseidl opened a new issue, #6756:
URL: https://github.com/apache/arrow-rs/issues/6756

   **Describe the bug**
   A file with the schema
   ```
   message my_record {
     REQUIRED group a (LIST) {
       REPEATED group array (LIST) {
         REPEATED INT32 array;
       }
     }
   }
   ```
   is currently read by arrow-rs as a `list<struct<list<int32>>>`, i.e. a list 
of one-field structs, each wrapping a list of integers. Consensus is forming 
that this should instead be read as a nested list of integer lists (see 
https://github.com/apache/parquet-format/pull/466 and 
https://github.com/apache/arrow/pull/43995).
   
   **To Reproduce**
   Run parquet-rewrite on the file `old_list_structure.parquet` in 
`parquet-testing/data` and print the schema from the resulting file.
   ```
   % parquet-rewrite -i old_list_structure.parquet -o old.pq
   % parquet-schema old.pq
   Metadata for file: old.pq
   
   version: 1
   num of rows: 1
   created by: parquet-rs version 53.2.0
   metadata:
     parquet.avro.schema: 
{"type":"record","name":"my_record","fields":[{"name":"a","type":{"type":"array","items":{"type":"array","items":"int"}}}]}
     writer.model.name: avro
     ARROW:schema: 
/////wABAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAEAAAAEAAAAnP///xgAAAAMAAAAAAAADLQAAAABAAAACAAAAMD///+8////GAAAAAwAAAAAAAANiAAAAAEAAAAIAAAA4P///9z///8cAAAADAAAAAAAAAxcAAAAAQAAABwAAAAEAAQABAAAABAAFAAQAAAADwAEAAAACAAQAAAAGAAAACAAAAAAAAACHAAAAAgADAAEAAsACAAAACAAAAAAAAABAAAAAAUAAABhcnJheQAAAAUAAABhcnJheQAAAAUAAABhcnJheQAAAAEAAABhAAAA
   message arrow_schema {
     REQUIRED group a (LIST) {
       REPEATED group list {
         REQUIRED group array {
           REQUIRED group array (LIST) {
             REPEATED group list {
               REQUIRED INT32 array;
             }
           }
         }
       }
     }
   }
   ```
   
   **Expected behavior**
   The test file should be read as nested lists and produce the following 
schema:
   ```
   message arrow_schema {
     REQUIRED group a (LIST) {
       REPEATED group list {
         REQUIRED group array (LIST) {
           REPEATED group list {
             REQUIRED INT32 array;
           }
         }
       }
     }
   }
   ```
   
   **Additional context**
   The root cause is the naming of the repeated group as "array". This causes 
the code that handles legacy lists to use a rule which states:
   > If the repeated field is a group with one field and is named either 
`array` or uses the `LIST`-annotated group's name with `_tuple` appended then 
the repeated type is the element type and elements are required.
   
   This rule should not apply here because a) the child of the repeated group 
"array" itself has `repeated` repetition, and b) the repeated group carries a 
`LIST` annotation.
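
The check described above can be sketched as a toy model. This is not the arrow-rs implementation: `Node`, `group`, `int32`, `legacy_rule_matches` and `rule_excluded` are hypothetical names made up for illustration, and treating either of conditions a) and b) as sufficient on its own is an assumption.

```rust
// Toy model of a Parquet schema node. These are NOT the arrow-rs types;
// every name here is hypothetical and exists only for illustration.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Repetition {
    Required,
    Repeated,
}

struct Node {
    name: &'static str,
    repetition: Repetition,
    list_annotated: bool,
    children: Vec<Node>, // empty for a primitive such as INT32
}

// Helper: a REPEATED group, optionally LIST-annotated.
fn group(name: &'static str, list_annotated: bool, children: Vec<Node>) -> Node {
    Node { name, repetition: Repetition::Repeated, list_annotated, children }
}

// Helper: an INT32 leaf with the given repetition.
fn int32(name: &'static str, repetition: Repetition) -> Node {
    Node { name, repetition, list_annotated: false, children: vec![] }
}

/// The legacy rule as quoted: the repeated field is a group with exactly one
/// field and is named `array` or `<list name>_tuple`.
fn legacy_rule_matches(list_name: &str, repeated: &Node) -> bool {
    repeated.children.len() == 1
        && (repeated.name == "array"
            || repeated.name == format!("{}_tuple", list_name))
}

/// The two reasons given in this issue for skipping the rule.
/// Assumption: either one alone is enough to exclude the rule.
fn rule_excluded(repeated: &Node) -> bool {
    let only_child_is_repeated = repeated.children.len() == 1
        && repeated.children[0].repetition == Repetition::Repeated;
    only_child_is_repeated || repeated.list_annotated
}

fn main() {
    // `REPEATED group array (LIST) { REPEATED INT32 array; }` from this issue.
    let nested = group("array", true, vec![int32("array", Repetition::Repeated)]);
    // A plain legacy element: `REPEATED group array { REQUIRED INT32 x; }`.
    let legacy = group("array", false, vec![int32("x", Repetition::Required)]);

    for (label, node) in [("nested", &nested), ("legacy", &legacy)] {
        println!(
            "{}: rule matches = {}, excluded = {}",
            label,
            legacy_rule_matches("a", node),
            rule_excluded(node)
        );
    }
}
```

In this model the group from the test file matches the name-based rule but is then excluded by both guards, while an ordinary legacy one-field group named `array` still falls under the rule.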

