nevi-me opened a new issue #282:
URL: https://github.com/apache/arrow-rs/issues/282


   **Describe the bug**
   
   First documented in 
https://github.com/apache/arrow-rs/pull/270#issuecomment-836762589.
   
   When writing some combinations of nested Arrow data to Parquet, we 
trigger a bounds error in the level calculations.
   The most likely cause is that we are not correctly distinguishing an 
empty list slot from a null list slot: the error is triggered in the logic 
that handles exactly this distinction.
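   For context, the two cases that may be getting conflated look like this in 
arrow-rs (a minimal sketch; builder/constructor names may differ slightly 
between versions):

   ```rust
   use arrow::array::{Array, ListArray};
   use arrow::datatypes::Int32Type;

   fn main() {
       // Three list slots: one populated, one empty (valid, zero-length),
       // and one null. Empty and null are distinct cases and must produce
       // different Parquet definition levels.
       let data = vec![
           Some(vec![Some(1), Some(2)]), // [1, 2]
           Some(vec![]),                 // []   - empty, but NOT null
           None,                         // null - no list at all
       ];
       let list = ListArray::from_iter_primitive::<Int32Type, _, _>(data);

       assert!(!list.is_null(1)); // the empty slot is still valid
       assert_eq!(list.value_length(1), 0);
       assert!(list.is_null(2)); // the null slot is not
   }
   ```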
   
   **To Reproduce**
   
   Try the below test:
   
   ```rust
   use std::fs::File;
   use std::sync::Arc;

   use arrow::datatypes::{DataType, Field, Schema};
   use parquet::arrow::ArrowWriter;

   #[test]
   fn test_write_ipc_nested_lists() {
       let fields = vec![Field::new(
           "list_a",
           DataType::List(Box::new(Field::new(
               "list_b",
               DataType::List(Box::new(Field::new(
                   "struct_c",
                   DataType::Struct(vec![
                       Field::new("prim_d", DataType::Boolean, true),
                       Field::new(
                           "list_e",
                           DataType::LargeList(Box::new(Field::new(
                               "string_f",
                               DataType::LargeUtf8,
                               true,
                           ))),
                           false,
                       ),
                   ]),
                   true,
               ))),
               false,
           ))),
           true,
       )];
       let schema = Arc::new(Schema::new(fields));
       // making this nullable guarantees that one of the list slots will be
       // empty, triggering the error
       let batch =
           arrow::util::data_gen::create_random_batch(schema, 3, 0.35, 0.6).unwrap();

       // write ipc (to read in pyarrow, and write parquet from pyarrow)
       let file = File::create("arrow_nested_random.arrow").unwrap();
       let mut writer =
           arrow::ipc::writer::FileWriter::try_new(file, batch.schema().as_ref()).unwrap();
       writer.write(&batch).unwrap();
       writer.finish().unwrap();

       let file = File::create("arrow_nested_random_rust.parquet").unwrap();
       let mut writer =
           ArrowWriter::try_new(file.try_clone().unwrap(), batch.schema(), None)
               .expect("Unable to write file");

       // this will trigger the error in question
       writer.write(&batch).unwrap();
       writer.close().unwrap();
   }
   ```
   
   **Expected behavior**
   
   The Parquet file should be written correctly, and pyarrow or Spark should be 
able to read the data back correctly.
   
   **Additional context**
   
   Not sure
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
