zZKato opened a new issue, #8851:
URL: https://github.com/apache/arrow-datafusion/issues/8851
### Describe the bug
When trying to write a column of type `List` containing a `Struct`, the
parquet writer throws the error `Error: ParquetError(General("Incorrect
number of rows, expected 4 != 0 rows"))`. This seems to be a regression: it
works fine in DataFusion v32.0.0 but fails in v33 and v34. Writing the same
data with `write_json` instead of `write_parquet` also works.
Example dataframe:
```
+------------------------------------+
| filters |
+------------------------------------+
| [{filterTypeId: 3, label: LABEL3}] |
| [{filterTypeId: 2, label: LABEL2}] |
+------------------------------------+
```
### To Reproduce
dependencies (working):
```toml
[dependencies]
tokio = { version = "1.35.1", features = ["macros"] }
datafusion = { version = "32.0.0", features = ["backtrace"] }
```
dependencies (broken):
```toml
[dependencies]
tokio = { version = "1.35.1", features = ["macros"] }
datafusion = { version = "33.0.0", features = ["backtrace"] }
```
example.json
```json
{"filters":[{"filterTypeId":3,"label":"LABEL3"}]}
{"filters":[{"filterTypeId":2,"label":"LABEL2"}]}
```
main.rs
```rust
use datafusion::{dataframe::DataFrameWriteOptions, error::DataFusionError, prelude::*};

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let ctx = SessionContext::new();
    let df = ctx
        .read_json("example.json", NdJsonReadOptions::default())
        .await?;
    df.write_parquet("result", DataFrameWriteOptions::default(), None)
        .await?;
    Ok(())
}
```
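To help narrow down whether the regression lives in DataFusion's parquet sink or in `arrow-rs` itself, here is a minimal, untested sketch that builds the same `List<Struct>` column with `arrow-rs` builders and writes it through `ArrowWriter` directly, bypassing DataFusion entirely. The crate split (`arrow-array`, `arrow-schema`, `parquet`) and versions are assumptions based on what DataFusion v33/v34 pulls in:

```rust
// Hypothetical bisection helper (assumption: arrow-array/arrow-schema/parquet
// at the versions used by datafusion v33/v34). Builds the List<Struct> column
// from the repro and writes it with ArrowWriter, bypassing DataFusion.
use std::{fs::File, sync::Arc};

use arrow_array::{
    builder::{Int64Builder, ListBuilder, StringBuilder, StructBuilder},
    ArrayRef, RecordBatch,
};
use arrow_schema::{DataType, Field, Fields};
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let struct_fields = Fields::from(vec![
        Field::new("filterTypeId", DataType::Int64, true),
        Field::new("label", DataType::Utf8, true),
    ]);
    let struct_builder = StructBuilder::new(
        struct_fields,
        vec![Box::new(Int64Builder::new()), Box::new(StringBuilder::new())],
    );
    let mut list_builder = ListBuilder::new(struct_builder);

    // Two rows, each a single-element list: [{filterTypeId, label}]
    for (id, label) in [(3_i64, "LABEL3"), (2_i64, "LABEL2")] {
        let s = list_builder.values();
        s.field_builder::<Int64Builder>(0).unwrap().append_value(id);
        s.field_builder::<StringBuilder>(1).unwrap().append_value(label);
        s.append(true);
        list_builder.append(true);
    }

    let filters: ArrayRef = Arc::new(list_builder.finish());
    let batch = RecordBatch::try_from_iter(vec![("filters", filters)])?;

    // If this write also fails, the regression is likely in arrow-rs;
    // if it succeeds, the problem is probably in DataFusion's parquet sink.
    let mut writer =
        ArrowWriter::try_new(File::create("direct.parquet")?, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```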
### Expected behavior
The parquet writer supports writing this kind of datatype, as it does in v32.
### Additional context
Maybe related to: https://github.com/apache/arrow-rs/issues/1744
I found this issue while trying to debug a different one that came up when
upgrading from v32 to v34. If the struct contains a timestamp, the error
instead becomes `Error: Internal("Unable to send array to writer!")` with the
source error `internal error: entered unreachable code: cannot downcast Int64
to byte array`.
An example of such a df:
```
+----------------------------------------------------------------------------------+
| filters                                                                          |
+----------------------------------------------------------------------------------+
| [{assignmentStartTs: 2023-11-11T11:11:11.000Z, filterTypeId: 3, label: LABEL1}]  |
| [{assignmentStartTs: 2023-11-11T11:11:11.000Z, filterTypeId: 2, label: LABEL2}]  |
+----------------------------------------------------------------------------------+
```
I tried to debug this issue myself by looking into the `arrow-rs`
implementation, but I didn't manage to find the commit that could have
changed this behavior. I also wasn't sure whether to open this bug here or in
the `arrow-rs` project, so I hope this is ok 😃.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]