zZKato opened a new issue, #8851:
URL: https://github.com/apache/arrow-datafusion/issues/8851
### Describe the bug
When trying to write a column of type `List` containing a `Struct`, the
parquet writer throws the error `Error: ParquetError(General("Incorrect
number of rows, expected 4 != 0 rows"))`. This seems to be a regression: it
works fine in DataFusion v32.0.0 but fails in v33 and v34. Writing the same
data with `write_json` instead of `write_parquet` also works.
Example dataframe:
```
+------------------------------------+
| filters |
+------------------------------------+
| [{filterTypeId: 3, label: LABEL3}] |
| [{filterTypeId: 2, label: LABEL2}] |
+------------------------------------+
```
### To Reproduce
dependencies (working):
```toml
[dependencies]
tokio = { version = "1.35.1", features = ["macros"] }
datafusion = { version = "32.0.0", features = ["backtrace"] }
```
dependencies (broken):
```toml
[dependencies]
tokio = { version = "1.35.1", features = ["macros"] }
datafusion = { version = "33.0.0", features = ["backtrace"] }
```
example.json
```json
{"filters":[{"filterTypeId":3,"label":"LABEL3"}]}
{"filters":[{"filterTypeId":2,"label":"LABEL2"}]}
```
main.rs
```rust
use datafusion::{dataframe::DataFrameWriteOptions, error::DataFusionError, prelude::*};

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let ctx = SessionContext::new();
    let df = ctx
        .read_json("example.json", NdJsonReadOptions::default())
        .await?;
    df.write_parquet("result", DataFrameWriteOptions::default(), None)
        .await?;
    Ok(())
}
```
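To help narrow down whether the regression lives in DataFusion's parquet sink or in `arrow-rs` itself, here is a minimal, untested sketch that builds the same `List<Struct>` column with `arrow-rs` builders and writes it through `ArrowWriter` directly, bypassing DataFusion entirely. The crate split (`arrow-array`, `arrow-schema`, `parquet`) and versions are assumptions based on what DataFusion v33/v34 pulls in:

```rust
// Hypothetical bisection helper (assumption: arrow-array/arrow-schema/parquet
// at the versions used by datafusion v33/v34). Builds the List<Struct> column
// from the repro and writes it with ArrowWriter, bypassing DataFusion.
use std::{fs::File, sync::Arc};

use arrow_array::{
    builder::{Int64Builder, ListBuilder, StringBuilder, StructBuilder},
    ArrayRef, RecordBatch,
};
use arrow_schema::{DataType, Field, Fields};
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let struct_fields = Fields::from(vec![
        Field::new("filterTypeId", DataType::Int64, true),
        Field::new("label", DataType::Utf8, true),
    ]);
    let struct_builder = StructBuilder::new(
        struct_fields,
        vec![Box::new(Int64Builder::new()), Box::new(StringBuilder::new())],
    );
    let mut list_builder = ListBuilder::new(struct_builder);

    // Two rows, each a single-element list: [{filterTypeId, label}]
    for (id, label) in [(3_i64, "LABEL3"), (2_i64, "LABEL2")] {
        let s = list_builder.values();
        s.field_builder::<Int64Builder>(0).unwrap().append_value(id);
        s.field_builder::<StringBuilder>(1).unwrap().append_value(label);
        s.append(true);
        list_builder.append(true);
    }

    let filters: ArrayRef = Arc::new(list_builder.finish());
    let batch = RecordBatch::try_from_iter(vec![("filters", filters)])?;

    // If this write also fails, the regression is likely in arrow-rs;
    // if it succeeds, the problem is probably in DataFusion's parquet sink.
    let mut writer =
        ArrowWriter::try_new(File::create("direct.parquet")?, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```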
### Expected behavior
The parquet writer supports writing this kind of datatype, as it does in v32.
### Additional context
Maybe related to: https://github.com/apache/arrow-rs/issues/1744
I found this issue while trying to debug a different one that came up when
upgrading from v32 to v34. If the struct contains a timestamp, the error
instead becomes `Error: Internal("Unable to send array to writer!")` with the
source error `internal error: entered unreachable code: cannot downcast Int64
to byte array`.
An example of such a df:
```
+----------------------------------------------------------------------------------+
| filters                                                                          |
+----------------------------------------------------------------------------------+
| [{assignmentStartTs: 2023-11-11T11:11:11.000Z, filterTypeId: 3, label: LABEL1}]  |
| [{assignmentStartTs: 2023-11-11T11:11:11.000Z, filterTypeId: 2, label: LABEL2}]  |
+----------------------------------------------------------------------------------+
```
I tried to debug this issue myself by looking into the `arrow-rs`
implementation, but I didn't manage to find the commit that could have
changed this behavior. I also wasn't sure whether to open this bug here or in
the `arrow-rs` project, so I hope this is ok 😃.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]