devinjdangelo opened a new issue, #9736:
URL: https://github.com/apache/arrow-datafusion/issues/9736
### Describe the bug
An underflow panic can sometimes be triggered by setting `max_row_group_size`
to a value less than `execution.batch_size`.
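For readers unfamiliar with this class of bug, here is a minimal sketch of how an unsigned subtraction can underflow. The issue does not identify the failing expression, so the helper name `rows_remaining` and its parameters are hypothetical and are not taken from the DataFusion source. In a debug build, a plain `max_row_group_size - rows_buffered` on `usize` panics with "attempt to subtract with overflow" when the right-hand side is larger; `checked_sub` makes that case explicit instead.

```rust
/// Hypothetical helper (not from DataFusion): how many more rows fit in the
/// current row group. Returns `None` when more rows are already buffered than
/// the limit allows, which is where a plain `-` on `usize` would underflow.
fn rows_remaining(max_row_group_size: usize, rows_buffered: usize) -> Option<usize> {
    max_row_group_size.checked_sub(rows_buffered)
}
```

A caller would then have to decide what to do in the `None` case, for example by flushing the current row group before buffering more rows.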
### To Reproduce
I found this by trying a wide range of `max_row_group_size` values:
```rust
// Imports assumed for the DataFusion version current when this issue was filed.
use datafusion::config::TableParquetOptions;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Create a local execution context.
    let ctx = SessionContext::new();
    let testdata = "benchmarks/data/tpch_sf10/lineitem";
    let filename = &format!("{testdata}/part-0.parquet");

    // Define the query using the DataFrame API: the first 200,000 rows.
    let df = ctx
        .read_parquet(filename, ParquetReadOptions::default())
        .await?
        .limit(0, Some(200_000))?;
    println!("{}", df.clone().count().await?);

    // Try row group sizes from large to small.
    for row_group_size in (1..8193).step_by(283).rev() {
        println!("row group size: {}", row_group_size);

        println!("Writing without parallelism!");
        let row_group_path = format!("/tmp/{}.parquet", row_group_size);
        let mut options = TableParquetOptions::default();
        options.global.max_row_group_size = row_group_size;
        options.global.allow_single_file_parallelism = false;
        df.clone()
            .write_parquet(
                &row_group_path,
                DataFrameWriteOptions::new().with_single_file_output(true),
                Some(options),
            )
            .await
            .unwrap();

        println!("Writing with parallelism!");
        let row_group_path = format!("/tmp/para_{}.parquet", row_group_size);
        let mut options = TableParquetOptions::default();
        options.global.max_row_group_size = row_group_size;
        options.global.allow_single_file_parallelism = true;
        df.clone()
            .write_parquet(
                &row_group_path,
                DataFrameWriteOptions::new().with_single_file_output(true),
                Some(options),
            )
            .await
            .unwrap();
    }
    Ok(())
}
```
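To make the sweep in the reproducer concrete, the following standalone sketch lists the row group sizes that `(1..8193).step_by(283).rev()` visits. The function name `tested_row_group_sizes` is introduced here for illustration only. Every size it produces is below 8192, the default `execution.batch_size`, so each iteration exercises the condition described above.

```rust
/// Collects the row group sizes visited by the reproducer's loop,
/// in the order the loop sees them (largest first).
fn tested_row_group_sizes() -> Vec<usize> {
    (1..8193).step_by(283).rev().collect()
}
```

The range yields 29 values of the form `1 + 283 * k`, starting at 7925 and ending at 1.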
### Expected behavior
No combination of `max_row_group_size` and `execution.batch_size` should lead to
a panic.
### Additional context
Extremely small `max_row_group_size` values (e.g. 1) can cause a stack overflow
even when the parallel Parquet writer is disabled. We may want to raise a
configuration validation error for absurdly small row group sizes.
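One possible shape for such a validation check is sketched below. This is only an illustration: the function name, the `min_allowed` parameter, and the `String` error type are assumptions, not existing DataFusion APIs, and the issue does not propose a specific minimum value.

```rust
/// Hypothetical validation helper (not part of DataFusion): rejects row group
/// sizes below a caller-supplied minimum and returns the value otherwise.
fn validate_max_row_group_size(value: usize, min_allowed: usize) -> Result<usize, String> {
    if value < min_allowed {
        return Err(format!(
            "max_row_group_size must be at least {min_allowed}, got {value}"
        ));
    }
    Ok(value)
}
```

Running a check like this when the options are set would surface a clear configuration error instead of a panic or stack overflow during the write.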