Igosuki opened a new issue #1859:
URL: https://github.com/apache/arrow-datafusion/issues/1859
**Describe the bug**
I hit a limitation in the `PartitionColumnProjector` when there are too many distinct
partition values. Changing the dictionary keys to `UInt16` made it work.
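
As background on why widening the key type helps: a dictionary array can only hold as many distinct values as its key type can address, so 8-bit keys top out at 256 entries while `UInt16` keys allow 65,536. Below is a minimal, standalone sketch against the `arrow` crate (not DataFusion's actual `PartitionColumnProjector` code; the `UInt8` overflow described in the comment is an assumption based on the report that `UInt16` fixes it):

```rust
use arrow::array::DictionaryArray;
use arrow::datatypes::UInt16Type;

fn main() {
    // 2000 distinct partition values, mimicking the repro below
    // (e.g. "month=0" .. "month=1999").
    let values: Vec<String> = (0..2000).map(|i| format!("month={}", i)).collect();

    // An 8-bit dictionary key can only address 256 distinct entries, so a
    // `DictionaryArray<UInt8Type>` built from this data is expected to fail
    // once the 257th distinct value is appended:
    //
    //     let overflow: DictionaryArray<UInt8Type> =
    //         values.iter().map(|s| s.as_str()).collect();
    //
    // With 16-bit keys (up to 65,536 distinct entries) the same data fits.
    let dict: DictionaryArray<UInt16Type> = values.iter().map(|s| s.as_str()).collect();
    assert_eq!(dict.values().len(), 2000);
    println!("encoded {} distinct partition values", dict.values().len());
}
```

This matches the observation above: once the dictionary keys are widened to `UInt16`, thousands of distinct partition values fit.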
**To Reproduce**
Steps to reproduce the behavior:
For instance, I modified the `parquet_multiple_partitions` test in the
`path_partitions` module like this:
```rust
#[tokio::test]
async fn parquet_multiple_partitions() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    // 200 partition paths pass; bumping this to 2000 reproduces the failure.
    let store_paths = (0..200)
        .map(|i| {
            format!(
                "first=what/second=ok/year=2021/month={}/day={}/file.parquet",
                i, i
            )
        })
        .collect::<Vec<String>>();
    let store_paths_refs = store_paths
        .iter()
        .map(|s| s.as_str())
        .collect::<Vec<&str>>();
    register_partitioned_alltypes_parquet(
        &mut ctx,
        &store_paths_refs,
        &["first", "second", "year", "month", "day"],
        "",
        "alltypes_plain.parquet",
    )
    .await;
    let result = ctx
        .sql("SELECT id, day FROM t WHERE day=month and first='what' and second='ok' and year='2021' ORDER BY id")
        .await?
        .collect()
        .await?;
    Ok(())
}
```
With 200 partition paths it runs; with 2000 it fails.
**Expected behavior**
DataFusion shouldn't crash, even with thousands of partition values.
**Additional context**
I'm using pre-partitioned Parquet files to speed up iteration in notebooks.