andrei-ionescu opened a new issue #1383:
URL: https://github.com/apache/arrow-datafusion/issues/1383
**Describe the bug**
Reading wide, nested Parquet files results in an `index out of bounds` panic,
as seen below:
```
thread 'main' panicked at 'index out of bounds: the len is 17 but the index is 17', /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13
```
**To Reproduce**
1. Download the attached zipped Parquet file and unzip it:
[wide_schema_1row.parquet.zip](https://github.com/apache/arrow-datafusion/files/7621520/wide_schema_1row.parquet.zip)
2. Place it in a `./data` folder
3. Execute the following code:
```rust
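use datafusion::prelude::*;

// Run inside an async fn that returns `datafusion::error::Result<()>`.
let mut ctx = ExecutionContext::new();
let df = ctx.read_parquet("./data/wide_schema_1row.parquet").await?;
df.show().await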
```
4. The result is an `index out of bounds` panic:
```
thread 'main' panicked at 'index out of bounds: the len is 17 but the index is 17', /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13
stack backtrace:
   0: rust_begin_unwind
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:107:14
   2: core::panicking::panic_bounds_check
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:75:5
   3: <usize as core::slice::index::SliceIndex<[T]>>::index_mut
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/slice/index.rs:190:14
   4: core::slice::index::<impl core::ops::index::IndexMut<I> for [T]>::index_mut
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/slice/index.rs:26:9
   5: <alloc::vec::Vec<T,A> as core::ops::index::IndexMut<I>>::index_mut
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/mod.rs:2540:9
   6: datafusion::datasource::file_format::parquet::fetch_metadata
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:272:13
   7: <datafusion::datasource::file_format::parquet::ParquetFormat as datafusion::datasource::file_format::FileFormat>::infer_schema::{{closure}}
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/file_format/parquet.rs:96:27
   8: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
   9: <core::pin::Pin<P> as core::future::future::Future>::poll
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/future.rs:119:9
  10: datafusion::datasource::listing::table::ListingOptions::infer_schema::{{closure}}
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/datasource/listing/table.rs:99:27
  11: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  12: datafusion::logical_plan::builder::LogicalPlanBuilder::scan_parquet_with_name::{{closure}}
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/logical_plan/builder.rs:287:31
  13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  14: datafusion::logical_plan::builder::LogicalPlanBuilder::scan_parquet::{{closure}}
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/logical_plan/builder.rs:255:9
  15: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  16: datafusion::execution::context::ExecutionContext::read_parquet::{{closure}}
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/execution/context.rs:403:13
  17: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  18: read_parquet::main::{{closure}}
      at ./src/main.rs:79:14
  19: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/future/mod.rs:80:19
  20: tokio::park::thread::CachedParkThread::block_on::{{closure}}
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/park/thread.rs:263:54
  21: tokio::coop::with_budget::{{closure}}
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:106:9
  22: std::thread::local::LocalKey<T>::try_with
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/thread/local.rs:399:16
  23: std::thread::local::LocalKey<T>::with
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/thread/local.rs:375:9
  24: tokio::coop::with_budget
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:99:5
  25: tokio::coop::budget
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/coop.rs:76:5
  26: tokio::park::thread::CachedParkThread::block_on
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/park/thread.rs:263:31
  27: tokio::runtime::enter::Enter::block_on
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/enter.rs:151:13
  28: tokio::runtime::thread_pool::ThreadPool::block_on
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/thread_pool/mod.rs:77:9
  29: tokio::runtime::Runtime::block_on
      at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/mod.rs:463:43
  30: read_parquet::main
      at ./src/main.rs:80:5
  31: core::ops::function::FnOnce::call_once
      at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/ops/function.rs:227:5
```
**Expected behavior**
The Parquet file should be read and displayed without a panic.
**Additional context**
After some debugging, the error appears to happen in the `fetch_statistics`
function. More precisely, `schema.fields().len()`
([datasource/file_format/parquet.rs#L261](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/file_format/parquet.rs#L261))
returns only the top-level fields, while `row_group_meta.columns()`
([datasource/file_format/parquet.rs#L276-L277](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/file_format/parquet.rs#L276-L277))
returns all leaf columns.
In the given Parquet file there are 17 top-level fields and about 262 leaves,
so any buffer sized by the former and indexed by the latter overflows at index 17.
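
A minimal, self-contained sketch of the suspected mismatch, using only the standard library. It assumes the per-column accumulation indexes a vector sized by the top-level field count, which is consistent with the `IndexMut` frames in the backtrace but is not copied from the DataFusion source. The names `accumulate_null_counts`, `TOP_LEVEL_FIELDS` and `LEAF_COLUMNS` are hypothetical and are not DataFusion APIs.

```rust
// Stand-in numbers from this report: 17 top-level fields, roughly 262 leaf columns.
const TOP_LEVEL_FIELDS: usize = 17;
const LEAF_COLUMNS: usize = 262;

/// Adds per-leaf null counts into `null_counts`, which is sized by top-level fields.
/// Returns the index of the first leaf that has no slot, or `None` if all fit.
fn accumulate_null_counts(null_counts: &mut [usize], leaf_null_counts: &[usize]) -> Option<usize> {
    for (i, cnt) in leaf_null_counts.iter().enumerate() {
        match null_counts.get_mut(i) {
            Some(slot) => *slot += *cnt,
            // A plain `null_counts[i] += cnt` would panic at this index.
            None => return Some(i),
        }
    }
    None
}

fn main() {
    let mut null_counts = vec![0usize; TOP_LEVEL_FIELDS];
    let leaf_null_counts = vec![0usize; LEAF_COLUMNS];
    let first_bad = accumulate_null_counts(&mut null_counts, &leaf_null_counts);
    println!("first out-of-range leaf index: {:?}", first_bad);
}
```

With these sizes the first leaf without a slot is index 17, matching the `the len is 17 but the index is 17` message. Whether the right fix is to size the buffers by leaf count or to map leaves back to their top-level fields is left open here.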
DataFusion is `6.0`
Rust is `1.58.0-nightly (65c55bf93 2021-11-23)`
Cargo is `1.58.0-nightly (e1fb17631 2021-11-22)`