m-mueller678 opened a new issue, #7667: URL: https://github.com/apache/arrow-rs/issues/7667
**Which part is this question about**

library api: multithreaded reading without converting to arrow

**Describe your question**

I am trying to read a single file from multiple threads. The `ColumnReader`s return lots of protocol errors and sometimes panic. I am not sure if this is a bug or if I am using the library wrong. If the latter, I'd love to know what the correct way is.

**Additional context**

I am guessing that the issue is that multiple `RowGroupReader`s share the same file handle and interfere with each other by seeking. However, `SerializedFileReader::get_row_group` seems to be inviting me to do exactly this kind of sharing by taking a shared reference to the `SerializedFileReader`.

In my example, I use the lineitem table from TPC-H, generated using [tpchgen-cli](https://github.com/clflushopt/tpchgen-rs):

```sh
mkdir tpch-data
cd tpch-data
tpchgen-cli -s 1 --format=parquet
cd ..
```

Here is the code:

```rust
use parquet::column::reader::ColumnReader;
use parquet::file::metadata::RowGroupMetaData;
use parquet::file::reader::{FileReader, RowGroupReader, SerializedFileReader};
use rayon::prelude::*;

fn find_col(metadata: &RowGroupMetaData, reader: &dyn RowGroupReader, name: &str) -> ColumnReader {
    for (i, x) in metadata.columns().iter().enumerate() {
        if x.column_descr().name() == name {
            return reader.get_column_reader(i).unwrap();
        }
    }
    panic!("column {name:?} not found");
}

fn main() {
    let reader =
        SerializedFileReader::new(std::fs::File::open("./tpch-data/lineitem.parquet").unwrap())
            .unwrap();
    let metadata = reader.metadata();
    (0..metadata.num_row_groups())
        .into_par_iter()
        .for_each(|i| {
            let metadata = &metadata.row_group(i);
            let reader = reader.get_row_group(i).unwrap();
            let ColumnReader::Int64ColumnReader(mut reader_l_quantity_112) =
                find_col(metadata, &*reader, "l_quantity")
            else {
                panic!()
            };
            let mut read_buffer_l_quantity_113 = Vec::new();
            loop {
                let read_count_126 = reader_l_quantity_112
                    .read_records(10000, None, None, &mut read_buffer_l_quantity_113)
                    .unwrap()
                    .0;
                if read_count_126 == 0 {
                    break;
                }
            }
        })
}
```

Here are some of the errors I am seeing:

```
thread '<unnamed>' panicked at src/bin/parquet_issue.rs:34:22:
called `Result::unwrap()` on an `Err` value: External(ProtocolError { kind: Unknown, message: "cannot skip field type Stop" })

thread '<unnamed>' panicked at src/bin/parquet_issue.rs:34:22:
called `Result::unwrap()` on an `Err` value: External(ProtocolError { kind: Unknown, message: "missing required field PageHeader.type_" })

thread '<unnamed>' panicked at $HOME/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parquet-55.1.0/src/encodings/rle.rs:485:58:
index out of bounds: the len is 50 but the index is 58
```
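
To illustrate the kind of interference I suspect, here is a std-only sketch with no parquet involved. It shows that a handle obtained with `File::try_clone` shares its seek position with the original, while a second `File::open` on the same path does not. Whether the parquet readers actually end up sharing a cursor this way is my assumption, not something I have confirmed, and `cursor_positions` is just a hypothetical helper name for this demo.

```rust
use std::fs::File;
use std::io::{self, Seek, SeekFrom};
use std::path::Path;

/// Hypothetical helper for this demo: reports where `original`'s cursor sits
/// after (1) a cloned handle seeks and (2) an independently opened handle seeks.
fn cursor_positions(path: &Path) -> io::Result<(u64, u64)> {
    // Create a small scratch file so that the seeks below stay in bounds.
    std::fs::write(path, vec![0u8; 100])?;

    let positions = {
        let mut original = File::open(path)?;
        // `try_clone` duplicates the OS handle; both handles share one cursor.
        let mut cloned = original.try_clone()?;
        // A second `open` creates a separate handle with its own cursor.
        let mut independent = File::open(path)?;

        // Seeking through the clone also moves `original`.
        cloned.seek(SeekFrom::Start(40))?;
        let after_clone_seek = original.stream_position()?;

        // Seeking through the independent handle leaves `original` untouched.
        independent.seek(SeekFrom::Start(70))?;
        let after_independent_seek = original.stream_position()?;

        (after_clone_seek, after_independent_seek)
    };

    // All handles are closed at this point, so the scratch file can be removed.
    std::fs::remove_file(path)?;
    Ok(positions)
}
```

Calling `cursor_positions` with a scratch path (for example one under `std::env::temp_dir()`) returns the same offset twice: the clone's seek is visible through `original`, the independent handle's seek is not. If the row group readers share a cursor like the cloned handle does, concurrent seeks from different threads could explain the corrupted page headers above.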