Dandandan commented on issue #1363: URL: https://github.com/apache/arrow-datafusion/issues/1363#issuecomment-1000115758
> I'm not sure whether it is relevant or not. In the current implementation, the `sync_chunk_reader()` method is invoked for every parquet page, which causes lots of unnecessary file open and seek calls.
>
> `FilePageIterator.next()` -> `FileReader.get_row_group().get_column_page_reader()` -> `SerializedRowGroupReader.get_column_page_reader()` -> `ChunkObjectReader.get_read()` -> `LocalFileReader.sync_chunk_reader()`
>
> ```rust
> fn sync_chunk_reader(
>     &self,
>     start: u64,
>     length: usize,
> ) -> Result<Box<dyn Read + Send + Sync>> {
>     // A new file descriptor is opened for each chunk reader.
>     // This is okay because chunks are usually fairly large.
>     let mut file = File::open(&self.file.path)?;
>     file.seek(SeekFrom::Start(start))?;
>
>     let file = BufReader::new(file.take(length as u64));
>
>     Ok(Box::new(file))
> }
> ```
>
> TPCH Q1:
>
> Read parquet file lineitem.parquet time spent: 590639777 ns, row group count 60, skipped row group 0
> total open/seek count 421, bytes read from FS: 97028517
> memory alloc size: 1649375985, memory alloc count: 499533 during parquet read
>
> Query 1 iteration 0 took 679.9 ms

I think that might be relevant. How about opening a new issue to track improving on that (reusing the file descriptor)?