Dandandan commented on issue #1363:
URL: https://github.com/apache/arrow-datafusion/issues/1363#issuecomment-1000115758


   > I'm not sure whether it is relevant or not. In the current implementation, the sync_chunk_reader() method is invoked for every parquet page, which causes lots of unnecessary file open and seek calls.
   > 
   > FilePageIterator.next() -> FileReader.get_row_group().get_column_page_reader() -> SerializedRowGroupReader.get_column_page_reader() -> ChunkObjectReader.get_read() -> LocalFileReader.sync_chunk_reader()
   > 
   > ```rust
   >     fn sync_chunk_reader(
   >         &self,
   >         start: u64,
   >         length: usize,
   >     ) -> Result<Box<dyn Read + Send + Sync>> {
   >         // A new file descriptor is opened for each chunk reader.
   >         // This is okay because chunks are usually fairly large.
   >         let mut file = File::open(&self.file.path)?;
   >         file.seek(SeekFrom::Start(start))?;
   >
   >         let file = BufReader::new(file.take(length as u64));
   >
   >         Ok(Box::new(file))
   >     }
   > ```
   > 
   > TPCH Q1:
   > 
   > Read parquet file lineitem.parquet time spent: 590639777 ns, row group count 60, skipped row group 0
   > total open/seek count 421, bytes read from FS: 97028517
   > memory alloc size: 1649375985 memory alloc count: 499533 during parquet read.
   > 
   > Query 1 iteration 0 took 679.9 ms
   
   I think that might be relevant. How about opening a new issue to track improving that (reusing the file descriptor)?
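   For illustration, here is a minimal sketch of what reusing the descriptor could look like on Unix: instead of calling File::open for every chunk, the reader holds one shared File and issues positional reads via std::os::unix::fs::FileExt::read_exact_at (a pread(2)), which needs no seek and is safe to share across threads. The SharedChunkReader type and returning an owned buffer are my assumptions for the sketch, not the actual DataFusion API.

   ```rust
   use std::fs::File;
   use std::io::{Result, Write};
   use std::os::unix::fs::FileExt; // provides read_exact_at (pread) on Unix
   use std::sync::Arc;

   /// Hypothetical reader that reuses a single file descriptor for all chunks.
   struct SharedChunkReader {
       file: Arc<File>,
   }

   impl SharedChunkReader {
       fn sync_chunk_reader(&self, start: u64, length: usize) -> Result<Vec<u8>> {
           // read_exact_at reads at an absolute offset: no open, no seek,
           // and the shared descriptor's cursor is never mutated.
           let mut buf = vec![0u8; length];
           self.file.read_exact_at(&mut buf, start)?;
           Ok(buf)
       }
   }

   fn main() -> Result<()> {
       // Demo: write a small file once, then read a chunk without reopening it.
       let path = std::env::temp_dir().join("chunk_demo.bin");
       {
           let mut f = File::create(&path)?;
           f.write_all(b"0123456789")?;
       }
       let reader = SharedChunkReader {
           file: Arc::new(File::open(&path)?),
       };
       let chunk = reader.sync_chunk_reader(2, 4)?;
       assert_eq!(chunk, b"2345");
       println!("{}", String::from_utf8_lossy(&chunk));
       std::fs::remove_file(&path)?;
       Ok(())
   }
   ```

   A portable version would need a fallback (e.g. a mutex around seek+read, or seek_read on Windows), but the positional-read approach avoids both the per-page open and the seek entirely.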

