[GitHub] [arrow-datafusion] rdettai edited a comment on issue #1363: Major performance regression in reading partitioned Parquet data on master

GitBox Fri, 26 Nov 2021 07:02:16 -0800


rdettai edited a comment on issue #1363:
URL: 
https://github.com/apache/arrow-datafusion/issues/1363#issuecomment-980013409



   ~~One possible reason might be that #1010 introduces the use of the 
`ObjectStore`~~: 
https://github.com/apache/arrow-datafusion/blob/414c826bf06fd22e0bb52edbb497791b5fe558e0/datafusion/src/physical_plan/file_format/parquet.rs#L408-L411
 
   
   ~~The abstraction requires the use of **dynamic dispatch on the reader** 
(`fn sync_chunk_reader(&self, start: u64, length: usize) -> Result<Box<dyn Read 
+ Send + Sync>>`), which can indeed reduce performance if `read()` is called a 
lot. Actually, now that I think about it, some old memories are coming back: if 
I remember correctly, 2 years ago when I was first playing with the parquet 
reader I noticed that something like this was happening. `read()` was called in 
such a way that it often fetched only 1 byte at a time.~~
   
   EDIT: I tried reverting just that part back to the original `std::fs::File` 
wrapped with a `SerializedFileReader`, and performance remains just as bad! 
Hypothesis invalidated! The issue is not the dynamic dispatch 😕.
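
   For anyone who wants to check the "many tiny `read()` calls through a boxed
reader" idea independently, here is a minimal standalone sketch. It only uses
the standard library; `CountingReader`, `read_byte_by_byte` and `measure` are
hypothetical names made up for illustration and are not part of DataFusion or
the parquet crate. It counts how many `read()` calls reach a
`Box<dyn Read + Send + Sync>` when a consumer pulls one byte at a time, with
and without a `BufReader` in between.

```rust
use std::io::{BufReader, Cursor, Read};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical helper: wraps any reader and records how many times
/// `read()` is called and how many bytes were returned in total.
struct CountingReader<R> {
    inner: R,
    calls: Arc<AtomicUsize>,
    bytes: Arc<AtomicUsize>,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.calls.fetch_add(1, Ordering::Relaxed);
        self.bytes.fetch_add(n, Ordering::Relaxed);
        Ok(n)
    }
}

/// Simulates a consumer that asks for a single byte per `read()` call.
fn read_byte_by_byte(mut reader: impl Read) -> usize {
    let mut total = 0;
    let mut byte = [0u8; 1];
    while reader.read(&mut byte).unwrap() == 1 {
        total += 1;
    }
    total
}

/// Returns (read calls, bytes) observed by the boxed, dynamically
/// dispatched reader for a 4096-byte in-memory source.
fn measure(buffered: bool) -> (usize, usize) {
    let calls = Arc::new(AtomicUsize::new(0));
    let bytes = Arc::new(AtomicUsize::new(0));
    let boxed: Box<dyn Read + Send + Sync> = Box::new(CountingReader {
        inner: Cursor::new(vec![0u8; 4096]),
        calls: calls.clone(),
        bytes: bytes.clone(),
    });
    let total = if buffered {
        // The BufReader absorbs the 1-byte reads and refills in larger chunks.
        read_byte_by_byte(BufReader::with_capacity(1024, boxed))
    } else {
        // Every 1-byte read goes through the trait object.
        read_byte_by_byte(boxed)
    };
    assert_eq!(total, 4096);
    (calls.load(Ordering::Relaxed), bytes.load(Ordering::Relaxed))
}

fn main() {
    let (calls, bytes) = measure(false);
    println!("unbuffered: {} calls, {} bytes", calls, bytes);
    let (calls, bytes) = measure(true);
    println!("buffered:   {} calls, {} bytes", calls, bytes);
}
```

   Wrapping the real reader in something like this would show whether the
number of calls changed between the two versions. It says nothing about where
the regression actually comes from.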


