Dandandan commented on pull request #1366:
URL: https://github.com/apache/arrow-datafusion/pull/1366#issuecomment-981131326


   > Thanks @Dandandan ! Can you quickly explain what the reason for the 
slowdown was exactly?
   
   As far as I can explain:
   
   The earlier code read the file through the `parquet`-based API, which 
uses a `BufReader` internally. That buffering is crucial for performance.
   
   After introducing the object storage abstraction, we were reading directly 
from a `File` instance without any buffering in between, i.e. making lots of 
extra calls to the OS (as you also hinted at in #1363).
   This slowed down loading the data and also made the part that reads 
metadata/statistics very expensive (it normally takes something like <1ms 
locally). That part probably issues many small `read` calls.
   
   By wrapping the `File` instance in the `BufReader` we avoid those calls to 
the OS.
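
   To illustrate the effect, here is a minimal sketch, not the code from this 
PR. `CountingReader` and `count_underlying_reads` are hypothetical names, and 
an in-memory `Cursor` stands in for the `File`; with a real file, every call 
counted here would be a call to the OS.

```rust
use std::io::{BufReader, Cursor, Read, Result};

/// Hypothetical wrapper that counts how often the underlying reader is hit.
/// With a real `File`, each of these calls would reach the OS.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Reads `data` in 4-byte chunks (mimicking many small reads) and returns
/// how many calls reached the underlying reader.
fn count_underlying_reads(data: Vec<u8>, buffered: bool) -> Result<usize> {
    let source = CountingReader { inner: Cursor::new(data), calls: 0 };
    let mut chunk = [0u8; 4];
    if buffered {
        // Small reads are served from the in-memory buffer of the `BufReader`.
        let mut reader = BufReader::new(source);
        while reader.read(&mut chunk)? > 0 {}
        Ok(reader.get_ref().calls)
    } else {
        // Every small read goes straight to the underlying reader.
        let mut reader = source;
        while reader.read(&mut chunk)? > 0 {}
        Ok(reader.calls)
    }
}
```

   Comparing `count_underlying_reads(data.clone(), false)` with 
`count_underlying_reads(data, true)` for a few KiB of input should show the 
buffered variant reaching the underlying reader far less often.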
   
   A potential improvement would be to offer a bit more control, such as 
setting the capacity of the buffer.
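
   As a rough sketch of what such control could look like, the standard 
library already offers `BufReader::with_capacity`. The helpers `open_buffered` 
and `read_all` below are hypothetical and not part of DataFusion; how a 
capacity setting would be exposed through the object store API is left open.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};
use std::path::Path;

/// Hypothetical helper: open `path` with a caller-chosen buffer capacity
/// instead of the default used by `BufReader::new`.
fn open_buffered(path: &Path, capacity: usize) -> io::Result<BufReader<File>> {
    let file = File::open(path)?;
    Ok(BufReader::with_capacity(capacity, file))
}

/// Hypothetical helper: read the whole file through the buffered reader.
fn read_all(path: &Path, capacity: usize) -> io::Result<Vec<u8>> {
    let mut reader = open_buffered(path, capacity)?;
    let mut bytes = Vec::new();
    reader.read_to_end(&mut bytes)?;
    Ok(bytes)
}
```

   A larger capacity trades memory for fewer calls to the OS; a suitable value 
would have to be measured for the workload.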


