Dandandan commented on pull request #1366: URL: https://github.com/apache/arrow-datafusion/pull/1366#issuecomment-981131326
> Thanks @Dandandan ! Can you quickly explain what the reason for the slowdown was exactly?

As far as I can tell: the earlier code used the `parquet`-based API to read from a file, which uses a `BufReader` internally, and that buffering is crucial for performance. By introducing the object storage abstraction, we were reading directly from a `File` instance without any buffering in between, i.e. making lots of extra calls to the OS (as you also hinted at in #1363). This slowed down loading the data and also made the part that reads metadata/statistics very expensive (it normally takes something like <1ms locally). That part probably does many small `read` calls. By wrapping the `File` instance in a `BufReader` we avoid most of those calls to the OS.

A potential improvement would be to allow a bit more control, such as setting the capacity of the buffer.
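
To illustrate the effect described above, here is a minimal sketch that is not taken from the DataFusion code. It counts how many `read` calls reach the underlying reader with and without a `BufReader`. `CountingReader` and `count_inner_reads` are hypothetical names introduced only for this example, and an in-memory `Cursor` stands in for the `File`, so the counted calls only approximate the system calls a real file would incur. `BufReader::with_capacity` is the standard library constructor that allows setting the buffer capacity.

```rust
use std::io::{BufReader, Cursor, Read};

/// Hypothetical wrapper that counts how often `read` is called on the inner reader.
/// With a real `File`, each of these calls would go to the OS.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Reads `data` in 4-byte chunks, optionally through a `BufReader` with the
/// given capacity, and returns how many `read` calls reached the inner reader.
fn count_inner_reads(data: &[u8], buffer_capacity: Option<usize>) -> usize {
    let counting = CountingReader {
        inner: Cursor::new(data.to_vec()),
        calls: 0,
    };
    let mut chunk = [0u8; 4];
    match buffer_capacity {
        None => {
            // Unbuffered: every small read hits the inner reader directly.
            let mut reader = counting;
            while reader.read(&mut chunk).unwrap() > 0 {}
            reader.calls
        }
        Some(capacity) => {
            // Buffered: small reads are served from the in-memory buffer,
            // which is refilled from the inner reader in large chunks.
            let mut reader = BufReader::with_capacity(capacity, counting);
            while reader.read(&mut chunk).unwrap() > 0 {}
            reader.into_inner().calls
        }
    }
}

fn main() {
    let data = vec![0u8; 64 * 1024];
    let unbuffered = count_inner_reads(&data, None);
    let buffered = count_inner_reads(&data, Some(8 * 1024));
    println!(
        "unbuffered: {} inner reads, buffered: {} inner reads",
        unbuffered, buffered
    );
}
```

The buffered variant should reach the inner reader far less often than the unbuffered one. How much a larger capacity helps in practice depends on the access pattern, which is why making it configurable could be useful.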