m09526 commented on issue #380: URL: https://github.com/apache/arrow-rs-object-store/issues/380#issuecomment-2922384870
> Depends what the intention of "logging" is. If it's for network IO, I agree. If it's to log semantic operations, I think it should be an `ObjectStore` wrapper. This way it could also be used for in-mem or FS-backed stores (or any third-party store).

In this case, we want to log the semantic operations, i.e. which `ObjectStore`-level operations other code has called on it. This isn't designed to log at the network layer, and we certainly want it to work across implementations, e.g. local file systems and in-memory backed ones. In our case, we used it to discover how Apache DataFusion was accessing remote object stores. This made analysing its access patterns on files incredibly quick and easy, as we could see _how_ it was retrieving data and how those patterns changed with different DataFusion queries.

> With regards to read-ahead, I'm a little torn. The design of object_store is to encourage people to avoid using cursor-based access patterns that warrant this sort of incremental read-ahead pattern - https://docs.rs/object_store/0.12.0/object_store/#why-not-a-filesystem-interface. Instead people should identify the range up front, and fetch and process it, potentially in a streaming fashion where applicable.

Maybe read-ahead isn't the most appropriate or descriptive name for it. Its primary purpose is to reduce the number of network requests generated when talking to a remote object store, especially when those requests are chargeable or carry a small but measurable latency. Implemented as a wrapper around any other `ObjectStore` implementation, it acts transparently, re-using data streams where it can. When processing large Parquet files with the logging wrapper described above, we could see our implementation making one `ObjectStore` request for each Parquet row group, each of which translated to one network request. For a large file with hundreds of row groups, using the read-ahead wrapper, we reduced the number of network requests from hundreds to a small handful.
That is the intent here.
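
To make the two wrappers concrete, here is a minimal sketch of the idea. It assumes a hypothetical, synchronous `RangeStore` trait in place of the real async `ObjectStore` trait. `MemStore`, `LoggingStore`, `ReadAheadStore`, the `window` parameter and `count_backend_requests` are all illustrative names, not code from our implementation or from `object_store`.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical, simplified stand-in for `ObjectStore`: one synchronous
/// method, no error handling. Ranges are assumed to satisfy start <= end.
trait RangeStore {
    fn get_range(&self, path: &str, range: Range<usize>) -> Vec<u8>;
}

/// In-memory backend standing in for a remote store.
struct MemStore {
    objects: HashMap<String, Vec<u8>>,
}

impl RangeStore for MemStore {
    fn get_range(&self, path: &str, range: Range<usize>) -> Vec<u8> {
        let data = &self.objects[path];
        // Clamp to the object length so over-long requests are truncated.
        let end = range.end.min(data.len());
        let start = range.start.min(end);
        data[start..end].to_vec()
    }
}

/// Logging wrapper: records each semantic operation, then delegates.
struct LoggingStore<S> {
    inner: S,
    log: RefCell<Vec<String>>,
}

impl<S: RangeStore> RangeStore for LoggingStore<S> {
    fn get_range(&self, path: &str, range: Range<usize>) -> Vec<u8> {
        let entry = format!("get_range {} {}..{}", path, range.start, range.end);
        self.log.borrow_mut().push(entry);
        self.inner.get_range(path, range)
    }
}

/// Read-ahead wrapper: on a miss it fetches `window` extra bytes past the
/// requested range and keeps them, so later requests that fall entirely
/// inside the buffered span never reach the inner store.
struct ReadAheadStore<S> {
    inner: S,
    window: usize,
    /// (path, start offset, buffered bytes) from the most recent miss.
    buffer: RefCell<Option<(String, usize, Vec<u8>)>>,
}

impl<S: RangeStore> RangeStore for ReadAheadStore<S> {
    fn get_range(&self, path: &str, range: Range<usize>) -> Vec<u8> {
        let mut buf = self.buffer.borrow_mut();
        if let Some((p, start, bytes)) = &*buf {
            let s = *start;
            if p.as_str() == path && range.start >= s && range.end <= s + bytes.len() {
                return bytes[range.start - s..range.end - s].to_vec();
            }
        }
        // Miss: one larger request to the inner store, then buffer the result.
        let fetched = self.inner.get_range(path, range.start..range.end + self.window);
        let n = (range.end - range.start).min(fetched.len());
        let out = fetched[..n].to_vec();
        *buf = Some((path.to_string(), range.start, fetched));
        out
    }
}

/// Reads `chunks` consecutive ranges of `chunk_len` bytes (think: one range
/// per row group) and returns how many requests reached the backend.
fn count_backend_requests(chunks: usize, chunk_len: usize, window: usize) -> usize {
    let data: Vec<u8> = (0..chunks * chunk_len).map(|i| (i % 251) as u8).collect();
    let mut objects = HashMap::new();
    objects.insert("file.parquet".to_string(), data.clone());

    // The logger sits below the read-ahead layer, so it sees backend traffic.
    let logged = LoggingStore { inner: MemStore { objects }, log: RefCell::new(Vec::new()) };
    let store = ReadAheadStore { inner: logged, window, buffer: RefCell::new(None) };

    for i in 0..chunks {
        let range = i * chunk_len..(i + 1) * chunk_len;
        let got = store.get_range("file.parquet", range.clone());
        assert_eq!(got, data[range].to_vec());
    }
    let requests = store.inner.log.borrow().len();
    requests
}
```

In this toy setup, `count_backend_requests(10, 100, 400)` issues ten reads against the wrapper but only two reach the backend, whereas a `window` of 0 passes all ten through. Because both wrappers only depend on the trait, they can be stacked in either order and placed over any backend, which is the property we care about for the real `ObjectStore` wrappers.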