m09526 commented on issue #380:
URL: 
https://github.com/apache/arrow-rs-object-store/issues/380#issuecomment-2922384870

   > Depends what the intention of "logging" is. If it's for network IO, I 
agree. If it's to log semantic operations, I think it should be an 
`ObjectStore` wrapper. This way it could also be used for in-mem or FS-backed 
stores (or any third-party store).
   
   In this case, we want to log the semantic operations, i.e. which 
`ObjectStore`-level operations other code has called on it. This isn't designed 
to log at the network layer, and we certainly want it to work across 
implementations, e.g. local file system and in-memory backed ones. In our 
case, we used it to discover how Apache DataFusion was accessing remote object 
stores. This made analysing its access patterns on files quick and easy, as 
we could see _how_ it was retrieving data and how those patterns changed with 
different DataFusion queries.
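
To make the wrapper idea concrete, here is a minimal sketch of the logging layer. It is not our actual implementation and it does not use the real `object_store` API: `Store`, `MemStore`, `LoggingStore` and `logging_demo` are hypothetical names, and the trait is a synchronous, single-method stand-in for the async `ObjectStore` trait so that the example stays self-contained.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical, simplified stand-in for the `ObjectStore` trait.
/// The real trait is async and has many more methods.
trait Store {
    fn get_range(&self, path: &str, range: Range<usize>) -> Option<Vec<u8>>;
}

/// In-memory backend, to show that the wrapper is backend-agnostic.
struct MemStore {
    objects: HashMap<String, Vec<u8>>,
}

impl Store for MemStore {
    fn get_range(&self, path: &str, range: Range<usize>) -> Option<Vec<u8>> {
        let data = self.objects.get(path)?;
        data.get(range).map(|s| s.to_vec())
    }
}

/// Wrapper that records every semantic operation, then delegates to `inner`.
struct LoggingStore<S: Store> {
    inner: S,
    log: RefCell<Vec<String>>,
}

impl<S: Store> Store for LoggingStore<S> {
    fn get_range(&self, path: &str, range: Range<usize>) -> Option<Vec<u8>> {
        self.log.borrow_mut().push(format!(
            "get_range path={} range={}..{}",
            path, range.start, range.end
        ));
        self.inner.get_range(path, range)
    }
}

/// Issues two reads through the wrapper and returns the recorded log.
fn logging_demo() -> Vec<String> {
    let mut objects = HashMap::new();
    objects.insert(
        "data/file.parquet".to_string(),
        (0u8..100).collect::<Vec<u8>>(),
    );
    let store = LoggingStore {
        inner: MemStore { objects },
        log: RefCell::new(Vec::new()),
    };
    let _ = store.get_range("data/file.parquet", 92..100); // e.g. a footer read
    let _ = store.get_range("data/file.parquet", 0..8);
    store.log.into_inner()
}
```

Because the wrapper only depends on the trait, the same logging would apply to a local file system or remote backend without change.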
   
   > With regards to read-ahead, I'm a little torn. The design of object_store 
is to encourage people to avoid using cursor-based access patterns that warrant 
this sort of incremental read-ahead pattern - 
https://docs.rs/object_store/0.12.0/object_store/#why-not-a-filesystem-interface.
 Instead people should identify the range up front, and fetch and process it, 
potentially in a streaming fashion where applicable.
   
   Maybe read-ahead isn't the most appropriate or descriptive name for it. Its 
primary purpose is to reduce the number of network requests generated when 
talking to a remote object store, especially when those requests are chargeable 
or carry a small but measurable latency. Implemented as a wrapper around any 
other `ObjectStore` implementation, it acts transparently, re-using data 
streams where it can. When processing large Parquet files with the logger 
implementation above, we could see our implementation making one `ObjectStore` 
request for each Parquet row group, each of which translated to one network 
request. For a large file with hundreds of row groups, using the read-ahead 
wrapper reduced the number of network requests from hundreds to a small 
handful. That is the intent here.
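
As a rough illustration of the effect, the sketch below wraps a request-counting store in a read-ahead layer and reads 300 row-group-sized ranges in order. All of it is assumed for the example: `Store` is a synchronous stand-in for the async `ObjectStore` trait, the names, sizes and read-ahead window are made up, and the wrapper keeps one buffered chunk rather than re-using data streams as our implementation does.

```rust
use std::cell::{Cell, RefCell};
use std::ops::Range;

/// Hypothetical, simplified stand-in for the `ObjectStore` trait.
trait Store {
    fn get_range(&self, path: &str, range: Range<usize>) -> Option<Vec<u8>>;
}

/// Stands in for a remote store: each call counts as one network request.
struct CountingStore {
    data: Vec<u8>,
    requests: Cell<usize>,
}

impl Store for CountingStore {
    fn get_range(&self, _path: &str, range: Range<usize>) -> Option<Vec<u8>> {
        self.requests.set(self.requests.get() + 1);
        // Clamp the end so an over-long read-ahead request still succeeds.
        let end = range.end.min(self.data.len());
        self.data.get(range.start..end).map(|s| s.to_vec())
    }
}

/// Wrapper that over-fetches on a miss and serves later reads from the buffer.
struct ReadAheadStore<S: Store> {
    inner: S,
    read_ahead: usize,
    /// Last fetched chunk: (path, start offset, bytes).
    buffer: RefCell<Option<(String, usize, Vec<u8>)>>,
}

impl<S: Store> Store for ReadAheadStore<S> {
    fn get_range(&self, path: &str, range: Range<usize>) -> Option<Vec<u8>> {
        if range.start > range.end {
            return None;
        }
        let mut buffer = self.buffer.borrow_mut();
        if let Some((p, start, bytes)) = buffer.as_ref() {
            let covered = range.start >= *start && range.end <= *start + bytes.len();
            if p.as_str() == path && covered {
                // Hit: no request reaches the inner store.
                return Some(bytes[range.start - *start..range.end - *start].to_vec());
            }
        }
        // Miss: fetch the requested range plus the read-ahead window.
        let fetch_end = range.end.max(range.start + self.read_ahead);
        let bytes = self.inner.get_range(path, range.start..fetch_end)?;
        let out = bytes.get(..range.end - range.start)?.to_vec();
        *buffer = Some((path.to_string(), range.start, bytes));
        Some(out)
    }
}

/// Returns (requests without wrapper, requests with wrapper).
fn read_ahead_demo() -> (usize, usize) {
    const ROW_GROUPS: usize = 300; // assumed number of row groups
    const GROUP_SIZE: usize = 1_000; // assumed bytes per row group
    let data: Vec<u8> = (0..ROW_GROUPS * GROUP_SIZE).map(|i| (i % 251) as u8).collect();

    let direct = CountingStore { data: data.clone(), requests: Cell::new(0) };
    for g in 0..ROW_GROUPS {
        direct.get_range("f", g * GROUP_SIZE..(g + 1) * GROUP_SIZE).unwrap();
    }

    let wrapped = ReadAheadStore {
        inner: CountingStore { data, requests: Cell::new(0) },
        read_ahead: 64 * GROUP_SIZE,
        buffer: RefCell::new(None),
    };
    for g in 0..ROW_GROUPS {
        let range = g * GROUP_SIZE..(g + 1) * GROUP_SIZE;
        let bytes = wrapped.get_range("f", range.clone()).unwrap();
        // The wrapper must return exactly the bytes the caller asked for.
        assert_eq!(bytes.as_slice(), &wrapped.inner.data[range]);
    }
    (direct.requests.get(), wrapped.inner.requests.get())
}
```

With these assumed sizes, the unwrapped store sees one request per row group, while the wrapped store only sees a request each time a read falls outside the buffered chunk.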
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

