steveloughran commented on issue #2205:
URL: 
https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1100069800

   Choosing when and how to scan and prefetch in object stores is a genuinely tricky business.
   
   The abfs and gcs connectors do forward prefetching in block sizes you can configure in Hadoop site/job settings, caching the blocks in memory. The more prefetching you do, the more likely a large process is to run out of memory.
   
   s3a doesn't, and we've been getting complaints about the lack of buffering in the client. It does have different seek policies; look at fs.s3a.experimental.input.fadvise and fs.s3a.readahead.range.
   
   You can set the seek policy cluster-wide or, if you use the openFile() API, per file when opening it; there's a sketch of both below.
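   A minimal sketch of the two options, assuming a Hadoop 3.3+ build; the bucket/object path and class name are made up for illustration:

```java
import java.util.concurrent.CompletableFuture;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekPolicyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // cluster/job-wide defaults (core-site.xml or job config):
    // seek policy: "normal", "sequential" or "random"
    conf.set("fs.s3a.experimental.input.fadvise", "random");
    // bytes to read ahead past the requested position on sequential reads
    conf.set("fs.s3a.readahead.range", "1048576");

    // hypothetical object, just for illustration
    Path path = new Path("s3a://some-bucket/data/part-0000.parquet");
    FileSystem fs = path.getFileSystem(conf);

    // per-file override through the openFile() builder API
    CompletableFuture<FSDataInputStream> future = fs.openFile(path)
        .opt("fs.s3a.experimental.input.fadvise", "random")
        .build();
    try (FSDataInputStream in = future.get()) {
      byte[] magic = new byte[4];
      // positioned read; the "random" policy avoids breaking the HTTP stream on seeks
      in.readFully(0, magic);
    }
  }
}
```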
   
   We have two big pieces of work ongoing to help mitigate this, both in feature branches right now:
   * HADOOP-18103: vectored IO API. It will be available on all FSDataInputStream instances; object stores can improve on the default by coalescing ranges and fetching different ranges in parallel (s3a will be first here; see the sketch below).
   * HADOOP-18028: high-performance S3A input stream with prefetching and caching to local disk. The feature branch works, but for broader adoption we again need to deal with memory/buffer use and some other issues.
   It would be really good to have you involved in reviewing/testing the vectored IO API (yes, we want a native binding too) and the prefetching work, and indeed to get good traces of how your library reads files.
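   As a rough sketch of how the vectored IO API is shaped, under the caveat that it is still in a feature branch and details may change before merge (the path and class name here are hypothetical):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VectoredReadSketch {
  public static void main(String[] args) throws Exception {
    // hypothetical object, just for illustration
    Path path = new Path("s3a://some-bucket/data/part-0000.parquet");
    FileSystem fs = path.getFileSystem(new Configuration());

    // a few non-contiguous ranges, e.g. selected column chunks plus the footer
    List<FileRange> ranges = Arrays.asList(
        FileRange.createFileRange(0, 4096),
        FileRange.createFileRange(8 * 1024 * 1024, 1024 * 1024));

    try (FSDataInputStream in = fs.openFile(path).build().get()) {
      // the stream may coalesce nearby ranges and fetch them in parallel
      in.readVectored(ranges, ByteBuffer::allocate);
      for (FileRange range : ranges) {
        // each range completes independently as its data arrives
        ByteBuffer data = range.getData().get();
        System.out.println("read " + data.remaining()
            + " bytes at offset " + range.getOffset());
      }
    }
  }
}
```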
   
   Note also that the s3a and abfs connectors collect and report stats through the IOStatistics interface. Even if you build against Hadoop versions which don't have that:
   1. If you call toString() on the streams, you get a good summary of the IO which took place in that stream only. Log this at debug.
   2. On Hadoop 3.3.2, set "fs.iostatistics.logging.level" to "info" and you get a full filesystem stats dump when the FS instance is closed.
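   A minimal sketch of both points, assuming an SLF4J logger and a hypothetical path/class name:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class StreamStatsLogging {
  private static final Logger LOG = LoggerFactory.getLogger(StreamStatsLogging.class);

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hadoop 3.3.2: dump full filesystem statistics when the FS instance is closed
    conf.set("fs.iostatistics.logging.level", "info");

    // hypothetical object, just for illustration
    Path path = new Path("s3a://some-bucket/data/part-0000.parquet");
    FileSystem fs = path.getFileSystem(conf);
    try (FSDataInputStream in = fs.open(path)) {
      byte[] buf = new byte[8192];
      in.readFully(0, buf);
      // toString() summarises the IO performed on this stream only; log it at debug
      LOG.debug("stream statistics: {}", in);
    }
  }
}
```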

