steveloughran commented on PR #7214:
URL: https://github.com/apache/hadoop/pull/7214#issuecomment-2577690001

   I'm just setting this up so it is ready for the analytics stream work. 
Making sure that prefetch is also covered is my way to validate the factory 
model, and to confirm that the stream options need to cover things like asking 
for a shared thread pool and a stream thread pool, with the intent that 
analytics will use those too.
   
   And once I do that, they all need a single base stream class.
   
   For my vector IO resilience PR, once I have this PR in, I'm going to go back 
to #7105 and make it something which works with all object input streams:
   
   
   * probe the stream for being "all in memory"; if so, just do the reads 
sequentially, no need to parallelize.
   * if "partially in memory", give the implementation the list of ranges and 
have it split them into "all in memory" and "needs retrieval". Again, in-memory 
blocks can be filled in immediately (needs a lock on removing cache items)
   * coalesce ranges
   * sort by largest range first (stops the tail being the bottleneck)
   * queue for reading
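
   A minimal sketch of the coalesce and largest-first steps, to make the 
ordering concrete. Everything here is an assumption for illustration: the 
`RangePlanner` and `Range` names, the `maxGap` parameter and the merge rule are 
hypothetical and do not use Hadoop's own vectored IO classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RangePlanner {

  /** Hypothetical byte range covering [offset, offset + length). */
  public static final class Range {
    public final long offset;
    public final int length;

    public Range(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }

    long end() {
      return offset + length;
    }
  }

  /**
   * Merge ranges whose gap is at most maxGap bytes.
   * The input does not need to be sorted; overlapping ranges are merged too.
   */
  public static List<Range> coalesce(List<Range> ranges, long maxGap) {
    List<Range> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingLong((Range r) -> r.offset));
    List<Range> merged = new ArrayList<>();
    for (Range r : sorted) {
      if (!merged.isEmpty()) {
        Range last = merged.get(merged.size() - 1);
        if (r.offset - last.end() <= maxGap) {
          // Close enough: extend the previous range to cover this one.
          long end = Math.max(last.end(), r.end());
          merged.set(merged.size() - 1,
              new Range(last.offset, (int) (end - last.offset)));
          continue;
        }
      }
      merged.add(r);
    }
    return merged;
  }

  /** Order ranges so the largest reads are queued first. */
  public static List<Range> largestFirst(List<Range> ranges) {
    List<Range> ordered = new ArrayList<>(ranges);
    ordered.sort(Comparator.comparingInt((Range r) -> r.length).reversed());
    return ordered;
  }
}
```

   Calling `largestFirst(coalesce(ranges, maxGap))` gives the order in which 
the reads would be queued; starting the longest read first is what keeps it 
from becoming the tail of the whole operation.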
   
   On read failure:
   1. single range: retry
   2. merged range: complete the parts which were read successfully
   3. incomplete parts are split back into their original ranges and reread 
individually in the same thread, with retries on each
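
   The recovery steps above could look roughly like the sketch below. It is 
not the code in #7105: the `MergedRangeRecovery` class, the `PositionedReader` 
interface, the `{offset, length}` array encoding of the original ranges and 
the `attempts` parameter are all hypothetical, chosen only to show the control 
flow.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class MergedRangeRecovery {

  /** Hypothetical positioned read: returns the number of bytes read, or throws. */
  public interface PositionedReader {
    int read(long position, byte[] buffer, int offset, int length) throws IOException;
  }

  /** Read exactly length bytes, restarting the whole range up to attempts times. */
  public static byte[] readWithRetries(PositionedReader in, long offset,
      int length, int attempts) throws IOException {
    IOException last = null;
    for (int i = 0; i < attempts; i++) {
      try {
        byte[] buf = new byte[length];
        int done = 0;
        while (done < length) {
          int n = in.read(offset + done, buf, done, length - done);
          if (n <= 0) {
            throw new IOException("unexpected end of stream");
          }
          done += n;
        }
        return buf;
      } catch (IOException e) {
        last = e; // remember the failure and try again
      }
    }
    throw last != null ? last : new IOException("no read attempts made");
  }

  /**
   * Read a merged range once. Originals which were fully read before any
   * failure are completed from the merged buffer; the rest are reread
   * individually in the calling thread, with retries.
   * Each entry of originals is {offset, length}; results are keyed by offset.
   */
  public static Map<Long, byte[]> readMerged(PositionedReader in, long start,
      int length, long[][] originals, int attempts) throws IOException {
    byte[] merged = new byte[length];
    int done = 0;
    try {
      while (done < length) {
        int n = in.read(start + done, merged, done, length - done);
        if (n <= 0) {
          throw new IOException("unexpected end of stream");
        }
        done += n;
      }
    } catch (IOException e) {
      // Swallowed deliberately: 'done' records how much data arrived.
    }
    Map<Long, byte[]> results = new LinkedHashMap<>();
    for (long[] r : originals) {
      long off = r[0];
      int len = (int) r[1];
      int from = (int) (off - start);
      if (from + len <= done) {
        // Fully covered by the bytes already read: complete it now.
        results.put(off, Arrays.copyOfRange(merged, from, from + len));
      } else {
        // Incomplete: reread this original range on its own.
        results.put(off, readWithRetries(in, off, len, attempts));
      }
    }
    return results;
  }
}
```

   A single unmerged range is just the `readWithRetries` case. The sketch 
rereads the incomplete originals in the calling thread rather than requeueing 
them, matching the "same thread" point above.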
   
   The read failure handling is essentially in my PR, so maybe we can rebase it 
onto this one, merge it in and then pull it up. Goal: analytics stream gets 
vector IO.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

