steveloughran commented on PR #7214: URL: https://github.com/apache/hadoop/pull/7214#issuecomment-2577690001
I'm just setting this up so it is ready for the analytics stream work. Making sure that prefetch is also covered is my way to validate the factory model; it also shows that the options need to include things like asking for a shared thread pool and a stream thread pool, with the intent that analytics will use that too. And once I do that, they all need a single base stream class.

For my vector IO resilience PR, once I have this PR in, I'm going to go back to #7105 and make it something which works with all object input streams:

* probe the stream for being "all in memory"; if so, just do the reads sequentially, no need to parallelize.
* if "partially in memory", give the implementation that list of ranges and have it split them into "all in memory" and "needs retrieval". Again, in-memory blocks can be filled in immediately (needs a lock on removing cache items).
* range coalesce
* sort by largest range first (stops the tail being the bottleneck)
* queue for reading

Read failure:

1. single range: retry
2. merged range: complete successfully read parts
3. and incomplete parts are split into their originals, reread individually in same thread, with retries on them

The read failure stuff is essentially in my PR, so maybe we can rebase onto this, merge in and then pull up.

Goal: analytics stream gets vector IO.
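To make the planning steps concrete, here is a minimal Java sketch of range coalescing, largest-first ordering, and splitting a failed merged read back into the original ranges that still need a reread. It is not code from this PR or from #7105: the class `VectorReadPlanSketch`, its nested types, and the `maxGap` parameter are all hypothetical names chosen for illustration, and the in-memory probe, locking, and retry policy are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: every name here is hypothetical, not a Hadoop API. */
final class VectorReadPlanSketch {

  /** A requested byte range [offset, offset + length). */
  static final class Range {
    final long offset;
    final int length;
    Range(long offset, int length) { this.offset = offset; this.length = length; }
    long end() { return offset + length; }
  }

  /** One merged read which covers one or more of the original ranges. */
  static final class Combined {
    final long offset;
    long end;
    final List<Range> originals = new ArrayList<>();
    Combined(Range first) {
      offset = first.offset;
      end = first.end();
      originals.add(first);
    }
    long length() { return end - offset; }
  }

  /** Sort by offset, then merge neighbours whose gap is at most maxGap bytes. */
  static List<Combined> coalesce(List<Range> ranges, long maxGap) {
    List<Range> sorted = new ArrayList<>(ranges);
    sorted.sort((a, b) -> Long.compare(a.offset, b.offset));
    List<Combined> out = new ArrayList<>();
    Combined current = null;
    for (Range r : sorted) {
      if (current != null && r.offset - current.end <= maxGap) {
        current.end = Math.max(current.end, r.end());
        current.originals.add(r);
      } else {
        current = new Combined(r);
        out.add(current);
      }
    }
    return out;
  }

  /** Order the reads largest first, so the longest read is not left to the tail. */
  static void largestFirst(List<Combined> reads) {
    reads.sort((a, b) -> Long.compare(b.length(), a.length()));
  }

  /**
   * After a merged read fails having delivered bytesRead bytes, return the
   * original ranges which were not fully read and so need individual rereads.
   */
  static List<Range> incomplete(Combined read, long bytesRead) {
    long readUpTo = read.offset + bytesRead;
    List<Range> retry = new ArrayList<>();
    for (Range r : read.originals) {
      if (r.end() > readUpTo) {
        retry.add(r);
      }
    }
    return retry;
  }

  public static void main(String[] args) {
    List<Combined> plan = coalesce(Arrays.asList(
        new Range(10_000, 4_000), new Range(150, 50), new Range(0, 100)), 64);
    largestFirst(plan);
    for (Combined c : plan) {
      System.out.println(c.offset + "-" + c.end + " covers " + c.originals.size() + " range(s)");
    }
  }
}
```

In this sketch, original ranges that lie entirely within the bytes already delivered count as complete; the rest are returned to be reread one by one, which is where per-range retries would apply.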
