[ https://issues.apache.org/jira/browse/HADOOP-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17911085#comment-17911085 ]

ASF GitHub Bot commented on HADOOP-19354:
-----------------------------------------

steveloughran commented on PR #7214:
URL: https://github.com/apache/hadoop/pull/7214#issuecomment-2577690001

   I'm just setting this up so it is ready for the analytics stream work. 
Making sure that prefetch is also covered is my way of validating the factory 
model, and of confirming that the factory options need to include things like 
a shared thread pool and a per-stream thread pool, with the intent that 
analytics will use those too.
   
   And once I do that, they all need a single base stream class.
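   
   A minimal sketch of what such a factory contract and its options might look 
like; all class and method names here are illustrative assumptions, not 
Hadoop's actual API:

```java
// Hedged sketch of a stream factory contract; names are hypothetical.
import java.io.InputStream;
import java.util.concurrent.ExecutorService;

/** Single common base class that every factory-created stream would extend. */
abstract class ObjectStoreInputStream extends InputStream {
}

/** Options passed to the factory, carrying the two thread pools. */
final class StreamFactoryOptions {
  private final ExecutorService sharedPool;  // pool shared across streams
  private final ExecutorService streamPool;  // pool owned by one stream

  StreamFactoryOptions(ExecutorService sharedPool, ExecutorService streamPool) {
    this.sharedPool = sharedPool;
    this.streamPool = streamPool;
  }

  ExecutorService sharedPool() { return sharedPool; }
  ExecutorService streamPool() { return streamPool; }
}

/** The factory itself: each implementation (classic, prefetch, analytics)
 *  would build its stream from the same options. */
interface ObjectInputStreamFactory {
  ObjectStoreInputStream create(StreamFactoryOptions options);
}
```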
   
   For my vector IO resilience PR: once this PR is in, I'm going to go back 
to #7105 and make it something which works with all object input streams:
   
   
   * probe the stream for being "all in memory"; if so just do the reads 
sequentially, no need to parallelize.
   * if "partially in memory", give the implementation the list of ranges and 
have it split them into "all in memory" and "needs retrieval". Again, in-memory 
blocks can be filled in immediately (this needs a lock on removing cache items)
   * range coalesce
   * sort by largest range first (stops the tail being the bottleneck)
   * queue for reading
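   
   The coalesce-and-sort steps above could be sketched like this; the `Range` 
and planner types are hypothetical stand-ins, not Hadoop's actual vector IO 
classes:

```java
// Illustrative read-plan sketch: coalesce near ranges, then sort
// largest-first so the biggest read is not the tail-latency bottleneck.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class Range {
  final long offset;
  final int length;
  Range(long offset, int length) { this.offset = offset; this.length = length; }
}

final class ReadPlanner {
  /** Merge ranges whose gap is at most maxGap bytes, then order the
   *  result largest range first for queuing. */
  static List<Range> plan(List<Range> ranges, int maxGap) {
    List<Range> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingLong((Range r) -> r.offset));
    List<Range> merged = new ArrayList<>();
    for (Range r : sorted) {
      if (!merged.isEmpty()) {
        Range last = merged.get(merged.size() - 1);
        long end = last.offset + last.length;
        if (r.offset - end <= maxGap) {
          // coalesce with the previous range
          merged.set(merged.size() - 1,
              new Range(last.offset,
                  (int) (Math.max(end, r.offset + r.length) - last.offset)));
          continue;
        }
      }
      merged.add(r);
    }
    merged.sort(Comparator.comparingInt((Range r) -> r.length).reversed());
    return merged;
  }
}
```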
   
   Read failure handling:
   1. single range: retry
   2. merged range: complete the successfully read parts
   3. incomplete parts are split back into their original ranges and reread 
individually in the same thread, with retries on each
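   
   That failure policy could be sketched roughly as below; `RangeReader` and 
the retry count are assumptions for illustration, not the actual PR code:

```java
// Hedged sketch: retry a single range in place; on a merged-range
// failure, fall back to rereading each original sub-range with retries,
// in the same thread.
import java.io.IOException;
import java.util.List;

interface RangeReader {
  byte[] read(long offset, int length) throws IOException;
}

final class ResilientReads {
  static final int RETRIES = 2;  // illustrative retry budget

  /** Read one range, retrying on failure. */
  static byte[] readWithRetry(RangeReader reader, long offset, int length)
      throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt <= RETRIES; attempt++) {
      try {
        return reader.read(offset, length);
      } catch (IOException e) {
        last = e;
      }
    }
    throw last;
  }

  /** Try the merged range once; if it fails, split back into the
   *  original {offset, length} pairs and reread each individually. */
  static void readMerged(RangeReader reader, long mergedOffset,
      int mergedLength, List<long[]> originals, List<byte[]> results)
      throws IOException {
    try {
      results.add(reader.read(mergedOffset, mergedLength));
    } catch (IOException e) {
      for (long[] r : originals) {
        results.add(readWithRetry(reader, r[0], (int) r[1]));
      }
    }
  }
}
```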
   
   The read-failure logic is essentially already in my PR, so maybe we can 
rebase it onto this, merge it in and then pull it up. Goal: the analytics 
stream gets vector IO.
   
   




> S3A: InputStreams to be created by factory under S3AStore
> ---------------------------------------------------------
>
>                 Key: HADOOP-19354
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19354
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.4.2
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>
> Migrate S3AInputStream creation into a factory pattern, push down into 
> S3AStore.
> Proposed factories
> * default: whatever this release has as default
> * classic: current S3AInputStream
> * prefetch: prefetching
> * analytics: new analytics stream
> * other: reads a classname from another property and instantiates it.
> Also proposed
> * streams to implement a stream capability declaring what they are 
> (classic, prefetch, analytics, other). 
> h2. Implementation
> All callbacks used by the stream also to call directly onto S3AStore.
> S3AFileSystem must not be invoked at all (if it is needed, the PR is not 
> yet ready).
> Some interface from Instrumentation will be passed to the factory; this 
> shall include a way to create new per-stream 
> The factory shall implement org.apache.hadoop.service.Service; S3AStore shall 
> do the same and become a subclass of CompositeService. It shall attach the 
> factory as a child, so they can follow the same lifecycle. We shall do the 
> same for anything else that gets pushed down.
> Everything related to stream creation must move out of S3AFileSystem, 
> including creation of the factory itself; that must be done in 
> S3AStore.initialize(). 
> As usual, this will complicate mocking. But the streams themselves should not 
> require changes, at least not significant ones.
> h2. Testing
> * The huge file tests should be tuned so that each of them always uses a 
> different stream.
> * use -Dstream="factory name" to choose the factory, rather than -Dprefetch
> * if not set, whatever is in auth-keys gets picked up.
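
The name-to-factory selection the proposal describes could look roughly like
this; the enum and method names are hypothetical, only the factory names
(default, classic, prefetch, analytics, other) come from the proposal:

```java
// Sketch of choosing a factory by configured name; unknown or unset
// names fall back to whatever the release's default is.
enum StreamFactoryKind { DEFAULT, CLASSIC, PREFETCH, ANALYTICS, OTHER }

final class FactorySelector {
  static StreamFactoryKind select(String name) {
    if (name == null || name.isEmpty()) {
      return StreamFactoryKind.DEFAULT;  // nothing set: release default
    }
    switch (name.toLowerCase()) {
      case "classic":   return StreamFactoryKind.CLASSIC;
      case "prefetch":  return StreamFactoryKind.PREFETCH;
      case "analytics": return StreamFactoryKind.ANALYTICS;
      // "other" would read a classname from another property
      case "other":     return StreamFactoryKind.OTHER;
      default:          return StreamFactoryKind.DEFAULT;
    }
  }
}
```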



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
