bilaharith commented on pull request #2548:
URL: https://github.com/apache/hadoop/pull/2548#issuecomment-760379281


   > I fear you are the same mistake we did in the S3A codebase: giving all the 
helper classes a reference back to the ABFS Store class, so making them too 
intermingled. It is unsustainable. See: 
https://github.com/steveloughran/engineering-proposals/blob/trunk/refactoring-s3a.md
   > 
   > Proposed
   > 
   > * Use a specific listing callback which the store can directly/indirectly 
implement: org.apache.hadoop.fs.s3a.impl.ListingOperationCallbacks.
   > * This will also help you write unit tests against a stub implementation 
without having to use reflection and manipulating access modifiers in a way 
that is over-complex and brittle.
   > * Use IOStatistics API to track duration of calls
   > * and serve this up. It is really interesting to know how long lists take
   > * add a close() operator to cancel queue & wait for completion.
   > 
   > the other incremental listX calls (listFileStatus, listLocatedStatus) are 
all similar and should be done in the same uber-JIRA. listLocatedStatus is used 
in LocatedFileStatusFetcher, which is used during MR and spark file scanning. 
listFileStatus should be used more for deep tree scans.
   > 
   > I'd also prefer if you use a shared thread pool of that ABFS store instance
   > 
   > * stops many, many list iterators overloading things
   > * may offer faster startup
   > * custom thread names help identify origin of less-helpful stack traces
   > * could have a gauge on the instance to measure pool load
   > * filesystem.close() can interrupt the pool to shut things down.
   > 
   > Have a look @ org.apache.hadoop.fs.s3a.Listing in trunk to see what's been 
done there. There's a lot you don't need (S3Guard reconciliation), and it 
currently only prefetches the next page of results. But it does include the 
stats collection, a restricted callback to the FS, use of 
org.apache.hadoop.util.functional.RemoteIterators for functional-programming 
style use of RemoteIterator classes. Take a look at it's 
{{org.apache.hadoop.util.functional.RemoteIterators#foreach}} method too.
   > 
   > I know, I'm giving extra homework. But I'll only keep asking for these, so 
better to start now.
   
   I will raise the JIRA as suggested. For now keeping the PR as a draft.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to