holdenk commented on pull request #29179:
URL: https://github.com/apache/spark/pull/29179#issuecomment-663753558


   > > Interesting. Is this specific to the S3A impl or is there a higher base 
class? I want to make it work with multiple file formats if possible.
   > 
   > it's in hadoop common, with an interface IOStatisticsSource which can be 
implemented by anything that feels like it; there's passthrough in the core 
hadoop io stream/compression classes, and the MR located status fetcher 
collates it. IOStatisticsSnapshot is a static snapshot which does aggregation 
and is serializable via java object streams (your code could return it) and 
JSON (the s3a committer will report what it collects).
   > 
   > Although I'm using the S3A codebase to drive that API and the support 
classes, we've been getting ABFS ready for it too; should only take a single 
patch to move it over to this as well.
   > 
   > There's been an API to get counters in the S3A streams for a while, but 
it's private and unstable, so those people who wanted to get at it (impala) 
couldn't safely do so. This gives everyone something public with more things 
collected and aggregation thereof.
   > 
   
   Sounds good, so we can report the file listing times through that interface 
if the provided source supports it.
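
   A rough sketch of what "report the listing time if the source supports it" could look like. This is a language-neutral illustration in Python, not the Hadoop API: `StatisticsSnapshot`, `InstrumentedSource`, `PlainSource`, `retrieve_statistics` and `timed_listing` are all hypothetical stand-ins, and the real interface may look quite different.

```python
import time
from collections import defaultdict


class StatisticsSnapshot:
    """Hypothetical aggregating snapshot: call counts and total durations per key."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def add_duration(self, key, seconds):
        self.counts[key] += 1
        self.total_seconds[key] += seconds

    def aggregate(self, other):
        # Merge another snapshot into this one, e.g. one collected on a worker.
        for key, count in other.counts.items():
            self.counts[key] += count
            self.total_seconds[key] += other.total_seconds[key]


class InstrumentedSource:
    """Hypothetical source that opts in by exposing get_statistics()."""

    def __init__(self):
        self._stats = StatisticsSnapshot()

    def get_statistics(self):
        return self._stats

    def list_files(self, path):
        return [f"{path}/part-0", f"{path}/part-1"]


class PlainSource:
    """Hypothetical source with no statistics support at all."""

    def list_files(self, path):
        return [f"{path}/part-0"]


def retrieve_statistics(source):
    # Return the source's statistics object, or None if it does not offer one.
    getter = getattr(source, "get_statistics", None)
    return getter() if callable(getter) else None


def timed_listing(source, path, clock=time.perf_counter):
    # Time the listing; record the duration only when the source supports it.
    start = clock()
    files = source.list_files(path)
    elapsed = clock() - start
    stats = retrieve_statistics(source)
    if stats is not None:
        stats.add_duration("list_files", elapsed)
    return files
```

   The listing itself behaves the same for both kinds of source; only the reporting step is conditional, so a source without statistics support costs nothing extra beyond the timer.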
   
   > > The idea here is to push it out to the workers (in part for per-host rate 
limiting) but also to match the code we have on the SQL side so we have less 
maintenance cost.
   > 
   > what's doing the throttling here?
   
   Remote service normally. If we wanted we could make the fan-out a separate 
part from the rest of it; although I'd rather keep them together for the 
unified code path.




