steveloughran commented on pull request #33332:
URL: https://github.com/apache/spark/pull/33332#issuecomment-880012155


   You are not going to see this on S3 any more as a result of 404 caching.
   That leaves the possibility that some FileOutputFormat doesn't save output 
(e.g. if there's no data). I don't know if that is the case. (I have vague 
memories of something (Parquet?) adding logic to skip reading 0-byte files 
for footers/data, but can't remember the details.)
   
   
   It might be interesting here to add a counter of how often this happens 
and return that as a stat in the map; that way you'd at least find out from the 
job reports without looking through logs.
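
   A minimal sketch of such a counter, in plain Java rather than the actual 
Spark tracker code. The class name `MissingFileCountingTracker`, the method 
names and the stat keys are all hypothetical; the only point is that a missing 
file bumps a counter that ends up in the returned map instead of only in a log line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical tracker: counts output files that were missing when probed. */
final class MissingFileCountingTracker {
    private long numFiles;
    private long numBytes;
    private long numMissingFiles;

    /** Probe one file that a task reports having written. */
    void fileClosed(Path path) throws IOException {
        try {
            numBytes += Files.size(path);
            numFiles++;
        } catch (NoSuchFileException e) {
            // The file is not there (e.g. nothing was written for empty output):
            // count the event rather than only logging it.
            numMissingFiles++;
        }
    }

    /** Stats map that a job report could surface next to the existing metrics. */
    Map<String, Long> stats() {
        Map<String, Long> stats = new LinkedHashMap<>();
        stats.put("numFiles", numFiles);
        stats.put("numBytes", numBytes);
        stats.put("numMissingFiles", numMissingFiles);
        return stats;
    }
}
```

   A non-zero `numMissingFiles` in the job report would then flag the condition 
without anyone having to search the executor logs.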
   
   Personally, I'd be interested in making the choice of stats tracker for 
Hadoop FS insertion pluggable. The S3A and ABFS input and output 
streams now collect stats on operations (see 
[IOStatistics](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/iostatistics.md)).
 I'm collecting stats on IO done during task commit in the S3A and ABFS 
committers, aggregating them via the .json manifests and including them in the final 
_SUCCESS report. That's handy, but it's not integrated with Spark job reporting.
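
   The aggregation step can be pictured as summing same-named counters across 
the per-task maps. The sketch below is plain Java and does not use the Hadoop 
IOStatistics classes; `StatsAggregator` is a hypothetical name, and the counter 
names in the usage note are illustrative only.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: merges per-task counter maps into one job-level map. */
final class StatsAggregator {
    /** Sum counters with the same name across all per-task maps. */
    static Map<String, Long> aggregate(List<Map<String, Long>> perTask) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> task : perTask) {
            // merge() inserts the value if the key is new, otherwise adds to it.
            task.forEach((name, value) -> total.merge(name, value, Long::sum));
        }
        return total;
    }
}
```

   For example, two task maps `{op_rename=2, throttled=1}` and `{op_rename=3}` 
would aggregate to `{op_rename=5, throttled=1}`.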
   
   What I would really like to do is collect stats from the read and write 
streams used in a single task and report them, so you'd get details such as the 
number of throttle events and the amount of data discarded in seeks as the job 
went along. That implies:
   * thread-level collection of stats from all streams created in individual 
worker threads
   * committers to retrieve and report them
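
   One way to picture the thread-level part is a per-thread context that every 
stream captures when it is created, and that a committer snapshots later. This is 
a sketch under assumptions, not existing Hadoop or Spark code: 
`ThreadStatsContext`, `TrackedStream` and the counter names are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical per-thread statistics context. */
final class ThreadStatsContext {
    private static final ThreadLocal<ThreadStatsContext> CURRENT =
        ThreadLocal.withInitial(ThreadStatsContext::new);

    // Concurrent structures, since a stream may later be used from another thread.
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    /** Context of the calling thread; a stream captures this at creation time. */
    static ThreadStatsContext current() {
        return CURRENT.get();
    }

    /** Drop the calling thread's context, e.g. when a new task starts on it. */
    static void reset() {
        CURRENT.remove();
    }

    void increment(String name, long delta) {
        counters.computeIfAbsent(name, k -> new LongAdder()).add(delta);
    }

    /** Point-in-time copy that a committer could retrieve and report. */
    Map<String, Long> snapshot() {
        Map<String, Long> copy = new TreeMap<>();
        counters.forEach((name, adder) -> copy.put(name, adder.sum()));
        return copy;
    }
}

/** Stand-in for an input or output stream that reports into its creator's context. */
final class TrackedStream {
    private final ThreadStatsContext stats = ThreadStatsContext.current();

    void seekDiscarding(long bytes) {
        stats.increment("bytes_discarded_in_seek", bytes);
    }

    void throttled() {
        stats.increment("throttle_events", 1);
    }
}
```

   Binding at creation time means the stats stay attributed to the worker thread 
that opened the stream, even if the stream is handed to another thread.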
   
   Even without that, if I could plug in a new stats reporter, I'd be able to 
collect and report stats back from the manifest and s3a committers today.
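
   As a rough illustration of "pluggable", a reporter could be chosen by class 
name and instantiated reflectively. The `StatsReporter` interface, 
`NoopStatsReporter` and `StatsReporters.load` are hypothetical names invented for 
this sketch; no such Spark option or interface is implied to exist.

```java
import java.util.Map;

/** Hypothetical plug-in point: anything that can hand back a map of stats. */
interface StatsReporter {
    Map<String, Long> report();
}

/** Default implementation that reports nothing. */
final class NoopStatsReporter implements StatsReporter {
    @Override
    public Map<String, Long> report() {
        return Map.of();
    }
}

/** Hypothetical loader: picks the reporter from a configured class name. */
final class StatsReporters {
    static StatsReporter load(String className) throws ReflectiveOperationException {
        return Class.forName(className)
            .asSubclass(StatsReporter.class) // fails fast if the class is the wrong type
            .getDeclaredConstructor()        // requires a no-argument constructor
            .newInstance();
    }
}
```

   A committer-aware implementation could then return whatever it had read back 
from the manifests, without the caller needing to know about it.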


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


