Hadoop 2.8's s3a adds a lot more metrics here, most of which you can also get on HDP-2.5 if you can grab those JARs. Everything comes out as Hadoop JMX metrics, and is also readable & aggregatable through a call to FileSystem.getStorageStatistics.
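From a Spark shell you can dump them all; a minimal sketch (untested, and the bucket name here is made up):

```scala
import java.net.URI
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileSystem

// Bind to the filesystem instance for the bucket (name is hypothetical).
val fs = FileSystem.get(new URI("s3a://your-bucket/"), sc.hadoopConfiguration)

// Iterate over every long counter the FS publishes: bytes read/written,
// GET/PUT/HEAD request counts, stream seek/abort statistics, etc.
fs.getStorageStatistics.getLongStatistics.asScala.foreach { stat =>
  println(s"${stat.getName} = ${stat.getValue}")
}
```

FileSystem.get() normally returns the JVM's cached instance, so this picks up the stats accumulated by work already done in that process; stats for work done on executors live in the executor JVMs.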
Measuring IO time isn't something the metrics pick up, because it's genuinely hard to measure in a multithreaded world: you can't just add up the seconds each thread spends talking to S3, as that can make the numbers look unduly bad. I know because we did try adding up upload times & bytes uploaded to get a bandwidth metric, and it turned out to be fairly misleading. If 10 threads each took 60s to upload a megabyte of data, summing the per-thread times says 10 MB took 10 minutes of upload time, i.e. a bandwidth of 1 MB/minute. If those threads were actually running in parallel, the whole 10 MB went up in one minute: 10 MB/minute, 10x as fast.

FWIW the main bottlenecks in s3a perf are:

1. Time for metadata operations. This is shockingly bad:
http://steveloughran.blogspot.co.uk/2016/12/how-long-does-filesystemexists-take.html
As well as code improvements up the stack, you can help here by not having deep directory structures for partitioning; prefer wider trees.

2. Cost of re-opening the HTTPS connection after a forward/backward seek. Hadoop 2.8's s3a does a lot of work here to improve things: it reads forward many KB before aborting/restarting the connection, and there's an fadvise=random option for max perf on columnar storage formats (ORC, Parquet).

3. How s3a waits until close() before uploading data. The fs.s3a.fast.upload=true option addresses this, but it's pretty brittle in Hadoop 2.7.x, as it buffers on-heap and runs out of memory if code generates data faster than the upload bandwidth; 2.8 can buffer to hard disk instead. (A config sketch for this and the fadvise option follows below the quoted message.)

4. Time to commit data. This is an O(data) copy server side, at 6-10 MB/s, so it needs a committer which doesn't do renames. The Spark 1.6 DirectOutputCommitter avoided the rename, but it couldn't handle failure & retry. Future committers will.

-Steve

________________________________
From: Gili Nachum <gilinac...@gmail.com>
Sent: 13 February 2017 06:55
To: user@spark.apache.org
Subject: How to measure IO time in Spark over S3

Hi!
How can I tell the IO duration for a Spark application doing R/W from S3 (using S3 as a filesystem: sc.textFile("s3a://..."))?
I would like to know the % of time doing IO out of the overall app execution time.

Gili.
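[Config sketch promised above, for points 2 & 3. Untested; the property names are the Hadoop 2.8 ones, and the values are illustrative rather than recommendations.]

```scala
// Sketch only: tune s3a input seek policy and output buffering (Hadoop 2.8+).
// Set these on the Hadoop configuration before the filesystem is first opened.
val conf = sc.hadoopConfiguration

// 2. Random IO mode: don't read ahead & abort on every backward seek.
//    Best for columnar formats (ORC, Parquet); worse for plain sequential scans.
conf.set("fs.s3a.experimental.input.fadvise", "random")

// 3. Upload blocks incrementally during the write, rather than waiting for close().
conf.set("fs.s3a.fast.upload", "true")
// Buffer pending blocks on local disk rather than on-heap (2.8 only).
conf.set("fs.s3a.fast.upload.buffer", "disk")
```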