[
https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623834#comment-14623834
]
Abhishek Modi commented on SPARK-9004:
--------------------------------------
Hadoop separates HDFS bytes, local filesystem bytes and S3 bytes into distinct counters. Spark combines all of them in its metrics. Separating them could give a better
idea of the I/O distribution.
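For context, Hadoop's FileSystem layer already keeps byte statistics per scheme; a minimal sketch of reading them (illustrative only, not existing Spark code):
{code:scala}
// Illustrative sketch: Hadoop tracks FileSystem statistics per scheme,
// so hdfs/file/s3 bytes can be reported separately instead of being summed.
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

def bytesByScheme(): Map[String, (Long, Long)] =
  FileSystem.getAllStatistics.asScala
    .groupBy(_.getScheme)                 // e.g. "hdfs", "file", "s3n"
    .mapValues { ss =>
      (ss.map(_.getBytesRead).sum, ss.map(_.getBytesWritten).sum)
    }
    .toMap
{code}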
Here's how it works in MR:
1. The client creates a Job object (org.apache.hadoop.mapreduce.Job) and submits it to
the RM, which then launches the AM, etc.
2. After submission, the client continuously monitors the job until it is
finished.
3. Once the job is finished, the client retrieves the job's counters via
getCounters().
4. The client then logs them using the "Counters=" format (see the sketch below).
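A rough sketch of that client-side flow using the standard mapreduce API (job setup elided):
{code:scala}
// Rough sketch of the MR client flow above (mapper/reducer and I/O setup elided).
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import scala.collection.JavaConverters._

val job = Job.getInstance(new Configuration(), "example-job")  // step 1: create the Job
// ... set mapper/reducer classes, input/output paths, etc. ...
job.waitForCompletion(true)                                    // steps 1-2: submit and monitor until done

val counters = job.getCounters                                 // step 3: fetch counters of the finished job
for (group <- counters.asScala; counter <- group.asScala) {
  // step 4: log each counter; this is where the per-scheme FileSystemCounter
  // entries (FILE/HDFS/S3 bytes read and written) show up
  println(s"${group.getName}.${counter.getName}=${counter.getValue}")
}
{code}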
I don't really know how to implement this in Spark. Could it be done by modifying
NewHadoopRDD, since I guess that's where the Job object is being used?
> Add s3 bytes read/written metrics
> ---------------------------------
>
> Key: SPARK-9004
> URL: https://issues.apache.org/jira/browse/SPARK-9004
> Project: Spark
> Issue Type: Improvement
> Reporter: Abhishek Modi
> Priority: Minor
>
> s3 read/write metrics can be pretty useful in finding the total aggregate
> data processed