Github user kayousterhout commented on a diff in the pull request:
https://github.com/apache/spark/pull/962#discussion_r14213126
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala ---
@@ -67,6 +67,12 @@ class TaskMetrics extends Serializable {
var diskBytesSpilled: Long = _
/**
+ * If this task reads from a HadoopRDD, from cached data, or from a parallelized collection,
--- End diff ---
Yeah, I wasn't totally sure what the right thing to do in that case was, but
I eventually settled on what you said: since we're just reading back what we
put in, it doesn't make sense to add it to the input size.
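To make that concrete, here's a minimal, self-contained Scala sketch of the rule (not the actual CacheManager/TaskMetrics code; `SketchCacheManager`, `InputMetrics`, and the simplified `TaskMetrics` below are stand-ins): bytes count toward input metrics only when a partition is served from an existing cache entry, not when a task reads back bytes it just wrote.

```scala
import scala.collection.mutable

// Simplified stand-ins for Spark's metrics classes (not the real API).
case class InputMetrics(var bytesRead: Long = 0L)
class TaskMetrics {
  var inputMetrics: Option[InputMetrics] = None
}

class SketchCacheManager {
  private val store = mutable.Map.empty[String, Array[Byte]]

  /** Return the bytes for `key`, computing and caching them on a miss. */
  def getOrCompute(key: String, compute: () => Array[Byte], metrics: TaskMetrics): Array[Byte] = {
    store.get(key) match {
      case Some(bytes) =>
        // Cache hit: the task genuinely read this data as input, so record it.
        val im = metrics.inputMetrics.getOrElse(InputMetrics())
        im.bytesRead += bytes.length
        metrics.inputMetrics = Some(im)
        bytes
      case None =>
        // Cache miss: compute the partition and store it.
        val bytes = compute()
        store(key) = bytes
        // Deliberately do NOT update inputMetrics here: we're just reading
        // back what we put in, so it shouldn't count toward the input size.
        store(key)
    }
  }
}
```

With this sketch, the first getOrCompute call for a key leaves bytesRead untouched, while a later call for the same key adds the cached block's size, which matches the behavior you described for CacheManager.scala#L130.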
On Wed, Jun 25, 2014 at 2:15 PM, andrewor14 <[email protected]>
wrote:
> In core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:
>
> > @@ -67,6 +67,12 @@ class TaskMetrics extends Serializable {
> > var diskBytesSpilled: Long = _
> >
> > /**
> > + * If this task reads from a HadoopRDD, from cached data, or from a parallelized collection,
>
> Oh hm, looks like in the DISK_ONLY case for block manager we don't set
> the input bytes (I'm referring to CacheManager.scala#L130
>
> <https://github.com/kayousterhout/spark-1/blob/read_metrics/core/src/main/scala/org/apache/spark/CacheManager.scala#L130>),
> which I guess is correct because we're just reading back the bytes we just
> put in.
>
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/962/files#r14212743>.
>