Patrick Wendell created SPARK-4092:
--------------------------------------

             Summary: Input metrics don't work for coalesce()'d RDDs
                 Key: SPARK-4092
                 URL: https://issues.apache.org/jira/browse/SPARK-4092
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
            Reporter: Patrick Wendell
            Priority: Critical


In every case where we set input metrics (from both Hadoop and block storage), 
we currently assume that exactly one input partition is computed within the 
task. This is not a correct assumption in the general case. The main example in 
the current API is coalesce(), but user-defined RDDs could also be affected.
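As a hedged illustration of the problem (the names `TaskMetrics`, `set_input_metrics`, and `run_coalesced_task` are hypothetical stand-ins, not Spark's actual internals): if each partition computed inside a task unconditionally *sets* the task-level input metrics, then everything but the last partition's contribution is silently lost.

```python
# Hypothetical model of the bug: each input partition computed within a
# task unconditionally overwrites the task-level input metrics.

class TaskMetrics:
    def __init__(self):
        self.input_metrics = None  # (read_method, bytes_read)

    def set_input_metrics(self, read_method, bytes_read):
        # Assumes exactly one input partition per task -- the buggy assumption.
        self.input_metrics = (read_method, bytes_read)

def run_coalesced_task(task_metrics, partition_sizes):
    # A coalesce()'d task computes several parent partitions in one task.
    for size in partition_sizes:
        task_metrics.set_input_metrics("Hadoop", size)

metrics = TaskMetrics()
run_coalesced_task(metrics, [100, 250, 50])
# Only the last partition's bytes survive: ("Hadoop", 50), not 400 total.
print(metrics.input_metrics)
```

The reported bytes read end up reflecting one partition out of several, so the UI under-counts input for any coalesced stage.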

To deal with the most general case, we would need to support the notion of a 
single task having multiple input sources. A more surgical and less general fix 
is to simply go to HadoopRDD and check whether inputMetrics of the same "type" 
are already defined for the task. If they are, merge in the new data rather 
than blowing away the old values.
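A hedged sketch of that surgical fix, using the same hypothetical names as above (not the real HadoopRDD code): before installing fresh input metrics, look for existing metrics of the same read method and add to them instead of replacing them.

```python
class TaskMetrics:
    def __init__(self):
        self.input_metrics = None  # (read_method, bytes_read)

    def merge_input_metrics(self, read_method, bytes_read):
        # If inputMetrics of the same "type" (read method) already exist
        # for the task, merge in the new data rather than blowing away
        # the old value; otherwise install them fresh.
        if self.input_metrics and self.input_metrics[0] == read_method:
            prior = self.input_metrics[1]
            self.input_metrics = (read_method, prior + bytes_read)
        else:
            self.input_metrics = (read_method, bytes_read)

metrics = TaskMetrics()
for size in [100, 250, 50]:  # three coalesced Hadoop partitions in one task
    metrics.merge_input_metrics("Hadoop", size)
print(metrics.input_metrics)  # all partitions counted: ("Hadoop", 400)
```

Keying the merge on the read method is what keeps this fix narrow: same-type inputs accumulate, while the mixed-source case discussed below is deliberately left out of scope.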

This wouldn't cover the case where, e.g., a single task has input from both 
on-disk and in-memory blocks. It _would_ cover the case where someone calls 
coalesce() on a HadoopRDD, which is more common.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
