Patrick Wendell created SPARK-4092:
--------------------------------------
Summary: Input metrics don't work for coalesce()'d RDD's
Key: SPARK-4092
URL: https://issues.apache.org/jira/browse/SPARK-4092
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Patrick Wendell
Priority: Critical
In every case where we set input metrics (from both Hadoop and block storage)
we currently assume that exactly one input partition is computed within the
task. This is not a correct assumption in the general case. The main example in
the current API is coalesce(), but user-defined RDDs could also be affected.

To deal with the most general case, we would need to support the notion of a
single task having multiple input sources. A more surgical and less general fix
is to simply go to HadoopRDD and check whether inputMetrics of the same "type"
are already defined for the task. If they are, merge in the new data rather
than blowing away the old values.

This wouldn't cover the case where, e.g., a single task has input from both
on-disk and in-memory blocks. It _would_ cover the case where someone calls
coalesce() on a HadoopRDD, which is the more common scenario.
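A rough sketch of the merge-on-same-type idea, in Python pseudocode rather than Spark's actual Scala internals; the class and method names here (TaskInputMetrics, record_bytes_read) are hypothetical and for illustration only:

```python
# Hypothetical sketch (not Spark's real API): accumulate per-task input
# metrics keyed by read method ("Hadoop", "Memory", ...). New readings for
# an existing method are merged in instead of replacing the old value.

class TaskInputMetrics:
    def __init__(self):
        self._bytes_by_method = {}  # read method -> total bytes read

    def record_bytes_read(self, method, bytes_read):
        # Merge with any metrics already recorded for this read method,
        # so a task that computes several input partitions (e.g. after
        # coalesce()) reports the sum rather than only the last partition.
        self._bytes_by_method[method] = (
            self._bytes_by_method.get(method, 0) + bytes_read
        )

    def bytes_read(self, method):
        return self._bytes_by_method.get(method, 0)

# A task that reads three coalesced Hadoop partitions:
metrics = TaskInputMetrics()
for partition_bytes in (1024, 2048, 512):
    metrics.record_bytes_read("Hadoop", partition_bytes)

print(metrics.bytes_read("Hadoop"))  # 3584, not just the last 512
```

Under the one-partition-per-task assumption the bug report describes, each new partition would overwrite the previous value and the task would report only 512 bytes here; merging preserves the full 3584.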
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)