Charles Reiss created SPARK-4157:
------------------------------------
Summary: Task input statistics incomplete when a task reads from
multiple locations
Key: SPARK-4157
URL: https://issues.apache.org/jira/browse/SPARK-4157
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.1.0
Reporter: Charles Reiss
Priority: Minor
SPARK-1683 introduced tracking of filesystem reads for tasks, but the tracking
code assumes that each task reads from exactly one file/cache block, and
replaces any prior InputMetrics object for a task after each read.
For example, a task computing a shuffle-less join (the input RDDs are already
partitioned by key) may read two or more cached blocks, one from each
dependency RDD. In that case, the displayed input size reflects only whichever
dependency was read last, not the total across all of them.
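
The effect can be illustrated with a small standalone model. This is only a
sketch in plain Python, not Spark's actual code: the TaskMetrics class, its
two record_* methods, and the byte counts are hypothetical stand-ins chosen
for illustration, and the accumulating variant is one possible alternative,
not necessarily the fix that should be adopted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputMetrics:
    """Simplified stand-in for a task's input metrics (not Spark's class)."""
    read_method: str  # illustrative label, e.g. "Memory" or "Hadoop"
    bytes_read: int = 0


class TaskMetrics:
    """Hypothetical holder for the metrics of a single task."""

    def __init__(self) -> None:
        self.input_metrics: Optional[InputMetrics] = None

    def record_read_replacing(self, read: InputMetrics) -> None:
        # Behaviour described in this issue: each read overwrites the
        # previously recorded InputMetrics object.
        self.input_metrics = read

    def record_read_accumulating(self, read: InputMetrics) -> None:
        # Assumed alternative: add the new bytes to what is already recorded.
        if self.input_metrics is None:
            self.input_metrics = InputMetrics(read.read_method, read.bytes_read)
        else:
            self.input_metrics.bytes_read += read.bytes_read


# A task joining two cached, co-partitioned RDDs reads one block from each.
# The sizes are made up for the example.
reads = [InputMetrics("Memory", 4096), InputMetrics("Memory", 1024)]

replacing = TaskMetrics()
accumulating = TaskMetrics()
for r in reads:
    replacing.record_read_replacing(r)
    accumulating.record_read_accumulating(r)

print(replacing.input_metrics.bytes_read)     # 1024: only the last block
print(accumulating.input_metrics.bytes_read)  # 5120: both blocks
```

Note that the accumulating variant keeps the read method of the first read
only, so a task that mixes read methods (for example, one cached block and one
block read from the filesystem) would still need some additional handling.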
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)