[ 
https://issues.apache.org/jira/browse/SPARK-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190974#comment-14190974
 ] 

Kostas Sakellis commented on SPARK-4092:
----------------------------------------

The current patch I'm working on does the simplest thing to address this issue: 
in the Hadoop RDDs and the cache manager, if the task already has input metrics 
with the same read method, we increment them instead of overriding them. This 
simple solution should handle the common case of:
{code}
sc.textFile(..).coalesce(5).collect()
{code}
In addition, it will cover the situation where blocks are coming from the cache. 
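
To make the merge-instead-of-override idea concrete, here is a rough sketch. The names ({{updateInputMetrics}}, {{inputMetrics}} as an Option, the {{ReadMethod}} field) are illustrative only, not the actual Spark internals:
{code}
// Hypothetical sketch, not the real Spark API: when a task computes
// several partitions (e.g. after coalesce()), accumulate bytes read
// for the same read method rather than replacing the metrics object.
def updateInputMetrics(context: TaskContext,
                       bytesRead: Long,
                       method: ReadMethod.Value): Unit = {
  context.inputMetrics match {
    case Some(existing) if existing.readMethod == method =>
      // Same read method: increment, so coalesce()'d partitions add up.
      existing.bytesRead += bytesRead
    case _ =>
      // First partition, or a different read method: (over)write as today.
      context.inputMetrics = Some(new InputMetrics(method, bytesRead))
  }
}
{code}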

What it won't handle properly is a coalesce (or another RDD with similar 
properties) that reads from multiple blocks with mixed read methods (memory vs. 
Hadoop). In that case, one input metric will override the other. We have 
several options:
# We create a MIXED readMethod and, if we see input metrics from different 
methods, change to MIXED. This loses some information, because we no longer 
know where the blocks were read from.
# We store multiple inputMetrics per TaskContext. Up the stack (e.g. 
JsonProtocol) we send the array of InputMetrics to the caller. We have to worry 
about backwards compatibility in that case, so we can't just remove the single 
inputMetric; we might have to send back a MIXED metric and, in addition, the 
array for newer clients.
# We punt on this issue for now, since it can be argued that it is not common.
What are people's thoughts on this?
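
For option 1, the merge rule could be as small as this. Again, the enum values and helper name are hypothetical, just to show the information loss involved:
{code}
// Illustrative only: a MIXED value added to the read-method enumeration.
object ReadMethod extends Enumeration {
  val Memory, Disk, Hadoop, Network, Mixed = Value
}

// If two partitions in the same task were read via different methods,
// collapse to Mixed -- per-block provenance is lost at that point.
def mergeReadMethod(existing: ReadMethod.Value,
                    incoming: ReadMethod.Value): ReadMethod.Value =
  if (existing == incoming) existing else ReadMethod.Mixed
{code}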



> Input metrics don't work for coalesce()'d RDD's
> -----------------------------------------------
>
>                 Key: SPARK-4092
>                 URL: https://issues.apache.org/jira/browse/SPARK-4092
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Kostas Sakellis
>            Priority: Critical
>
> In every case where we set input metrics (from both Hadoop and block storage) 
> we currently assume that exactly one input partition is computed within the 
> task. This is not a correct assumption in the general case. The main example 
> in the current API is coalesce(), but user-defined RDDs could also be 
> affected.
> To deal with the most general case, we would need to support the notion of a 
> single task having multiple input sources. A more surgical and less general 
> fix is to simply go to HadoopRDD and check if there are already inputMetrics 
> defined for the task with the same "type". If there are, then merge in the 
> new data rather than blowing away the old one.
> This wouldn't cover the case where, e.g., a single task has input from both 
> on-disk and in-memory blocks. It _would_ cover the case where someone calls 
> coalesce on a HadoopRDD... which is more common.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
