Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3120#discussion_r22777883
  
    --- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
    @@ -44,7 +44,14 @@ private[spark] class CacheManager(blockManager: BlockManager) extends Logging {
         blockManager.get(key) match {
           case Some(blockResult) =>
             // Partition is already materialized, so just return its values
    +        val existingMetrics = context.taskMetrics.inputMetrics
    +        val prevBytesRead = existingMetrics
    +          .filter(_.readMethod == blockResult.inputMetrics.readMethod)
    +          .map(_.bytesRead)
    +          .getOrElse(0L)
    --- End diff --
    
    So what happens if we have input types that intermix here? For instance,
    what if reads interleave between two input sources... will they just keep
    clobbering each other? It might be better to just choose a single input
    metric and stick with it, i.e. if we happen to be reading a block that
    wasn't derived from the same input as the one before it, just ignore it.
    
    ```
            val blockInput = blockResult.inputMetrics
            context.taskMetrics.inputMetrics match {
              case Some(existingInput) =>
                if (existingInput.readMethod == blockInput.readMethod) {
                  existingInput.bytesRead += blockInput.bytesRead
                }
                // NOTE: If we have interleaving of two input types in one task, we currently
                //       ignore blocks associated with all but one type (whichever type was
                //       seen first). See SPARK-XXX.
              case None =>
                context.taskMetrics.inputMetrics = Some(blockInput)
            }
    ```
    
    That behavior is easier to document, and it is also easy to add a unit test for it.
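
To make the "first input type wins" policy concrete, here is a minimal, self-contained sketch of it together with the kind of check a unit test could make. The `ReadMethod`, `InputMetrics`, and `TaskMetrics` types and the `InputMetricsMerge` helper are hypothetical stand-ins written for this illustration, not Spark's actual classes; only the `match` logic mirrors the snippet above.

```scala
// Hypothetical stand-ins for the metrics types; NOT Spark's real classes.
object ReadMethod extends Enumeration {
  val Memory, Disk, Hadoop = Value
}

final class InputMetrics(val readMethod: ReadMethod.Value) {
  var bytesRead: Long = 0L
}

final class TaskMetrics {
  var inputMetrics: Option[InputMetrics] = None
}

object InputMetricsMerge {
  // Folds one block's metrics into the task's metrics.
  // The first read method seen by the task wins.
  def record(task: TaskMetrics, block: InputMetrics): Unit =
    task.inputMetrics match {
      case Some(existing) =>
        // Accumulate only when the block came from the same kind of input;
        // blocks with a different read method are deliberately ignored.
        if (existing.readMethod == block.readMethod) {
          existing.bytesRead += block.bytesRead
        }
      case None =>
        // First block seen by this task: adopt its metrics as-is.
        task.inputMetrics = Some(block)
    }

  // Convenience constructor for the example (hypothetical helper).
  def metrics(method: ReadMethod.Value, bytes: Long): InputMetrics = {
    val m = new InputMetrics(method)
    m.bytesRead = bytes
    m
  }
}
```

Under these assumptions, a test could feed the helper a memory read, then a read of a different type, then another memory read, and assert that only the two memory reads were counted and that the recorded read method never changed.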

