Github user squito commented on the issue:

    https://github.com/apache/spark/pull/21212
  
    This makes sense to me.  You should update the comment on 
`MapOutputTracker.getMapSizesByExecutorId` to mention that it excludes 
zero-sized blocks, and also remove the now-redundant filter in 
`ShuffleBlockFetcherIterator` and fix up the related comments there to say 
where the filtering happens.
    
    Curious: have you observed this using a lot of memory, from a heap dump or 
anything?  I think it's a pretty reasonable change either way; I'd just like to 
know.
    
    A couple more memory improvements you could throw in here as well:
    
    1. 
[`statuses.zipWithIndex`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L863)
 will create a copy of the array, which is pointless.  You could either switch 
to `statuses.iterator.zipWithIndex` or just iterate over the indices manually.
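    To illustrate the difference, here is a minimal sketch; `Status` is a hypothetical stand-in for `MapStatus`, reduced to the one field that matters here:

```scala
// Hypothetical stand-in for MapStatus: just the per-reducer sizes.
case class Status(sizes: Array[Long])

val statuses = Array(
  Status(Array(0L, 10L)),
  Status(Array(5L, 0L)),
  Status(Array(0L, 0L))
)

// statuses.zipWithIndex would materialize a whole new Array[(Status, Int)];
// statuses.iterator.zipWithIndex pairs each element with its index lazily,
// with no intermediate copy.
val nonEmptyIdx = statuses.iterator.zipWithIndex
  .filter { case (s, _) => s.sizes.exists(_ > 0) }
  .map { case (_, i) => i }
  .toList
// nonEmptyIdx: List(0, 1)
```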
    
    2. the final 
[`splitsByAddress.toSeq`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L876)
 creates a copy; you could change the return type so that it's not needed.  It 
could return a map, or, if we really want to make sure it's not modified, I 
think you could have it return an `Iterator` instead of a `Seq`, so that you 
could just call `splitsByAddress.iterator`.
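    Roughly what I mean, with placeholder types standing in for the real block-manager IDs and block entries:

```scala
import scala.collection.mutable.{ArrayBuffer, HashMap}

// Placeholder shape for splitsByAddress: location -> (blockId, size) pairs.
val splitsByAddress = HashMap(
  "exec-1" -> ArrayBuffer(("shuffle_0_0_0", 10L)),
  "exec-2" -> ArrayBuffer(("shuffle_0_1_0", 5L))
)

// .toSeq copies every entry into a fresh Seq just to hand back a frozen
// result; exposing an Iterator instead is a read-only, zero-copy view.
def asIterator: Iterator[(String, ArrayBuffer[(String, Long)])] =
  splitsByAddress.iterator
```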
    
    3. I wonder if using an `ArrayBuffer` is the smartest choice; it may have 
a lot of wasted space from the way it grows.  It's also only iterated over, so 
we don't need fast random access; perhaps a `List` or `Vector` might be better. 
 But this one I'm less sure about.
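    A quick sketch of the trade-off (hedged: whether the growth padding matters here depends on the actual block counts per address):

```scala
import scala.collection.mutable.{ArrayBuffer, ListBuffer}

// ArrayBuffer grows by doubling its backing array, so after the final
// append up to half the allocated slots can sit empty as padding.
val buf = ArrayBuffer.empty[Long]
(1L to 5L).foreach(buf += _)

// A ListBuffer builds a List with exactly one cons cell per element and
// constant-time append; forward-only iteration is all this path needs.
val lb = ListBuffer.empty[Long]
(1L to 5L).foreach(lb += _)
val asList = lb.toList
```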
    
    To be clear, this isn't changing network traffic at all: this all happens 
after the executor has received the `MapStatus`, and it's just doing some 
data-structure manipulation internal to the executor.

