Github user squito commented on the issue:
https://github.com/apache/spark/pull/21212
This makes sense to me. You should update the comment on
`MapOutputTracker.getMapSizesByExecutorId` to mention that it excludes the
zero-sized blocks, and also remove the filter in `ShuffleBlockFetcherIterator`
and fix up the related comments there to mention where the filtering is
happening.
Curious, have you observed this using a lot of memory in a heap dump or
anything? I think it's a pretty reasonable change either way, I'd just like to
know.
A couple more memory improvements you could throw in here as well:
1.
[`statuses.zipWithIndex`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L863)
creates a copy of the array, which is pointless. You could either switch
to `statuses.iterator.zipWithIndex` or just manually iterate over the indices.
2. The final [`splitsByAddress.toSeq`](
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L876)
creates a copy; you could change the types so that it's not needed. It could
return a map, or if we really want to make sure it's not modified, I think you
could have it return an `Iterator` instead of a `Seq`, so that you could just
call `splitsByAddress.iterator`.
3. I wonder if using an `ArrayBuffer` is the smartest choice -- it may have a
ton of wasted space from the way it grows. It's also only iterated over, so we
don't need fast random access; perhaps a `List` or `Vector` might be better.
But this one I'm less sure about.
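To sketch what points 1-3 could look like together, here is a simplified
stand-in for the grouping loop -- the type signatures and names below are
illustrative only (a `String` address in place of `BlockManagerId`, plain
`(address, size)` pairs in place of `MapStatus`), not the actual
`MapOutputTracker` code:

```scala
import scala.collection.mutable.{HashMap, ListBuffer}

// Hypothetical simplified version of the per-address grouping.
def groupByAddress(
    statuses: Array[(String, Long)]): Iterator[(String, List[(Int, Long)])] = {
  val splitsByAddress = HashMap[String, ListBuffer[(Int, Long)]]()
  // (1) statuses.iterator.zipWithIndex pairs each element with its index
  //     lazily, instead of materializing a copy of the whole array.
  for (((address, size), mapIndex) <- statuses.iterator.zipWithIndex) {
    // Zero-sized blocks are dropped here rather than downstream.
    if (size > 0) {
      // (3) ListBuffer appends in O(1) without over-allocating the way a
      //     growing ArrayBuffer can; the result is only ever iterated.
      splitsByAddress.getOrElseUpdate(address, ListBuffer()) += ((mapIndex, size))
    }
  }
  // (2) Returning an Iterator instead of calling .toSeq avoids one more
  //     copy while still keeping callers from mutating the backing map.
  splitsByAddress.iterator.map { case (addr, buf) => (addr, buf.toList) }
}
```

Callers that only loop over the result are unaffected by the `Iterator` return
type; anyone who needs a materialized collection can still call `.toMap` or
`.toSeq` themselves.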
To be clear, this isn't changing network traffic at all -- this is after
the executor has received the `MapStatus`, and it's just doing some data
structure manipulation internal to the executor.