Github user jinxing64 commented on the issue:
https://github.com/apache/spark/pull/18388
    Previously I said I had 200k+ connections to one shuffle service. I'm
sorry, that information was wrong. It turns out that each of our `NodeManager`s
runs two auxiliary shuffle services, one for Spark and one for Hadoop
MapReduce, and most of the connections belong to the Hadoop MapReduce shuffle
service. Analyzing the heap dump, there are only 761 connections
(`NioSocketChannel`s) to the Spark shuffle service. (How did I find this? The
Spark shuffle service uses Netty 4 for transferring blocks. I found tons of
`org.jboss.netty.channel.socket.nio.NioAcceptedSocketChannel` instances, and
checking the Netty code, that class only exists in Netty 3, which is what our
Hadoop uses.) So @zsxwing, in my understanding there is no connection leak.
    The situation is that we have 10K map tasks that ran on the same shuffle
service and around 1K reduce tasks fetching the blocks. On the Java heap I
found that one `io.netty.channel.ChannelOutboundBuffer.Entry` (referencing one
block) costs almost 1KB, and we have 3.5M entries. At the time of the OOM, the
`io.netty.channel.ChannelOutboundBuffer.Entry`s were retaining about 3GB.
    So the problem here is that one connection is fetching too many blocks. I
believe tuning `spark.reducer.maxReqsInFlight` or
`spark.reducer.maxBlocksInFlightPerAddress` (#18487) can alleviate this issue.
The question is how to set them appropriately. That seems hard, because we need
to balance warehouse performance against stability. After all, only 2~3
NodeManagers are running into OOM, so we cannot set
`spark.reducer.maxReqsInFlight` too small without degrading performance for
everyone else.
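    For reference, setting these caps on the reducer side could look like the
sketch below. The values here are purely illustrative assumptions; the right
numbers depend on the workload and cluster, which is exactly the tuning problem
described above.

```scala
import org.apache.spark.SparkConf

// Illustrative values only; not a recommendation.
val conf = new SparkConf()
  // Cap the number of outstanding fetch requests a reducer keeps in flight.
  .set("spark.reducer.maxReqsInFlight", "16")
  // Cap the number of blocks fetched concurrently from one shuffle-service
  // address (the config added by #18487).
  .set("spark.reducer.maxBlocksInFlightPerAddress", "100")
```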
    I checked the connections of one shuffle service yesterday; 5K connections
is very common during the night. That is easy to reach, say, when 5K reducers
are running at the same time. What if there are 5 applications and each has 5K
reducers? That would be 25K connections. If each connection is fetching 100
blocks and each `Entry` is 1KB, the memory cost is about 2.5GB. I think that's
too much.
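    The arithmetic behind that estimate, as a quick sketch (the ~1KB-per-Entry
figure is from the heap dump above; the connection and block counts are the
hypothetical scenario, not a measurement):

```scala
// Back-of-envelope memory held by pending ChannelOutboundBuffer entries.
val connections   = 25000   // 5 applications x 5K reducers
val blocksPerConn = 100     // blocks in flight per connection
val bytesPerEntry = 1024L   // ~1KB observed per ChannelOutboundBuffer.Entry
val totalBytes = connections.toLong * blocksPerConn * bytesPerEntry
println(f"${totalBytes / 1e9}%.2f GB")  // ~2.56 GB, roughly the 2.5GB above
```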
    So I'm still proposing concurrency control. Differently from the current
change, can we control the number of blocks being transferred? If that number
goes above a water mark, we can fail newly arriving `OpenBlocks` requests.
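    To make the idea concrete, here is a rough, purely hypothetical sketch of
such a water mark on the shuffle-service side. The class and method names are
made up for illustration; this is not existing shuffle-service code.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical back-pressure helper: reject an OpenBlocks request when the
// number of blocks currently being transferred exceeds a high-water mark.
class BlockTransferLimiter(highWaterMark: Long) {
  private val inFlightBlocks = new AtomicLong(0)

  /** Returns true if the request may proceed; false means fail the OpenBlocks. */
  def tryAcquire(numBlocks: Int): Boolean = {
    val current = inFlightBlocks.addAndGet(numBlocks)
    if (current > highWaterMark) {
      inFlightBlocks.addAndGet(-numBlocks) // roll back and reject
      false
    } else {
      true
    }
  }

  /** Called when a block finishes (or fails) transferring. */
  def release(numBlocks: Int): Unit = inFlightBlocks.addAndGet(-numBlocks)
}
```

    In this sketch, the handler for `OpenBlocks` would call `tryAcquire` with
the requested block count and send back a failure instead of registering the
stream when it returns false, so a single busy connection cannot pile up
unbounded `ChannelOutboundBuffer` entries.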