Github user tgravescs commented on the pull request:
https://github.com/apache/spark/pull/12113#issuecomment-217521502
Right, you can still run out of memory even if you are under the broadcast limit. The
only way to prevent that would be to have real flow control in place, which I
originally had some of (see the jira), but the problem comes down to throttling vs.
sending fast. Netty unfortunately doesn't really support that either, so you
have to do it on your own. Eventually we could look at just getting rid of
the non-broadcast method, but I would want to do some more performance testing
on that first. The threadpool in the non-broadcast case just allows synchronizing so that
multiple threads aren't serializing the same thing in parallel.
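To make that threadpool point concrete, here is a minimal sketch of the idea, not the actual MapOutputTracker code: serialize a given shuffle's statuses at most once and hand every concurrent requester the same cached bytes. `StatusSerializer`, `serializedStatuses`, and the plain `Array[AnyRef]` element type are all illustrative assumptions.

```scala
import scala.collection.mutable

// Hypothetical sketch of deduplicated serialization. In Spark the statuses
// would be MapStatus objects; Array[AnyRef] is a stand-in here.
class StatusSerializer {
  private val cache = mutable.HashMap.empty[Int, Array[Byte]]

  // The lock means only one thread serializes a given shuffle's statuses;
  // everyone else blocks briefly and then reuses the cached bytes instead
  // of serializing the same array again in parallel.
  def serializedStatuses(shuffleId: Int, statuses: Array[AnyRef]): Array[Byte] =
    cache.synchronized {
      cache.getOrElseUpdate(shuffleId, serialize(statuses))
    }

  private def serialize(statuses: Array[AnyRef]): Array[Byte] = {
    val bos = new java.io.ByteArrayOutputStream()
    val oos = new java.io.ObjectOutputStream(bos)
    oos.writeObject(statuses)
    oos.close()
    bos.toByteArray
  }
}
```

The coarse lock trades a little latency for never paying the serialization cost (or the duplicated intermediate buffers) more than once per shuffle.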
Yes, if you read the jira you can see my reasoning and the further things I
considered. Personally I think that approach limits you for future enhancements: for
instance, being able to start reduce tasks before all map tasks finish, or being
able to refetch map statuses without restarting entire tasks. If we weren't
doing broadcast you could do more chunking. You also end up with a very large
task payload. I didn't spend a lot of time looking at that path, but I think you
could have the same issue there. When we launch a bunch of tasks we just have
a for loop sending them, so if we are shoving the map statuses into the task
info that is being sent over netty and we can't send it fast enough, you get
the memory bloat again.
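To illustrate that bloat concern, here is a hypothetical sketch; `launchAll`, `sendAsync`, and `serializeTask` are stand-ins, not Spark or Netty APIs. Every task payload embeds the same status blob, and the send loop has no flow control:

```scala
object TaskLaunchSketch {
  // Each task payload carries its own copy of the status blob:
  // taskId, blob length, blob bytes.
  def serializeTask(taskId: Int, statusBytes: Array[Byte]): Array[Byte] = {
    val bos = new java.io.ByteArrayOutputStream()
    val out = new java.io.DataOutputStream(bos)
    out.writeInt(taskId)
    out.writeInt(statusBytes.length)
    out.write(statusBytes)
    out.flush()
    bos.toByteArray
  }

  // Fire-and-forget loop: if sendAsync just queues each buffer on a channel
  // and the network can't keep up, N pending tasks pin roughly
  // N * statusBytes.length bytes in outbound buffers.
  def launchAll(numTasks: Int, statusBytes: Array[Byte],
                sendAsync: Array[Byte] => Unit): Unit = {
    for (taskId <- 0 until numTasks) {
      sendAsync(serializeTask(taskId, statusBytes))
    }
  }
}
```

With hypothetical numbers, 10,000 queued tasks each carrying a 10 MB status blob is on the order of 100 GB of buffered sends if nothing drains the queue.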
If we still have problems with this then we could look at doing that or
something else, but at this point this approach has been working very well for us (if
you read the description you can see my results).