GitHub user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/1618#issuecomment-54385505
@YanTangZhai Thanks for your PR, but I'm still not sure we want this
feature. If the data is heavily skewed, the user can already discover that
from the logs, since we log how many times we have spilled so far. Some
applications legitimately spill many times: imagine a huge dataset running
on somewhat beefy nodes. Every spill does real work, but simply by virtue
of the data's scale the application could cross any spill-count limit.
Killing it at that point would be unexpected behavior for these
applications, and we might fail a job after many hours of execution on a
false alarm. Do you see my point?
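
For context, here is a minimal Scala sketch contrasting the two behaviors
under discussion. This is not Spark's actual implementation; the names
`logSpill`, `checkThreshold`, `spillCount`, and `maxSpills` are all
hypothetical, and the log message is illustrative only.

```scala
// A sketch (not Spark code) of logging spills vs. failing on a spill-count
// threshold. All names and messages here are hypothetical.
object SpillPolicySketch {
  var spillCount = 0

  // Current behavior: each spill is logged, so a user can detect skew
  // from the logs without the job being interrupted.
  def logSpill(bytes: Long): Unit = {
    spillCount += 1
    println(s"Spilling in-memory map of $bytes bytes to disk (spill #$spillCount)")
  }

  // Proposed behavior: fail the application once spills exceed a threshold.
  // For a huge dataset this can fire after hours of legitimate work.
  def checkThreshold(maxSpills: Int): Unit = {
    if (spillCount > maxSpills) {
      throw new RuntimeException(
        s"Spilled $spillCount times, exceeding limit of $maxSpills")
    }
  }

  def main(args: Array[String]): Unit = {
    for (_ <- 1 to 5) logSpill(bytes = 64L * 1024 * 1024)
    checkThreshold(maxSpills = 3) // throws: 5 spills > limit of 3
  }
}
```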