[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21068 merged to master. Thanks @attilapiros ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user attilapiros commented on the issue: https://github.com/apache/spark/pull/21068 Here is the new task for the metrics: https://issues.apache.org/jira/browse/SPARK-24594.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test PASSed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92040/ Test PASSed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #92040 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92040/testReport)** for PR 21068 at commit [`f71c7c5`](https://github.com/apache/spark/commit/f71c7c547f173c902eec101d510c87d50c7abb86). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #92040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92040/testReport)** for PR 21068 at commit [`f71c7c5`](https://github.com/apache/spark/commit/f71c7c547f173c902eec101d510c87d50c7abb86).
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21068 retest this please
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test FAILed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92031/ Test FAILed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #92031 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92031/testReport)** for PR 21068 at commit [`f71c7c5`](https://github.com/apache/spark/commit/f71c7c547f173c902eec101d510c87d50c7abb86). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #92031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92031/testReport)** for PR 21068 at commit [`f71c7c5`](https://github.com/apache/spark/commit/f71c7c547f173c902eec101d510c87d50c7abb86).
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21068 Looks like it was modified to kill the application if all nodes are blacklisted, so I'm good with this approach.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91920/ Test PASSed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test PASSed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91920 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91920/testReport)** for PR 21068 at commit [`a462ce0`](https://github.com/apache/spark/commit/a462ce0f929fbd18e708dfc19ca6ad3af8b41315). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91920 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91920/testReport)** for PR 21068 at commit [`a462ce0`](https://github.com/apache/spark/commit/a462ce0f929fbd18e708dfc19ca6ad3af8b41315).
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21068 lgtm. Will leave open for a couple of days to let @tgravescs take a look.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21068 Jenkins, retest this please
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test FAILed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91907/ Test FAILed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91907/testReport)** for PR 21068 at commit [`a462ce0`](https://github.com/apache/spark/commit/a462ce0f929fbd18e708dfc19ca6ad3af8b41315). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Build finished. Test FAILed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91905/ Test FAILed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91905/testReport)** for PR 21068 at commit [`848d050`](https://github.com/apache/spark/commit/848d050eda54f31b14286af966dc9358e35658a6). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user attilapiros commented on the issue: https://github.com/apache/spark/pull/21068 Retested manually on a cluster; the PR's description is updated with the result.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91907 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91907/testReport)** for PR 21068 at commit [`a462ce0`](https://github.com/apache/spark/commit/a462ce0f929fbd18e708dfc19ca6ad3af8b41315).
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91905/testReport)** for PR 21068 at commit [`848d050`](https://github.com/apache/spark/commit/848d050eda54f31b14286af966dc9358e35658a6).
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test PASSed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91860/ Test PASSed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91860/testReport)** for PR 21068 at commit [`aa52f6e`](https://github.com/apache/spark/commit/aa52f6edb998d21e51d0d9a73351548034515a8e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91860/testReport)** for PR 21068 at commit [`aa52f6e`](https://github.com/apache/spark/commit/aa52f6edb998d21e51d0d9a73351548034515a8e).
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test PASSed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91779/ Test PASSed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91779 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91779/testReport)** for PR 21068 at commit [`7fce4ee`](https://github.com/apache/spark/commit/7fce4eec7294abb071200f1674293bfc2089f82b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91779 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91779/testReport)** for PR 21068 at commit [`7fce4ee`](https://github.com/apache/spark/commit/7fce4eec7294abb071200f1674293bfc2089f82b).
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91764/ Test FAILed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test FAILed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91764 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91764/testReport)** for PR 21068 at commit [`61f3d17`](https://github.com/apache/spark/commit/61f3d1718072c252298b6d8ddcca333d1cf122a3). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91764 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91764/testReport)** for PR 21068 at commit [`61f3d17`](https://github.com/apache/spark/commit/61f3d1718072c252298b6d8ddcca333d1cf122a3).
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21068 Tom and I had a chance to discuss this in person, and after some back and forth I think we decided that maybe it's best to remove the limit but have the application fail if the entire cluster is blacklisted. @tgravescs does that sound correct? I mentioned this briefly to @attilapiros and he mentioned that might be hard, but instead you could stop allocation blacklisting, which would result in the usual yarn app failure from too many executor failures. He's going to look at this a little more closely and report back here. I'd be OK with that -- the main goal is just to make sure that an app doesn't hang if you've blacklisted the entire cluster. I'm pretty sure that's @tgravescs's main concern as well. (If the only reasonable way to do that is with the existing limit, I'm fine w/ that too.)
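For illustration, the "fail if the entire cluster is blacklisted" check described above could be sketched roughly as follows. This is a Python sketch under stated assumptions, not the code in the PR: the function names are hypothetical, and the guard against a zero cluster size is an assumption of the sketch.

```python
def all_nodes_blacklisted(blacklisted_nodes, num_cluster_nodes):
    """Return True when every known cluster node is blacklisted."""
    # The zero-size guard is an assumption of this sketch: an unknown
    # cluster size should not count as "everything is blacklisted".
    return num_cluster_nodes > 0 and len(blacklisted_nodes) >= num_cluster_nodes


def fail_if_cluster_blacklisted(blacklisted_nodes, num_cluster_nodes):
    """Raise instead of waiting forever for containers that cannot be placed."""
    if all_nodes_blacklisted(blacklisted_nodes, num_cluster_nodes):
        raise RuntimeError(
            f"all {num_cluster_nodes} cluster nodes are blacklisted; "
            "failing the application"
        )
```

The point of the check is only to turn a potential hang into an explicit failure; it says nothing about a cluster that is merely busy.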
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Can one of the admins verify this patch?
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21068 hey sorry I have been meaning to respond to this but keep getting sidetracked. As Tom and I are going to meet in person next week anyway, I figure at this point it makes sense to just wait till we chat directly to make sure we're on the same page. It sounds like we're in agreement but at this point might as well wait a couple more days, as I haven't had a chance to do a final review anyway
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91307/ Test PASSed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test PASSed.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91307 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91307/testReport)** for PR 21068 at commit [`0e78b38`](https://github.com/apache/spark/commit/0e78b383b6f00cbcf7bab53885e7b38da0544dde). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21068 Well, the downside to that, as opposed to just failing the application, is similar to what @squito was mentioning: if the cluster is just busy and you can't get containers on those last few nodes, it could hang there for a long time. More than likely all you are going to leave is a single node not blacklisted. But I guess you can have that even with the BLACKLIST_RATIO; it's just that you can control that better. I guess perhaps for this PR we just leave it as is, like @squito mentioned, and have it off by default. Have a followup to add notification into the driver.
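As a rough illustration of what a ratio-based limit means here, the sketch below caps the number of nodes passed on to YARN at a fraction of the cluster size. It is a hypothetical Python sketch: the function name only echoes `mostRelevantSubsetOfBlacklistedNodes`, and ranking nodes by latest expiry time is an assumption of the sketch, not something this thread confirms.

```python
def most_relevant_subset(expiry_by_node, num_cluster_nodes, max_ratio):
    """Pick at most max_ratio * num_cluster_nodes blacklisted nodes.

    expiry_by_node maps a host name to the time its blacklisting expires.
    Ranking by latest expiry is an assumption made for this sketch.
    """
    limit = int(num_cluster_nodes * max_ratio)
    ranked = sorted(expiry_by_node, key=lambda node: expiry_by_node[node], reverse=True)
    return set(ranked[:limit])
```

With four cluster nodes and a ratio of 0.5, at most two blacklisted nodes are kept, so some nodes always stay available to YARN even if the blacklist itself names more of them.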
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #91307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91307/testReport)** for PR 21068 at commit [`0e78b38`](https://github.com/apache/spark/commit/0e78b383b6f00cbcf7bab53885e7b38da0544dde).
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user attilapiros commented on the issue: https://github.com/apache/spark/pull/21068 @tgravescs what about removing the YARN_BLACKLIST_MAX_NODE_BLACKLIST_RATIO config? When the set of blacklisted nodes reaches numClusterNodes, I would stop synchronising the blacklisted nodes toward YARN, so there would still be some nodes not blacklisted (YARN keeps the previous blacklisted state, so it still differs from the state shown in the UI, but only for a short time). The failures would still be counted, so finally the old mechanism using MAX_EXECUTOR_FAILURES (if configured) would stop the app. This way mostRelevantSubsetOfBlacklistedNodes() and the expiry from the scheduler's blacklisted nodes can be removed from the code.
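The proposal above can be sketched as a small state machine. This is a hypothetical Python sketch, not the PR's implementation: the class and attribute names are invented for illustration, and comparing the failure count with `>=` against the maximum is an assumption.

```python
class AllocatorBlacklistSketch:
    """Stops syncing the blacklist to YARN once it would cover the whole cluster."""

    def __init__(self, num_cluster_nodes, max_executor_failures):
        self.num_cluster_nodes = num_cluster_nodes
        self.max_executor_failures = max_executor_failures
        self.blacklisted = set()      # local view of blacklisted hosts
        self.synced_to_yarn = set()   # last blacklist actually sent to YARN
        self.failures = 0             # executor failures, always counted

    def on_allocation_failure(self, host):
        self.failures += 1
        self.blacklisted.add(host)
        # Only push the update while at least one node would stay usable;
        # otherwise YARN keeps the previously synced state.
        if len(self.blacklisted) < self.num_cluster_nodes:
            self.synced_to_yarn = set(self.blacklisted)

    def should_stop_app(self):
        # Assumption: the app stops once the count reaches the maximum.
        return self.failures >= self.max_executor_failures
```

Because YARN never sees a blacklist covering every node, containers keep being requested, failures keep accumulating, and the existing failure threshold eventually ends the application instead of leaving it hanging.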
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21068 so specifically on the limit, I'm ok with removing it as long as we have the basic check to fail. I guess perhaps you are saying the limit and that check are essentially the same thing? I was thinking that they were different in that if you remove the limit from yarn, then the driver and UI side wouldn't get out of sync since the only thing the yarn side would do is fail if it hit the condition that all nodes are blacklisted. If you leave the limit as is, like you mention it could be a bit confusing to the user as it could acquire an executor on the node that was blacklisted but on the yarn side we don't enforce due to the limit.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21068 I mean when `YarnAllocatorBlacklistTracker` decides to blacklist because of allocation failures, it doesn't send any message back to the driver -- so the driver doesn't have a msg in the logs, nor in the event log, nor a UI update. So in client mode, the user would need to get AM logs to know what was going on. Attila wanted to do it this way because of `mostRelevantSubsetOfBlacklistedNodes` -- it seemed weird to send an update to the driver when the blacklisting wasn't necessarily even in effect. Though now that I'm thinking about this, maybe it should just send the update anyway, even though that blacklist may effectively be ignored. Re: starvation -- I agree, though "eventually" for resources can be so long in practice that to users it all looks the same. Anyway, though you say you're OK with removing the limit, it seems like you feel more strongly about this than I do. So I think we can keep it; I don't think it prevents us from doing something else down the road. I do think we should add the notification to the driver, including a listener event, which just ignores `mostRelevantSubsetOfBlacklistedNodes`, unless anyone has a reason for not doing it. I suggest @attilapiros does that in a followup. If that plan sounds OK, then this is probably nearly ready to merge. But it's been a little while since I've looked closely so I'll do another pass (probably tomorrow).
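To make the suggested followup concrete, here is one way the driver notification could look in outline. Everything in this Python sketch is hypothetical (the event type, the bus class, and the function name are invented for illustration and are not Spark APIs); it only shows the idea of posting an event for every allocation-side blacklisting, regardless of which subset is actually sent to YARN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeBlacklistedForAllocation:
    """Hypothetical event: the allocator blacklisted a host after launch failures."""
    host: str
    failure_count: int


class ListenerBusSketch:
    """Hypothetical stand-in for a driver-side listener bus."""

    def __init__(self):
        self._listeners = []

    def add_listener(self, listener):
        self._listeners.append(listener)

    def post(self, event):
        for listener in self._listeners:
            listener(event)


def report_blacklisting(bus, host, failure_count):
    # Posted for every blacklisted host, even if the host is not part of
    # the subset that is currently enforced on the YARN side.
    bus.post(NodeBlacklistedForAllocation(host, failure_count))
```

A listener registered on the bus could then write to the driver log, the event log, or the UI, so that a client-mode user would not need the AM logs to see what happened.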
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21068 What do you mean by adding notification to the driver? Like I mentioned, I'm fine with removing the limit for now, but I think we have to do something here if the entire cluster gets blacklisted, otherwise users' jobs will just hang. It's one thing if resources aren't available at the moment (as that can happen regardless of blacklisting) and the assumption is they will eventually become available, but if spark has blacklisted all the nodes in the cluster we should just fail if we aren't going to run.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21068 I totally understand your motivation for wanting the limit. But I'm trying to balance that against behavior which might not really achieve the desired effect and be even more confusing in some cases. It won't achieve the desired effect if your cluster has more nodes, but they're all tied up in other applications. It'll be confusing to users if they see notification about blacklisting in the logs and UI, but then still see spark trying to use those nodes anyway. I wonder if putting this in will make it hard All that said, I don't have a great alternative now, other than just removing the limit entirely for the moment and adding notification to the driver. We could have a more general starvation detector, which wouldn't only look at node count, but also look at delays in acquiring containers and finding places to schedule tasks (related to SPARK-15815 & SPARK-22148), but I don't want to tackle all of that here.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21068 Ah, sorry, haven't had time to get back to this. Yeah, the driver interaction could be an issue. But whether it's the limit or just the yarn side blacklisting, I think you would need some interaction there, right? Or you would have to have similar logic that says all nodes are blacklisted on the yarn side and tells the application to fail. Otherwise you could blacklist the entire cluster based on container launch failures and it would be stuck because the driver blacklist wouldn't know about it. Personally I'd rather see a limit than the current failure as I think it would be more robust. In my opinion I would rather try it at some point and have it just fail the max task failures than not try at all. I've seen jobs fail if they only have 1 executor that gets blacklisted that could have worked fine if retried. The blacklisting logic isn't perfect. We do have the kill on blacklist, which I haven't used much at this point, which would also help that I guess. I guess for this I'm fine with removing the limit for now, since that is the current behavior on the driver side and communicating back to the driver blacklist could be complicated. We do need to handle the "all nodes are blacklisted on the yarn side" issue though. I was going to say this could just be handled by making sure spark.yarn.max.executor.failures is sane. I don't think that is really the case now, since with dynamic allocation it's just based on Int.MaxValue or whatever the user specifies, which could have nothing to do with the actual cluster size, but you might have a small cluster and someone might want to try hard and allow it to fail twice per node or something like that if the yarn blacklisting is off. So do we just need another check to fail if all, or a certain percentage, of the nodes are blacklisted? Did you have something in mind to replace the limit?
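The concern about `spark.yarn.max.executor.failures` can be shown with a small calculation. This Python sketch assumes the documented default of twice the number of executors with a minimum of 3, and assumes that with dynamic allocation the executor count is taken from the configured maximum (Int.MaxValue unless the user sets it), as the comment above describes. The function names and the per-node budget helper are hypothetical.

```python
INT_MAX = 2**31 - 1  # Scala's Int.MaxValue


def default_max_executor_failures(num_executors, dynamic_allocation=False,
                                  max_executors=INT_MAX):
    """Assumed default: max(3, 2 * executors), saturating at INT_MAX."""
    effective = max_executors if dynamic_allocation else num_executors
    doubled = INT_MAX if effective > INT_MAX // 2 else 2 * effective
    return max(3, doubled)


def per_node_failure_budget(num_cluster_nodes, failures_per_node=2):
    """Hypothetical alternative: tie the threshold to the cluster size."""
    return num_cluster_nodes * failures_per_node
```

With dynamic allocation and no explicit maximum, the first function returns `INT_MAX`, which has nothing to do with the cluster size; a five-node cluster with a budget of two failures per node would instead give a threshold of 10.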
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21068 ping @tgravescs. Honestly, I still don't love the blacklist limit, especially since it makes reporting back to the driver pretty confusing, and I don't think it buys us much. But I can live with it, and otherwise I think this is ready. I've also looked at Attila's tests on a real cluster.
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90125/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #90125 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90125/testReport)** for PR 21068 at commit [`2a8ab8d`](https://github.com/apache/spark/commit/2a8ab8d818fa92e563e31e2d904d3ca6871b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #90125 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90125/testReport)** for PR 21068 at commit [`2a8ab8d`](https://github.com/apache/spark/commit/2a8ab8d818fa92e563e31e2d904d3ca6871b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90062/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #90062 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90062/testReport)** for PR 21068 at commit [`5760d22`](https://github.com/apache/spark/commit/5760d2285f6e473d1e79a5e9680960069d6205f3). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #90062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90062/testReport)** for PR 21068 at commit [`5760d22`](https://github.com/apache/spark/commit/5760d2285f6e473d1e79a5e9680960069d6205f3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user attilapiros commented on the issue: https://github.com/apache/spark/pull/21068 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user attilapiros commented on the issue: https://github.com/apache/spark/pull/21068 I assume it is just a flaky R test. Jenkins retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90043/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #90043 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90043/testReport)** for PR 21068 at commit [`5760d22`](https://github.com/apache/spark/commit/5760d2285f6e473d1e79a5e9680960069d6205f3). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #90043 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90043/testReport)** for PR 21068 at commit [`5760d22`](https://github.com/apache/spark/commit/5760d2285f6e473d1e79a5e9680960069d6205f3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89903/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #89903 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89903/testReport)** for PR 21068 at commit [`17bbbee`](https://github.com/apache/spark/commit/17bbbee0cf952a32e44fd0767bba08814e351da2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89898/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #89898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89898/testReport)** for PR 21068 at commit [`4df2311`](https://github.com/apache/spark/commit/4df231177343e6be04ec76d8c65e886763a5a152). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #89903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89903/testReport)** for PR 21068 at commit [`17bbbee`](https://github.com/apache/spark/commit/17bbbee0cf952a32e44fd0767bba08814e351da2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #89898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89898/testReport)** for PR 21068 at commit [`4df2311`](https://github.com/apache/spark/commit/4df231177343e6be04ec76d8c65e886763a5a152). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89889/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #89889 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89889/testReport)** for PR 21068 at commit [`0ba8510`](https://github.com/apache/spark/commit/0ba85108584d4e2c5649679a10543f9d2cfe367c). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #89889 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89889/testReport)** for PR 21068 at commit [`0ba8510`](https://github.com/apache/spark/commit/0ba85108584d4e2c5649679a10543f9d2cfe367c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21068 A couple more high-level thoughts: 1) Do we want to have an event posted about the node getting blacklisted? I think it would be useful. But then there needs to be a message from the YarnAllocator back to the driver about the blacklisting. 2) I was thinking about how this interacts with [SPARK-13669](https://issues.apache.org/jira/browse/SPARK-13669). At first I was thinking this makes that entirely unnecessary, but I guess that is not true -- that is still useful if the external shuffle service goes down *after* the executor is started. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21068 ok sounds fine to me, so we should review as is then --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21068 @tgravescs on the blacklist ratio for task-based blacklisting -- there is nothing, but there are some related jiras: [SPARK-22148](https://issues.apache.org/jira/browse/SPARK-22148) & [SPARK-15815](https://issues.apache.org/jira/browse/SPARK-15815). To be honest I have doubts about the utility of the ratio ... if you really want to make sure blacklisting doesn't lead to starvation, you've got to have some other mechanism, as you could easily have the remaining nodes be occupied or have insufficient resources. Kubernetes doesn't do anything with the node blacklisting currently: [SPARK-23485](https://issues.apache.org/jira/browse/SPARK-23485). Mesos already has a notion of blacklisting nodes for failing to allocate containers, but it's currently at odds with the task-based blacklist. https://github.com/apache/spark/pull/20640 is somewhat stalled because blacklisting based on allocation failures is missing in a general sense. In any case, I still think we shouldn't make the code more complex for something other cluster managers *might* use in the future, and that the current overall organization is fine. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89514/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #89514 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89514/testReport)** for PR 21068 at commit [`c92a090`](https://github.com/apache/spark/commit/c92a090e6e3c1dc5776eef1946a28b45731e128b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21068 Thanks for filing that jira @squito, I agree we should have blacklisting work with dynamic allocation disabled as well. (A bit of a tangent from this jira) I'm actually wondering now about the scheduler blacklisting and whether it should have a max blacklisted ratio as well. I don't remember if we discussed this previously. For this, I'm fine either way. If there are people interested in doing the mesos/kubernetes stuff now, we could certainly coordinate with them to see if there is something common we could do now. I haven't had time to keep up with those jiras to know, though. Otherwise this isn't public facing, so we can do that when they decide to implement it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21068 I think Tom makes a good case for why this should live in the YarnAllocator as you have it. I also don't think you need to worry about creating an abstract class yet; that refactoring can be done when another cluster manager tries to share some code ... it would just be helpful to keep that use in mind. Also, I filed https://issues.apache.org/jira/browse/SPARK-24016 for updating the task-based node blacklist even with static allocation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #89514 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89514/testReport)** for PR 21068 at commit [`c92a090`](https://github.com/apache/spark/commit/c92a090e6e3c1dc5776eef1946a28b45731e128b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user attilapiros commented on the issue: https://github.com/apache/spark/pull/21068 Yes, we can create an abstract class from `YarnAllocatorBlacklistTracker` (like `AbstractAllocatorBlacklistTracker`) where the method `synchronizeBlacklistedNodes` can have different implementations. In this case the core and the messages can stay as they are. As I see it, this is the less risky and cheaper solution. On the other hand, having the complete blacklisting in the driver is a more centralized and clearer design. We just have to make up our minds about where to go from here. Any help and suggestions for the decision are welcome. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
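For readers following the thread, the abstract-class option described above might be shaped roughly as follows. This is a hedged sketch rather than the PR's code: only the names `AbstractAllocatorBlacklistTracker` and `synchronizeBlacklistedNodes` come from the comment, while the constructor parameter, the failure counting and the `RecordingBlacklistTracker` subclass are assumptions made for illustration.

```scala
import scala.collection.mutable

// Hypothetical sketch: the shared bookkeeping lives in the base class and
// each cluster manager supplies its own way of publishing the blacklist.
abstract class AbstractAllocatorBlacklistTracker(maxFailuresPerNode: Int) {
  private val failuresPerNode = mutable.Map.empty[String, Int]
  private var schedulerBlacklist: Set[String] = Set.empty

  /** Nodes blacklisted because executors repeatedly failed to start there. */
  def allocatorBlacklist: Set[String] =
    failuresPerNode.filter(_._2 >= maxFailuresPerNode).keySet.toSet

  /** Records one allocation failure on `host` and republishes the blacklist. */
  def handleAllocationFailure(host: String): Unit = {
    failuresPerNode(host) = failuresPerNode.getOrElse(host, 0) + 1
    synchronizeBlacklistedNodes(schedulerBlacklist ++ allocatorBlacklist)
  }

  /** Takes the task-based blacklist computed by the scheduler on the driver. */
  def setSchedulerBlacklistedNodes(nodes: Set[String]): Unit = {
    schedulerBlacklist = nodes
    synchronizeBlacklistedNodes(schedulerBlacklist ++ allocatorBlacklist)
  }

  /** Cluster-manager-specific step, e.g. informing YARN or Mesos. */
  protected def synchronizeBlacklistedNodes(nodes: Set[String]): Unit
}

// Illustrative subclass that only remembers what it was asked to publish.
class RecordingBlacklistTracker(maxFailuresPerNode: Int)
    extends AbstractAllocatorBlacklistTracker(maxFailuresPerNode) {
  var lastSynced: Set[String] = Set.empty
  override protected def synchronizeBlacklistedNodes(nodes: Set[String]): Unit =
    lastSynced = nodes
}
```

A YARN-specific subclass would replace the recording step with a call to the resource manager, leaving the counting logic untouched.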
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21068
> actually the only other thing I need to make sure is there aren't any delays if we now send the information from yarn allocator back to scheduler and then I assume it would need to get it back again from scheduler. During that the yarn allocator could be calling allocate() and updating things. So we need to make sure it gets the most up to date blacklist.
> also I need to double check but the blacklist information isn't being sent to the yarn allocator when dynamic allocation is off right? We would want that to happen.
Yeah, both good points. Actually, don't we want to update the general node blacklist on the yarn allocator even when dynamic allocation is off? I don't think it gets updated at all unless dynamic allocation is on. It seems all the updates originate in `ExecutorAllocationManager`; the blacklist never actively pushes updates to the yarn allocator. That seems like an existing shortcoming.
> do you know if mesos and/or kubernetes can provide this same information?
I don't know about kubernetes at all. Mesos does provide info when a container fails. I don't think it lets you know the total cluster size, but that should be optional. Btw, node count is never going to be totally sufficient, as the remaining nodes might not actually be able to run your executors (smaller hardware, always taken up by higher priority applications, other constraints in a framework like mesos). It's always going to be best effort.
@attilapiros and I discussed this briefly yesterday: an alternative to moving everything into the BlacklistTracker on the driver is to just have some abstract base class, which is changed slightly for each cluster manager. Then you could keep the flow like it is here, with the extra blacklisting still living in YarnAllocator.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21068 Actually the only other thing I need to make sure of is that there aren't any delays if we now send the information from the yarn allocator back to the scheduler, and then I assume it would need to get it back again from the scheduler. During that time the yarn allocator could be calling allocate() and updating things. So we need to make sure it gets the most up-to-date blacklist. Also, I need to double check, but the blacklist information isn't being sent to the yarn allocator when dynamic allocation is off, right? We would want that to happen. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
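One way to picture the "most up-to-date blacklist" requirement is to merge the scheduler's view with the allocator's own view immediately before each allocation request and send only the difference. The sketch below is an assumption-laden illustration, not the PR's implementation: `BlacklistClient` is a hypothetical stand-in for the resource manager client (loosely modeled on an additions/removals style update call), and `BlacklistSynchronizer` and `sync` are made-up names.

```scala
// Hypothetical stand-in for the resource manager client; not a real API.
trait BlacklistClient {
  def updateBlacklist(additions: Seq[String], removals: Seq[String]): Unit
}

final class BlacklistSynchronizer(client: BlacklistClient) {
  // The set of nodes the resource manager was last told about.
  private var lastSent: Set[String] = Set.empty

  /**
   * Intended to run right before each allocation request, so the request is
   * made against the union of the latest scheduler and allocator blacklists.
   */
  def sync(schedulerNodes: Set[String], allocatorNodes: Set[String]): Unit = {
    val current = schedulerNodes ++ allocatorNodes
    val additions = (current -- lastSent).toSeq.sorted
    val removals = (lastSent -- current).toSeq.sorted
    // Skip the call entirely when nothing changed since the last sync.
    if (additions.nonEmpty || removals.nonEmpty) {
      client.updateBlacklist(additions, removals)
    }
    lastSent = current
  }
}
```

Because the allocator-side set is read locally at sync time, a node blacklisted for launch failures does not have to make a round trip through the driver before it takes effect.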
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89373/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #89373 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89373/testReport)** for PR 21068 at commit [`57086bb`](https://github.com/apache/spark/commit/57086bb1369a522e19bc92f64607b453743605c7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #89373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89373/testReport)** for PR 21068 at commit [`57086bb`](https://github.com/apache/spark/commit/57086bb1369a522e19bc92f64607b453743605c7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89355/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21068 **[Test build #89355 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89355/testReport)** for PR 21068 at commit [`e49bd0d`](https://github.com/apache/spark/commit/e49bd0de5c25df4eb65ba975e948e043c0e076cf). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21068 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89350/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org