[ https://issues.apache.org/jira/browse/AURORA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15479046#comment-15479046 ]
Zameer Manji commented on AURORA-1769:
--------------------------------------
[~maximk]: I don't think that's sufficient. In reality, doing any blocking in
any event subscriber will delay propagation of events. Apply the following
patch to your repo:
{noformat}
diff --git c/examples/vagrant/upstart/aurora-scheduler.conf w/examples/vagrant/upstart/aurora-scheduler.conf
index 91b27d7..f7419d4 100644
--- c/examples/vagrant/upstart/aurora-scheduler.conf
+++ w/examples/vagrant/upstart/aurora-scheduler.conf
@@ -51,4 +51,5 @@ exec bin/aurora-scheduler \
-mesos_role=aurora-role \
-populate_discovery_info=true \
-receive_revocable_resources=true \
- -allow_gpu_resource=true
+ -allow_gpu_resource=true \
+ -webhook_config=/home/vagrant/aurora/src/main/resources/org/apache/aurora/scheduler/webhook.json
diff --git c/src/main/java/org/apache/aurora/scheduler/events/Webhook.java w/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
index e54aa19..ed61ac0 100644
--- c/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
+++ w/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
@@ -13,6 +13,7 @@
*/
package org.apache.aurora.scheduler.events;
+import com.google.common.base.Throwables;
import java.io.DataOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
@@ -23,6 +24,8 @@ import com.google.common.eventbus.Subscribe;
import com.google.inject.Inject;
+import org.apache.aurora.common.quantity.Amount;
+import org.apache.aurora.common.quantity.Time;
import org.apache.aurora.scheduler.events.PubsubEvent.EventSubscriber;
import org.apache.aurora.scheduler.events.PubsubEvent.TaskStateChange;
import org.slf4j.Logger;
@@ -104,7 +107,11 @@ public class Webhook implements EventSubscriber {
*/
@Subscribe
public void taskChangedState(TaskStateChange stateChange) {
- String eventJson = stateChange.toJson();
- callEndpoint(eventJson);
+ int i = Amount.of(15, Time.SECONDS).as(Time.MILLISECONDS);
+ try {
+ Thread.sleep(i);
+ } catch (InterruptedException e) {
+ Throwables.propagate(e);
+ }
}
}
{noformat}
Then, in vagrant, create a job with 100 tasks and restart the scheduler. You
will see that it never registers within one minute, because the async worker
for the event bus is blocked delivering {{TaskStateChange}} events. You can
confirm this by checking {{/threads}}: the {{AsyncProcessor-*}} threads are
blocked in the {{Webhook}} class.
Since calling an external HTTP server can block for an unknown amount of time,
I think the solution here is to make the hook async and have the event
subscriber just place the event in a queue for processing. The webhook can
then have its own thread pool for sending the requests out.
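A rough sketch of what I mean, not a patch against the real {{Webhook}} class.
The class name, the constructor parameters, the {{Consumer<String>}} standing in
for {{callEndpoint}}, and the drop-oldest policy when the queue fills up are all
assumptions made for illustration. In Aurora the subscriber method would keep
its {{@Subscribe}} annotation and take a {{TaskStateChange}}.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of an async webhook; not the actual Aurora implementation. */
final class AsyncWebhookSketch {
  // Dedicated pool so slow HTTP calls never run on the event bus threads.
  private final ThreadPoolExecutor senders;
  // Stand-in for Webhook.callEndpoint(String); may block for an unknown time.
  private final Consumer<String> endpointCaller;

  AsyncWebhookSketch(int threads, int queueCapacity, Consumer<String> endpointCaller) {
    this.endpointCaller = endpointCaller;
    // Bounded queue: when it is full, the oldest pending event is dropped
    // instead of blocking the publisher (an assumed policy, pick your own).
    this.senders = new ThreadPoolExecutor(
        threads, threads, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueCapacity),
        new ThreadPoolExecutor.DiscardOldestPolicy());
  }

  /** Subscriber side: only enqueues the event and returns immediately. */
  void taskChangedState(String eventJson) {
    senders.execute(() -> {
      try {
        endpointCaller.accept(eventJson);
      } catch (RuntimeException e) {
        // A failing endpoint must not kill the sender thread.
        System.err.println("webhook call failed: " + e);
      }
    });
  }

  /** Stops accepting events and waits for queued ones; true if fully drained. */
  boolean shutdown(long timeoutSeconds) {
    senders.shutdown();
    try {
      return senders.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

With this shape, a slow or hung endpoint only ties up the webhook's own sender
threads. The {{AsyncProcessor-*}} threads return as soon as the event is queued,
so other events on the bus are not held up behind the HTTP calls.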
> Enabling webhook is synchronous and could cause longer leader reelection cycle
> ------------------------------------------------------------------------------
>
> Key: AURORA-1769
> URL: https://issues.apache.org/jira/browse/AURORA-1769
> Project: Aurora
> Issue Type: Bug
> Reporter: Dmitriy Shirchenko
> Assignee: Dmitriy Shirchenko
>
> We had an issue where on scheduler leader reelection EventBus was full of
> TaskStateChange events and caused scheduler to not be able to post
> DriverRegistered() message which caused Aurora scheduler to not register
> within 1 minute.