[jira] [Commented] (AURORA-1493) create ELB-friendly endpoint to detect leading scheduler

Ashwin Murthy (JIRA) Wed, 30 Mar 2016 16:22:01 -0700

    [ 
https://issues.apache.org/jira/browse/AURORA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219051#comment-15219051
 ]


Ashwin Murthy commented on AURORA-1493:
---------------------------------------

Hi Bill, 

Thanks for your help! My current thoughts:

1. Create a new http endpoint called /IsLeader. Add this to LEADER_ENDPOINTS
   in JettyServerModule.java. 

2. Create a corresponding servlet class (similar to Locks.java). 

3. Implement a GET method where I can do something like:

Optional<HostAndPort> leaderHttp = getLeaderHttp();
Optional<HostAndPort> localHttp = getLocalHttp();

if (leaderHttp.isPresent() && leaderHttp.equals(localHttp)) {
  return LeaderStatus.LEADING;
}

4. This is similar to what LeaderRedirect::getRedirectStatus does.

5. If leader, return 200, else 503. 
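The steps above boil down to a small decision function. The following is a minimal sketch only, not the actual Aurora code: it assumes the leader and local addresses are available as plain "host:port" strings wrapped in java.util.Optional (the snippet above uses HostAndPort), and the class and method names are hypothetical.

```java
import java.util.Optional;

// Hypothetical sketch of the proposed check; not the real Aurora servlet.
final class LeaderHealthSketch {
  static final int SC_OK = 200;
  static final int SC_SERVICE_UNAVAILABLE = 503;

  // True only when a leader is known and its address matches the local one.
  static boolean isLeading(Optional<String> leaderHttp, Optional<String> localHttp) {
    return leaderHttp.isPresent() && leaderHttp.equals(localHttp);
  }

  // Maps leading status to the HTTP status code the endpoint would return.
  static int healthStatus(Optional<String> leaderHttp, Optional<String> localHttp) {
    return isLeading(leaderHttp, localHttp) ? SC_OK : SC_SERVICE_UNAVAILABLE;
  }
}
```

Note that an unknown leader (empty Optional) is reported as 503 here, so a scheduler never claims leadership when no leader has been elected.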

What do you think?

Bill Farner
Mar 24 (6 days ago)

to me 
From the ELB docs on health checks, I see this:

For HTTP/HTTPS, you must include a ping path in the string. HTTP is specified 
as a HTTP:port;/;PathToPing; grouping, for example 
"HTTP:80/weather/us/wa/seattle". In this case, a HTTP GET request is issued to 
the instance on the given port and path. Any answer other than "200 OK" within 
the timeout period is considered unhealthy.

This makes me wonder if the desired behavior is already there with any 
endpoints running through LeaderRedirectFilter (which includes 
LEADER_ENDPOINTS).  Those endpoints only return 200 if the instance is leading.

Can you double-check if we're already in good shape?
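To make the quoted ELB rule concrete, here is a minimal sketch of how a health-check target such as "HTTP:80/weather/us/wa/seattle" breaks down into protocol, port and ping path, and how a response would be judged. The class and method names are hypothetical and this is not ELB's implementation; it only models the "anything other than 200 is unhealthy" rule from the docs.

```java
// Hypothetical model of an ELB-style HTTP health-check target.
final class ElbHealthCheckSketch {
  final String protocol;
  final int port;
  final String path;

  private ElbHealthCheckSketch(String protocol, int port, String path) {
    this.protocol = protocol;
    this.port = port;
    this.path = path;
  }

  // Parses "PROTOCOL:port/path", e.g. "HTTP:80/weather/us/wa/seattle".
  static ElbHealthCheckSketch parse(String target) {
    int colon = target.indexOf(':');
    int slash = target.indexOf('/', colon + 1);
    if (colon < 0 || slash < 0) {
      throw new IllegalArgumentException("expected PROTOCOL:port/path, got: " + target);
    }
    int port = Integer.parseInt(target.substring(colon + 1, slash));
    return new ElbHealthCheckSketch(target.substring(0, colon), port, target.substring(slash));
  }

  // Per the quoted docs, only a 200 response counts as healthy.
  static boolean isHealthy(int statusCode) {
    return statusCode == 200;
  }
}
```

Under this rule, a 307 from a non-leading scheduler is judged unhealthy just like a 503 would be.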


Ashwin Murthy <[email protected]>
Mar 25 (5 days ago)

to Bill 
OK. I will test this in our env by setting up the health checks to go against 
one of the LEADER_ENDPOINTS and seeing if this works. 

Thanks Bill!


Ashwin Murthy <[email protected]>
Mar 25 (5 days ago)

to Bill 
Hi Bill, 

This is what I see, and I think it aligns with my understanding of how things 
work. When you issue any HTTP request to one of the non-leading schedulers, it 
sends a 307 Temporary Redirect. The Location header contains the leader's 
host:port and the path of the original request, and the HTTP client/browser 
then reconnects to the leader. This is what I see happen in our prod Aurora 
env: after I issue the request in the browser, the browser loads the redirected 
page from the leader.

====================
Request URL:http://<non-leader-hostname>:8082/slaves
Request Method:GET
Status Code:307 Temporary Redirect
Remote Address:127.0.0.1:8127
Response Headers
view source
Content-Length:0
Date:Fri, 25 Mar 2016 23:50:48 GMT
Location:http://10.162.9.54:8082/slaves
Server:Jetty(9.3.6.v20151106)
Request Headers
view source
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Host:<non-leader-hostname>:8082
Referer: <non-leader-hostname>:8082/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 
(KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36

I think we might need to add a new endpoint (say /isLeader) that is not part 
of the redirect filter and returns 200 or 500 accordingly. What do you 
think?
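The redirect seen in the capture above (same scheme and path, leader's host:port in the Location header) can be sketched as follows. This is an illustration of the observed behavior, not the LeaderRedirectFilter code; the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch of how a 307 Location value could be derived.
final class RedirectTargetSketch {
  // Keeps scheme, path and query of the original request; swaps in the leader's host and port.
  static URI redirectTarget(URI original, String leaderHost, int leaderPort) {
    try {
      return new URI(
          original.getScheme(),
          null,                 // no user info
          leaderHost,
          leaderPort,
          original.getPath(),
          original.getQuery(),
          null);                // no fragment
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException("cannot build redirect for " + original, e);
    }
  }
}
```

A client that follows the redirect ends up on the leader; a health checker that does not follow it only ever sees the 307.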


Bill Farner
Mar 25 (5 days ago)

to me 
That behavior matches my understanding, but according to the ELB docs, that should 
work for a health check (200=healthy, non-200=unhealthy).  Did you find 
otherwise in ELB behavior, or contradicting docs?


Ashwin Murthy <[email protected]>
Mar 25 (5 days ago)

to Bill 
Ah, I haven't tested the non-200 health check per se on ELB, but I can test this 
out in our prod env, which uses HAProxy for health checks. Our load balancer team 
did mention a 500-level error code, but let me confirm.

Thinking about this more: even if ELB treats a 307 as a failed health check, 
this seems like a hack to me. It is possible that other load balancers 
in fact honor the redirect. I used to work in Azure networking before Uber, and I 
know their L7 LB was planning to support handling redirects. 

From an HTTP perspective, it might be better to send a 500-level response.
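The concern about redirect-following load balancers can be illustrated with a toy model. This is only a sketch under stated assumptions: the scheduler names, the Response class and the probe method are hypothetical, and no real load balancer is being modeled. It shows that a prober which follows a 307 ends up seeing the leader's 200, while an explicit 503 is read the same way by every prober.

```java
import java.util.Map;

// Hypothetical toy model of a health prober that may or may not follow redirects.
final class RedirectProbeSketch {
  static final class Response {
    final int status;
    final String location; // redirect target host, or null when there is none

    Response(int status, String location) {
      this.status = status;
      this.location = location;
    }
  }

  // Returns the final status the prober observes for the given host.
  static int probe(Map<String, Response> schedulers, String host, boolean followRedirects) {
    Response response = schedulers.get(host);
    int hops = 0;
    // Follow at most 5 redirects to avoid looping forever.
    while (followRedirects && response.status == 307 && response.location != null && hops < 5) {
      response = schedulers.get(response.location);
      hops++;
    }
    return response.status;
  }
}
```

With a 307-based setup, a redirect-following prober would report the non-leader as 200 (healthy); with a 503 from a dedicated endpoint, both kinds of prober report it as unhealthy.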


Bill Farner
Mar 25 (5 days ago)

to me 
Looks like HAProxy allows you to specify the expected status code.

An HTTP load balancer following redirects sounds pretty bizarre, but I've seen 
stranger things :-)

At any rate, I'm cool with a /leaderhealth endpoint that is 200/503 based on 
leading status.  Let me know if that's what you want to do, and if you need any 
pointers to get going.


Ashwin Murthy <[email protected]>
Mar 28 (2 days ago)

to Bill 
Hi Bill, 

I will go ahead and add this. Does my proposed set of changes earlier in this 
thread sound about right?


Bill Farner
Mar 29 (1 day ago)

to me 
Yup, that's how you should approach it.

> create ELB-friendly endpoint to detect leading scheduler
> --------------------------------------------------------
>
>                 Key: AURORA-1493
>                 URL: https://issues.apache.org/jira/browse/AURORA-1493
>             Project: Aurora
>          Issue Type: Task
>          Components: Scheduler, Usability
>            Reporter: brian wickman
>            Assignee: Ashwin Murthy
>
> iiuc hitting the web ui for non-leading schedulers redirects to the leader.  
> this doesn't really help if the members of the ensemble are not publicly 
> routable.
> if there was a /leader endpoint that returned "200 OK" if it is leader and 
> some 3xx/4xx code if not, then it would be easier to configure an ELB to 
> route traffic to the correct leader, simplifying the use of aurora in an AWS 
> deployment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
