Maksim,
Great work!
Discussion inline below.
Recommended next steps:
1) Add the setup steps required to get all of this working to the user's guide. File a JIRA for this.
2) Figure out a way to automate these tests. This might be hard on Apache infra.
Kevin.

On 10/25/13 8:55 AM, Maksim Kononenko wrote:
Hi guys,

I was researching/testing Knox HA with Apache HTTP Server +  mod_proxy +
mod_proxy_balancer.
Here is what I found.
I.   Three load balancer scheduler algorithms are available: Request
Counting, Weighted Traffic Counting and Pending Request Counting. (
http://httpd.apache.org/docs/2.2/mod/mod_proxy_balancer.html#scheduler)
II.  Load balancer stickiness. (
http://httpd.apache.org/docs/2.2/mod/mod_proxy_balancer.html#stickyness)
      I configured and tested stickiness. It worked as expected.
III. Failover. (
http://httpd.apache.org/docs/2.2/mod/mod_proxy.html#proxypass)
      1. I ran the following use cases:
         a) Knox instance is down before client request comes in.
             Steps:
                 - Configure Apache HTTP Server to proxy two Knox instances;
                 - Shut down Knox instance A;
                 - Execute client request;
                 - Verify that Knox instance A is marked as unavailable and
the client's request is routed to Knox instance B;
                 - Verify that all subsequent requests within the same
client session are passed only to Knox instance B;
                 - Verify that requests in a new client session are
first tried against Knox instance A.
                   This is required because Knox instance A could have been
restarted before the new client session began.
This seems a little sub-optimal to me, but there may be nothing we can do about it. My concern is that I don't think Apache should be trying instance-A first every time in this case. So the question is: how is Apache distributing load over instance-A and instance-B? Does it always try instance-A first, or does it sometimes try instance-B first? In addition, if it gets a failure for instance-A, ideally it would take it out of the "pool" for some (ideally configurable) period of time.
             This use case works fine.
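
The behavior asked about above (skip a failed instance for a configurable period, then try it again) can be illustrated with a toy model. This is a minimal sketch, not Apache's implementation: `Worker`, `Balancer` and `retry_seconds` are hypothetical names, and plain round-robin stands in for whichever scheduler is configured. The `retry_seconds` value is loosely modeled on the `retry` worker parameter documented for ProxyPass; whether the tested setup behaves this way is an assumption that would need to be verified.

```python
class Worker:
    """One backend (for example a Knox instance) behind the balancer."""
    def __init__(self, name):
        self.name = name
        self.up = True           # whether the backend is actually reachable
        self.error_until = 0.0   # balancer skips this worker before this time


class Balancer:
    """Toy round-robin balancer with a retry window (illustration only)."""
    def __init__(self, workers, retry_seconds=60.0):
        self.workers = workers
        self.retry_seconds = retry_seconds
        self._next = 0           # index of the worker to try first

    def route(self, now):
        """Return the name of the worker that served the request, or None."""
        n = len(self.workers)
        for offset in range(n):
            worker = self.workers[(self._next + offset) % n]
            if now < worker.error_until:
                continue         # still in the error window: skip without trying
            if worker.up:
                self._next = (self._next + offset + 1) % n
                return worker.name
            # Connection failed: take the worker out of rotation for a while.
            worker.error_until = now + self.retry_seconds
        return None


a, b = Worker("A"), Worker("B")
lb = Balancer([a, b], retry_seconds=60.0)
a.up = False
print(lb.route(now=0))    # B  (A is tried, fails, and is marked in error)
print(lb.route(now=10))   # B  (A is skipped without being tried)
a.up = True
print(lb.route(now=61))   # A  (retry window has expired, A is tried again)
```

In this model a failed instance costs one failed attempt per retry window rather than one per new session, which is the property the question is really about.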
         b) Knox instance goes down while it is processing a client's PUT
request.
             Steps:
                 - Start a PUT of a medium-sized file (200 MB) to HDFS;
                 - After some time, shut down the Knox instance that is
processing this request;
                 - Verify that the client gets a 500 status code and no
failover takes place.
             This use case works as described. Apache HTTP Server is
not able to do failover in this case.
         c) Knox instance goes down while it is processing a client's GET
request.
             Steps:
                 - Start a GET of a medium-sized file (200 MB) from HDFS;
                 - After some time, shut down the Knox instance that is
processing this request;
                 - Verify that the client gets a 200 status code, a
'Content-Length' header whose value equals the file size, and only part of
the bytes in the body.
                   To execute this test I used the following clients:
                     1) HttpClient - it doesn't produce any error when the
stream is closed.
                     2) cURL - it doesn't produce any error when the stream
is closed.
                     3) Firefox browser - it doesn't produce any error when
the stream is closed.
                   All clients just download the bytes available before the
stream is closed, so the client has to manually compare the 'Content-Length'
header value with the number of bytes received.
                 - No failover takes place.
             This use case works as described. Apache HTTP Server is
not able to do failover in this case.
This is unexpected and unfortunate.
I would have hoped that HttpClient and cURL at least would provide some indication that the stream was incomplete according to the Content-Length header. The only thing I would recommend trying is to take Knox out of the picture: use cURL to GET the same file directly from HDFS, kill the DataNode halfway through the stream, and check whether you see the same behavior on the client side.
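
The manual check described above (comparing the 'Content-Length' header with the number of bytes actually received) could be sketched as follows. This is a minimal illustration under stated assumptions: `read_and_check` and `chunk_size` are hypothetical names, and `stream` is any object with a file-like `read(n)` method, such as an HTTP response body. An in-memory stream is used here in place of a real download.

```python
import io


def read_and_check(stream, content_length, chunk_size=64 * 1024):
    """Drain `stream` and compare the byte count with Content-Length.

    Returns (received_bytes, is_complete).
    """
    expected = int(content_length)
    received = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:            # empty read: the stream has ended or was closed
            break
        received += len(chunk)   # a real client would also write the chunk out
    return received, received == expected


# Simulate a response that advertised 100 bytes but was cut off after 40.
truncated = io.BytesIO(b"x" * 40)
print(read_and_check(truncated, "100"))   # (40, False)

# Simulate a complete response.
complete = io.BytesIO(b"x" * 100)
print(read_and_check(complete, "100"))    # (100, True)
```

A test client built this way reports a truncated download explicitly, regardless of whether the underlying HTTP library raises an error on a short read.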
      2. Additional use cases.
         What additional cases would you suggest?
I just want to confirm that you have tested a scenario for HDFS where the call to the NameNode goes to instance-A and the subsequent call to the DataNode goes to instance-B and this works.
IV. What functionality did I miss?
Other than the note above I don't see anything missing.

Maksim.



