[ https://issues.apache.org/jira/browse/MESOS-5340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275888#comment-15275888 ]
haosdent commented on MESOS-5340:
---------------------------------
According to my test, I don't think this is related to SSL downgrading: it also
happens when we {{export SSL_SUPPORT_DOWNGRADE=false}}.
{code}
# Console 1
$ telnet localhost 5050
{code}
{code}
# Console 2
$ curl https://www.haosdent.me:5050/master/slaves
# stuck
{code}
This happens because the accept handling logic in {{process.cpp}} is serial:
{code}
void on_accept(const Future<Socket>& socket)
{
  LOG(INFO) << "Start accept socket";
  if (socket.isReady()) {
    // Inform the socket manager for proper bookkeeping.
    socket_manager->accepted(socket.get());

    const size_t size = 80 * 1024;
    char* data = new char[size];

    DataDecoder* decoder = new DataDecoder(socket.get());

    // Reading is asynchronous: decode_recv runs once recv() completes.
    socket.get().recv(data, size)
      .onAny(lambda::bind(
          &internal::decode_recv,
          lambda::_1,
          data,
          size,
          new Socket(socket.get()),
          decoder));
  }

  // The next accept() is only requested here, i.e. after the previous
  // Future<Socket> has completed (ready, failed or discarded).
  __s__->accept()
    .onAny(lambda::bind(&on_accept, lambda::_1));
}
{code}
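{{process}} only continues to handle the next {{Future<Socket>}} item from
{{LibeventSSLSocketImpl::accept_queue}} after the current one succeeds or fails.
A connection whose handshake never completes (e.g. an idle telnet session)
therefore blocks every connection queued behind it.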
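The head-of-line blocking caused by chaining each accept() on the previous future can be illustrated with a tiny model. This is a sketch under stated assumptions, not libprocess code: the `Time` alias and the `serial_dispatch`, `independent_dispatch` and `show` helpers are hypothetical, and each queued connection is reduced to the time at which its handshake would complete (or never).

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

// Hypothetical model (not libprocess code): entry i is the time in ms at
// which the i-th queued connection finishes its handshake; std::nullopt
// means it never does, e.g. an idle telnet session that sends no bytes.
using Time = std::optional<long>;

// Serial dispatch, mirroring on_accept(): the next accept() is chained on
// the previous future, so connection i is handed out no earlier than
// connection i - 1, and never if anything ahead of it never completes.
std::vector<Time> serial_dispatch(const std::vector<Time>& done) {
  std::vector<Time> handled;
  handled.reserve(done.size());
  Time prev = 0L;
  for (const Time& d : done) {
    if (prev && d) {
      prev = std::max(*prev, *d);
    } else {
      prev = std::nullopt;  // head-of-line blocked from here on
    }
    handled.push_back(prev);
  }
  return handled;
}

// Independent dispatch, for comparison: each connection is handed out as
// soon as its own handshake completes, whatever is queued ahead of it.
std::vector<Time> independent_dispatch(const std::vector<Time>& done) {
  return done;
}

// Renders a dispatch time for logging: "5ms" or "never".
std::string show(const Time& t) {
  return t ? std::to_string(*t) + "ms" : "never";
}
```

With `{std::nullopt, 5L, 10L, 15L}` (a silent connection followed by three curl requests), `serial_dispatch` yields "never" for all four entries, while `independent_dispatch` yields "never", "5ms", "10ms", "15ms". That matches the observed hang: one silent connection at the head of the queue stalls everything after it.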
> SSL-downgrading support may prevent new connections
> ---------------------------------------------------
>
> Key: MESOS-5340
> URL: https://issues.apache.org/jira/browse/MESOS-5340
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.29.0, 0.28.1
> Reporter: Till Toenshoff
> Priority: Blocker
> Labels: ssl
>
> When using an SSL-enabled build of Mesos in combination with SSL-downgrading
> support, any connection that does not actually transmit data will hang the
> process (e.g. the master).
> To reproduce the issue (on any platform), spin up a master with SSL
> downgrading enabled:
> {noformat}
> $ export SSL_ENABLED=true
> $ export SSL_SUPPORT_DOWNGRADE=true
> $ export SSL_KEY_FILE=/path/to/your/foo.key
> $ export SSL_CERT_FILE=/path/to/your/foo.crt
> $ export SSL_CA_FILE=/path/to/your/ca.crt
> $ ./bin/mesos-master.sh --work_dir=/tmp/foo
> {noformat}
> Create some artificial HTTP request load to quickly spot the problem in
> both the master logs and the output of curl itself:
> {noformat}
> $ while true; do
>     sleep 0.1
>     echo $(date +">%H:%M:%S.%3N"; curl -s -k -A "SSL Debug" http://localhost:5050/master/slaves; echo; date +"<%H:%M:%S.%3N"; echo)
> done
> {noformat}
> Now create a connection to the master that does not transmit any data:
> {noformat}
> $ telnet localhost 5050
> {noformat}
> You should now see the curl requests hang: the master stops responding to
> new connections. This persists until either some data is transmitted via
> the above telnet connection or that connection is closed.
> This problem was initially observed when running Mesos on an AWS cluster
> with internal ELB health checks enabled for the master node. Those
> health checks use long-lasting connections that do not transmit any
> data and are closed after a configurable duration. In our test environment,
> this duration was set to 60 seconds, so our master repeatedly became
> unresponsive for 60 seconds, then got "unstuck" for a brief period until it
> got stuck again.
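If telnet is not available, the silent connection used in the reproduction steps (and by the ELB health checks) can also be held open from C++. This is a minimal sketch assuming a POSIX system and an IPv4 address; `hold_idle_connection` is a hypothetical helper, not part of Mesos or libprocess.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: opens a TCP connection to ipv4:port, sends nothing,
// keeps it open for `idle`, then closes it. Returns false if the address
// cannot be parsed or the connection cannot be established.
bool hold_idle_connection(const char* ipv4, uint16_t port,
                          std::chrono::milliseconds idle) {
  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return false;
  }

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);

  if (::inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1 ||
      ::connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    ::close(fd);
    return false;
  }

  // Transmit nothing, like the idle telnet session or ELB health check.
  std::this_thread::sleep_for(idle);
  ::close(fd);
  return true;
}
```

For example, `hold_idle_connection("127.0.0.1", 5050, std::chrono::seconds(60))` roughly mimics a 60-second health-check connection; while it is open, the curl loop above would be expected to hang.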