[ 
https://issues.apache.org/jira/browse/MESOS-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114982#comment-14114982
 ] 

Niklas Quarfot Nielsen commented on MESOS-1706:
-----------------------------------------------

After a bit of investigation, the problem is related to how libprocess _sends_ 
data and, as far as I can see, not to how it receives data.

When the outgoing queue is empty, the socket is closed (or more precisely, the 
socket object's ref count is decremented), which causes the reader end to close 
too, so the connection needs to be re-established repeatedly.

I can think of a path forward that is localized (compared to making the whole 
event system pluggable):
1) Instead of closing immediately, schedule the socket to be closed with an 
attached timeout. When the socket is used again, remove it from the "evict" 
queue.
2) Keep track of the number of connections being kept alive and trigger 
eviction on hitting the limit and periodically to clean up "timed out" 
connections.
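
A minimal sketch of steps 1) and 2), not taken from libprocess: the class name 
`IdlePool`, its methods, the peer-address string key and the raw `int` file 
descriptors are all hypothetical, and the caller is assumed to actually close 
whatever descriptors are returned. Time is passed in explicitly so that the 
eviction logic can be exercised without sleeping.

```cpp
#include <chrono>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical "evict queue" for idle outgoing sockets (not a libprocess API).
class IdlePool {
public:
  IdlePool(std::size_t limit, Clock::duration timeout)
    : limit_(limit), timeout_(timeout) {}

  // Step 1: when the outgoing queue drains, park the socket with a deadline
  // instead of closing it. Returns descriptors the caller should close now
  // (a replaced entry for the same peer, or entries over the limit).
  std::vector<int> release(const std::string& peer, int fd,
                           Clock::time_point now) {
    std::vector<int> toClose;
    auto it = index_.find(peer);
    if (it != index_.end()) {
      toClose.push_back(it->second->fd);
      queue_.erase(it->second);
      index_.erase(it);
    }
    queue_.push_back(Entry{peer, fd, now + timeout_});
    index_[peer] = std::prev(queue_.end());
    // Step 2 (limit): evict the longest-idle sockets first.
    while (queue_.size() > limit_) {
      toClose.push_back(popFront());
    }
    return toClose;
  }

  // Step 1 (reuse): take the socket back out of the evict queue.
  // Returns -1 if no idle socket is parked for this peer.
  int acquire(const std::string& peer) {
    auto it = index_.find(peer);
    if (it == index_.end()) return -1;
    int fd = it->second->fd;
    queue_.erase(it->second);
    index_.erase(it);
    return fd;
  }

  // Step 2 (periodic): return descriptors whose idle timeout has passed.
  // Entries share one timeout, so the queue is ordered by deadline as long
  // as `now` never goes backwards.
  std::vector<int> evictExpired(Clock::time_point now) {
    std::vector<int> toClose;
    while (!queue_.empty() && queue_.front().deadline <= now) {
      toClose.push_back(popFront());
    }
    return toClose;
  }

  std::size_t size() const { return queue_.size(); }

private:
  struct Entry {
    std::string peer;
    int fd;
    Clock::time_point deadline;
  };

  int popFront() {
    int fd = queue_.front().fd;
    index_.erase(queue_.front().peer);
    queue_.pop_front();
    return fd;
  }

  std::size_t limit_;
  Clock::duration timeout_;
  std::list<Entry> queue_;  // Longest-idle entry at the front.
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

The list plus index keeps both reuse and eviction at constant time per socket; 
a real change would also have to deal with the peer closing a parked socket.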

Thoughts? Anyone want to shepherd this kind of change?

_A quick hack showed that this (on my local box) gets us from around 8,000 
message round trips per second to more than 20,000 message round trips per 
second._

> Introduce socket / connection pooling to libprocess
> ---------------------------------------------------
>
>                 Key: MESOS-1706
>                 URL: https://issues.apache.org/jira/browse/MESOS-1706
>             Project: Mesos
>          Issue Type: Improvement
>          Components: libprocess
>            Reporter: Niklas Quarfot Nielsen
>
> Just wrote a libprocess connection throughput stress test (basically two 
> libprocess programs sending messages back and forth). One end is multihomed 
> so we can scale up the number of clients.
> The throughput with a single client (10 "concurrent" connections, or rather, 
> sending up to 10 messages before awaiting responses) is roughly 8,000 - 9,000 
> requests per second.
> I think I (accidentally) produced more load (around 30,000 requests per 
> second) - but I am running into one particular error in both cases: `Failed 
> to send, connect: Cannot assign requested address`.  According to 
> http://khanna111.com/articles/TCPAAIU.html - it seems the only way around it 
> is some kind of connection pooling (we already use SO_REUSEADDR). 
> It happens during connect() and hints that the machine is running out of 
> available ports on the sender end (when getting randomly assigned ports).
> {code}
> I0815 07:03:49.348409 30317 main.cpp:109] 8984.79 requests / second (delta: 
> 1.000356864secs)
> I0815 07:03:50.348898 30320 main.cpp:109] 8715.88 requests / second (delta: 
> 1.000473088secs)
> I0815 07:03:51.349040 30317 main.cpp:109] 8622.64 requests / second (delta: 
> 1.000157184secs)
> I0815 07:03:52.349184 30320 main.cpp:109] 9039.69 requests / second (delta: 
> 1.000144896secs)
> I0815 07:03:53.349478 30319 main.cpp:109] 8768.42 requests / second (delta: 
> 1.000293888secs)
> I0815 07:03:54.349954 30322 main.cpp:109] 8728.9 requests / second (delta: 
> 1.000470016secs)
> I0815 07:03:55.350334 30316 main.cpp:109] 8628.79 requests / second (delta: 
> 1.000371968secs)
> I0815 07:03:56.350957 30320 main.cpp:109] 8726.57 requests / second (delta: 
> 1.000621824secs)
> I0815 07:03:57.351474 30318 main.cpp:109] 8587.46 requests / second (delta: 
> 1.000529152secs)
> I0815 07:03:58.351805 30314 main.cpp:109] 8475.16 requests / second (delta: 
> 1.000335104secs)
> F0815 07:03:59.092653 30323 process.cpp:2197] Failed to send, connect: Cannot 
> assign requested address [99]
> *** Check failure stack trace: ***
> Aborted
> {code}
> One way to deal with it could be to introduce the notion of connection 
> pooling.
> Any thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
