Scott,

This is amazingly great! We are committed to improving the performance of
Mesos (and the underlying communication library, libprocess), and getting
very detailed information like this is incredibly valuable. There are
definitely a lot of low-hanging performance optimizations that we just
haven't had a chance to get to yet.

There is not a great document today that describes (a) the structural
relationships between different components or (b) the protocol used between
components (but it's pretty simple, it's HTTP with bodies that are just
binary protocol buffers). I'll try and take it upon myself to deliver this
document within the month. In the mean time, feel free to create a JIRA
ticket about improving the performance of Mesos so that we can drill down
to some real actionable code changes that can get committed.

Thanks so much!

Ben.



On Thu, Sep 20, 2012 at 11:24 AM, Scott Smith <[email protected]> wrote:

> I am working on a project using Dpark, a port of Spark to python.  We
> are running into issues where the overhead for scheduling a task is
> too high.  On the m1.small nodes I'm using, it takes a minimum of
> about 7ms to schedule and receive the completion for a scheduled task.
>  I set up timers around the call resourceOffers and statusUpdate, and
> I was able to get Dpark down to about 0.7-0.8ms per task (of that,
> 0.4-0.5ms is taken constructing the protobuf and calling launchTasks,
> so the actual Python CPU time is quite low), so it would seem that the
> mesos master and client library are taking the other 6.2ms.  This
> seems excessive, and I'm guessing there's some low hanging fruit to
> optimize.
>
> I captured the output using strace, and found this:
>
> [pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444,
> st_size=91, ...}) = 0
> [pid 25361] gettimeofday({1348162571, 917171}, NULL) = 0
> [pid 25361] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 135
> [pid 25361] fcntl(135, F_GETFL)         = 0x2 (flags O_RDWR)
> [pid 25361] fcntl(135, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> [pid 25361] fcntl(135, F_SETFD, FD_CLOEXEC) = 0
> [pid 25361] connect(135, {sa_family=AF_INET, sin_port=htons(55048),
> sin_addr=inet_addr("10.248.101.245")}, 16) = -1 EINPROGRESS (Operation
> now in progress)
> [pid 25361] write(88, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
> [pid 25362] <... epoll_wait resumed> {{EPOLLIN, {u32=88,
> u64=4294967384}}}, 64, 573) = 1
> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409706897}) = 0
> [pid 25362] read(88, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 25362] epoll_ctl(87, EPOLL_CTL_ADD, 135, {EPOLLOUT, {u32=135,
> u64=519691042951}}) = 0
> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409917492}) = 0
> [pid 25362] epoll_wait(87,  <unfinished ...>
> [pid 25361] <... write resumed> )       = 8
> [pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444,
> st_size=91, ...}) = 0
> [pid 25361] gettimeofday({1348162571, 918516}, NULL) = 0
> [pid 25361] gettimeofday({1348162571, 918635}, NULL) = 0
> [pid 25361] gettimeofday({1348162571, 918731}, NULL) = 0
> [pid 25362] <... epoll_wait resumed> {{EPOLLOUT, {u32=135,
> u64=519691042951}}}, 64, 572) = 1
> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 410856918}) = 0
> [pid 25362] getsockopt(135, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> [pid 25362] epoll_ctl(87, EPOLL_CTL_MOD, 135, {EPOLLOUT, {u32=135,
> u64=523986010247}}) = 0
> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411067635}) = 0
> [pid 25362] epoll_wait(87, {{EPOLLOUT, {u32=135, u64=523986010247}}},
> 64, 571) = 1
> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411206938}) = 0
> [pid 25362] sendto(135, "POST /slave(1)/mesos.internal.St"..., 301,
> MSG_NOSIGNAL, NULL, 0) = 301
> [pid 25362] close(135)                  = 0
> [pid 25362] shutdown(135, 0 /* receive */) = -1 EBADF (Bad file descriptor)
>
> From this there are several possible optimizations:
> 1. Don't context switch so often.  Opening a socket in one thread just
> to toss it to another thread is inefficient as the kernel must switch
> between processes, plus you need extra calls for the eventfd to signal
> the other thread.
> 2. Skip DNS resolution.  Unnecessary to stat resolv.conf so often.
> 3. Don't call shutdown after close.  You don't really need to call it at
> all.
> 4. Do more than 1 thing with a connection.  As far as I can tell, this
> is a communication from the client library to the slave; it should be
> possible for those to be longer living.
> 5. Don't use gettimeofday so often; it is unreliable since time can
> move backwards.  Excessive system calls in general should be avoided,
> so the 4 calls to gettimeofday and 5 calls to clock_gettime could be
> eliminated, and it could rely on some infrequently updated and coarse
> value stored in memory.
>
> I'm not sure how much of this will address my performance issues or
> not, but it seemed like a reasonable start.  I'm debating whether to
> dive into the code myself, but the few times I took a look I quickly
> got lost.  Is there a diagram on how the libraries/files fit together,
> or maybe a protocol description (so I could just replace it with a
> native python library?)
>
> Or alternatively, are there plans to improve the performance of Mesos?
>  I know the Spark authors have a mode to bypass Mesos in their next
> release; I'm debating whether to follow suit or not.
> --
>         Scott
>

Reply via email to