By the way, I tried calling 'link' for each new slave in
src/sched/sched.cpp.  It got rid of the excess connections, but only
helped ~5-10%.

I then noticed that the proxy scheduler converts objects to serialized
protobufs and back via Python in order to hand them over the wall.  I
tried the C++ protobuf implementation for Python, and that helped
another ~10%.
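
For reference, switching implementations is roughly just an
environment variable; it has to be set before any generated _pb2
modules are imported, and it assumes the protobuf package was built
with its C++ extension (mesos_pb2 below just stands in for whatever
generated module gets imported):

    import os

    # Must be set before importing any generated *_pb2 modules;
    # otherwise protobuf silently falls back to pure Python.
    os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'cpp'

    import mesos_pb2  # now backed by the C++ implementation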

The resulting trace from perf (Linux system profiler) looks like this:

    33.43%           python  python2.7
    12.21%     mesos-master  libmesos-0.9.0.so
    11.11%     mesos-master  [kernel.kallsyms]
     9.12%           python  [kernel.kallsyms]
     8.37%     mesos-master  libc-2.15.so
     7.11%           python  libc-2.15.so
     5.26%           python  _mesos.so
     4.16%     mesos-master  libstdc++.so.6.0.16
     2.76%           python  libstdc++.so.6.0.16
     2.27%           python  _net_proto2___python.so
     0.92%           python  libpthread-2.15.so
     0.76%     mesos-master  libpthread-2.15.so

I'm surprised by how much time is spent in the Python process
(python2.7, 33.43%) given what I measured from resourceOffers and
statusUpdate.  Native Python profilers don't seem to be of any
assistance here (I tried cProfile and yappi).  I ran top with the H
option to view threads, and verified (with gettid()) that the majority
(~80%) of the time is spent in the scheduler driver thread.  The other
~20% was in a thread I couldn't identify, which leads me to believe
it's the libprocess event loop thread.
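
(For reference, there's no gettid() wrapper in the Python standard
library; one way to get at it is a ctypes syscall along these lines.
The syscall number is for x86-64 and is architecture-specific:)

    import ctypes

    libc = ctypes.CDLL(None, use_errno=True)
    SYS_gettid = 186  # x86-64 only; differs on other architectures

    def gettid():
        """Return the kernel thread id of the calling thread."""
        return libc.syscall(SYS_gettid)

    print gettid()  # compare against the TIDs shown by top -H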

The amount of time spent in the kernel is a bit disconcerting; I do
wonder whether the Mesos model simply requires too much communication
(for a given task, how many client<->master, client<->slave, and
master<->slave messages are there?).  Direct scheduling would allow
for just two messages per task, and could still use Mesos for
higher-level resource allocation and failure detection/handling.
Furthermore, direct scheduling would allow for distributed scheduling.

On Thu, Sep 20, 2012 at 12:43 PM, Benjamin Hindman
<[email protected]> wrote:
> Scott,
>
> This is amazingly great! We are committed to improving the performance of
> Mesos (and the underlying communication library, libprocess), and getting
> very detailed information like this is incredibly valuable. There are
> definitely a lot of low-hanging performance optimizations that we just
> haven't had a chance to get to yet.
>
> There is not a great document today that describes (a) the structural
> relationships between different components or (b) the protocol used between
> components (but it's pretty simple: HTTP with bodies that are just
> binary protocol buffers). I'll try to take it upon myself to deliver this
> document within the month. In the meantime, feel free to create a JIRA
> ticket about improving the performance of Mesos so that we can drill down
> to some real actionable code changes that can get committed.
>
> Thanks so much!
>
> Ben.
>
>
>
> On Thu, Sep 20, 2012 at 11:24 AM, Scott Smith <[email protected]> wrote:
>
>> I am working on a project using Dpark, a port of Spark to Python.  We
>> are running into issues where the overhead for scheduling a task is
>> too high.  On the m1.small nodes I'm using, it takes a minimum of
>> about 7ms to schedule and receive the completion for a scheduled
>> task.  I set up timers around the calls to resourceOffers and
>> statusUpdate, and I was able to get Dpark down to about 0.7-0.8ms per
>> task (of that, 0.4-0.5ms is taken constructing the protobuf and
>> calling launchTasks, so the actual Python CPU time is quite low), so
>> it would seem that the Mesos master and client library are taking the
>> other 6.2ms.  This seems excessive, and I'm guessing there's some
>> low-hanging fruit to optimize.
>>
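>> (For concreteness, timing hooks of this sort can be added with a thin
>> wrapper around the scheduler; TimedScheduler below is just
>> illustrative, assumes the mesos Python bindings' callback signatures,
>> and is not Dpark's actual code.)
>>
>>     import time
>>     import mesos  # the Mesos Python bindings
>>
>>     class TimedScheduler(mesos.Scheduler):
>>         """Forwards callbacks to the real scheduler, timing each one."""
>>
>>         def __init__(self, inner):
>>             self.inner = inner  # the real (Dpark) scheduler
>>
>>         def resourceOffers(self, driver, offers):
>>             start = time.time()
>>             self.inner.resourceOffers(driver, offers)
>>             print "resourceOffers: %.3f ms" % ((time.time() - start) * 1e3)
>>
>>         def statusUpdate(self, driver, update):
>>             start = time.time()
>>             self.inner.statusUpdate(driver, update)
>>             print "statusUpdate: %.3f ms" % ((time.time() - start) * 1e3)
>>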
>> I captured the output using strace, and found this:
>>
>> [pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444,
>> st_size=91, ...}) = 0
>> [pid 25361] gettimeofday({1348162571, 917171}, NULL) = 0
>> [pid 25361] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 135
>> [pid 25361] fcntl(135, F_GETFL)         = 0x2 (flags O_RDWR)
>> [pid 25361] fcntl(135, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>> [pid 25361] fcntl(135, F_SETFD, FD_CLOEXEC) = 0
>> [pid 25361] connect(135, {sa_family=AF_INET, sin_port=htons(55048),
>> sin_addr=inet_addr("10.248.101.245")}, 16) = -1 EINPROGRESS (Operation
>> now in progress)
>> [pid 25361] write(88, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
>> [pid 25362] <... epoll_wait resumed> {{EPOLLIN, {u32=88,
>> u64=4294967384}}}, 64, 573) = 1
>> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409706897}) = 0
>> [pid 25362] read(88, "\1\0\0\0\0\0\0\0", 8) = 8
>> [pid 25362] epoll_ctl(87, EPOLL_CTL_ADD, 135, {EPOLLOUT, {u32=135,
>> u64=519691042951}}) = 0
>> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409917492}) = 0
>> [pid 25362] epoll_wait(87,  <unfinished ...>
>> [pid 25361] <... write resumed> )       = 8
>> [pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444,
>> st_size=91, ...}) = 0
>> [pid 25361] gettimeofday({1348162571, 918516}, NULL) = 0
>> [pid 25361] gettimeofday({1348162571, 918635}, NULL) = 0
>> [pid 25361] gettimeofday({1348162571, 918731}, NULL) = 0
>> [pid 25362] <... epoll_wait resumed> {{EPOLLOUT, {u32=135,
>> u64=519691042951}}}, 64, 572) = 1
>> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 410856918}) = 0
>> [pid 25362] getsockopt(135, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
>> [pid 25362] epoll_ctl(87, EPOLL_CTL_MOD, 135, {EPOLLOUT, {u32=135,
>> u64=523986010247}}) = 0
>> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411067635}) = 0
>> [pid 25362] epoll_wait(87, {{EPOLLOUT, {u32=135, u64=523986010247}}},
>> 64, 571) = 1
>> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411206938}) = 0
>> [pid 25362] sendto(135, "POST /slave(1)/mesos.internal.St"..., 301,
>> MSG_NOSIGNAL, NULL, 0) = 301
>> [pid 25362] close(135)                  = 0
>> [pid 25362] shutdown(135, 0 /* receive */) = -1 EBADF (Bad file descriptor)
>>
>> From this there are several possible optimizations:
>> 1. Don't context switch so often.  Opening a socket in one thread
>> just to toss it to another thread is inefficient, as the kernel must
>> switch between threads, plus you need extra calls on the eventfd to
>> signal the other thread.
>> 2. Skip DNS resolution.  Unnecessary to stat resolv.conf so often.
>> 3. Don't call shutdown after close.  You don't really need to call it at
>> all.
>> 4. Do more than one thing with a connection.  As far as I can tell,
>> this is a communication from the client library to the slave; it
>> should be possible for those connections to be longer-lived.
>> 5. Don't use gettimeofday so often; it is unreliable since time can
>> move backwards.  Excessive system calls in general should be avoided,
>> so the 4 calls to gettimeofday and 5 calls to clock_gettime could be
>> eliminated, and it could rely on some infrequently updated and coarse
>> value stored in memory.
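>>
>> (To illustrate point 5 with a rough sketch in Python, not Mesos code:
>> a coarse clock can be cached in memory and refreshed by a background
>> thread.  The CoarseClock class here is hypothetical.)
>>
>>     import threading
>>     import time
>>
>>     class CoarseClock(object):
>>         """Caches the time so hot paths read a variable, not a syscall."""
>>
>>         def __init__(self, resolution=0.01):
>>             self.now = time.time()
>>             self._resolution = resolution
>>             t = threading.Thread(target=self._tick)
>>             t.daemon = True
>>             t.start()
>>
>>         def _tick(self):
>>             while True:
>>                 self.now = time.time()
>>                 time.sleep(self._resolution)
>>
>>     clock = CoarseClock()
>>     timestamp = clock.now  # hot path: no system call here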
>>
>> I'm not sure how much of this will address my performance issues,
>> but it seemed like a reasonable start.  I'm debating whether to
>> dive into the code myself, but the few times I took a look I quickly
>> got lost.  Is there a diagram of how the libraries/files fit
>> together, or maybe a protocol description (so I could just replace it
>> with a native Python library)?
>>
>> Or alternatively, are there plans to improve the performance of Mesos?
>>  I know the Spark authors have a mode to bypass Mesos in their next
>> release; I'm debating whether to follow suit or not.
>> --
>>         Scott
>>



-- 
        Scott
