Great stuff, Tom! I'll let benh answer your performance questions, but as for the protobuf issue, I've created a ticket <https://issues.apache.org/jira/browse/MESOS-283> to track this particular performance improvement. We should definitely be using PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp, and ideally we should also be using pre-generated C++ protobuf code.
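If you want to try this today without waiting for the build changes, a minimal sketch is to force the C++ backend from the framework's entry point, assuming the protobuf C++ extension for Python is installed (the api_implementation module path used for the sanity check may differ across protobuf versions):

    import os

    # Must run before the first import of google.protobuf anywhere in the
    # process; otherwise the pure-Python implementation has already been
    # selected.
    os.environ.setdefault('PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION', 'cpp')

    # Sanity check: report which backend actually got picked up.
    from google.protobuf.internal import api_implementation
    print('protobuf backend: ' + api_implementation.Type())  # expect 'cpp'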
P.S. Patches <https://reviews.apache.org/groups/mesos/> are always welcome and appreciated! :)

On Thu, Sep 20, 2012 at 5:23 PM, Scott Smith <[email protected]> wrote:

> btw I tried calling 'link' for each new slave in src/sched/sched.cpp.
> It got rid of the excess connections, but only helped ~5-10%.
>
> I then noticed that the proxy scheduler converts to serialized pb and
> then back via python in order to hand objects over the wall. I tried
> the CPP protobuf implementation for python, and that helped another 10%.
>
> The resulting trace from perf (the Linux system profiler) looks like this:
>
> 33.43%  python        python2.7
> 12.21%  mesos-master  libmesos-0.9.0.so
> 11.11%  mesos-master  [kernel.kallsyms]
>  9.12%  python        [kernel.kallsyms]
>  8.37%  mesos-master  libc-2.15.so
>  7.11%  python        libc-2.15.so
>  5.26%  python        _mesos.so
>  4.16%  mesos-master  libstdc++.so.6.0.16
>  2.76%  python        libstdc++.so.6.0.16
>  2.27%  python        _net_proto2___python.so
>  0.92%  python        libpthread-2.15.so
>  0.76%  mesos-master  libpthread-2.15.so
>
> I'm surprised how much time is spent in the python process (python2.7,
> 33.43%) given what I measured from resourceOffers and statusUpdate.
> Native python profilers don't seem to be of any assistance here (I
> tried cProfile and yappi). I ran top with the H option to view threads,
> and verified (with gettid()) that the majority (~80%) of the time is
> spent in the scheduler driver thread. The other ~20% was in a thread
> that I couldn't identify, which leads me to believe it's the libprocess
> event loop thread.
>
> The amount of time in the kernel is a bit disconcerting; I do wonder
> whether the Mesos model simply requires too much communication (for a
> given task, how many client<->master, client<->slave, and master<->slave
> messages are there?), while direct scheduling would allow for 2, and
> could then use Mesos for higher-level resource allocation tasks and
> failure detection/handling. Furthermore, direct scheduling would allow
> for distributed scheduling.
>
> On Thu, Sep 20, 2012 at 12:43 PM, Benjamin Hindman <[email protected]> wrote:
> > Scott,
> >
> > This is amazingly great! We are committed to improving the performance
> > of Mesos (and the underlying communication library, libprocess), and
> > getting very detailed information like this is incredibly valuable.
> > There are definitely a lot of low-hanging performance optimizations
> > that we just haven't had a chance to get to yet.
> >
> > There is not a great document today that describes (a) the structural
> > relationships between different components or (b) the protocol used
> > between components (but it's pretty simple: it's HTTP with bodies that
> > are just binary protocol buffers). I'll try and take it upon myself to
> > deliver this document within the month. In the meantime, feel free to
> > create a JIRA ticket about improving the performance of Mesos so that
> > we can drill down to some real actionable code changes that can get
> > committed.
> >
> > Thanks so much!
> >
> > Ben.
> >
> > On Thu, Sep 20, 2012 at 11:24 AM, Scott Smith <[email protected]> wrote:
> >
> >> I am working on a project using Dpark, a port of Spark to python. We
> >> are running into issues where the overhead for scheduling a task is
> >> too high. On the m1.small nodes I'm using, it takes a minimum of
> >> about 7ms to schedule and receive the completion for a scheduled task.
> >> I set up timers around the calls to resourceOffers and statusUpdate,
> >> and I was able to get Dpark down to about 0.7-0.8ms per task (of that,
> >> 0.4-0.5ms is spent constructing the protobuf and calling launchTasks,
> >> so the actual Python CPU time is quite low), so it would seem that the
> >> mesos master and client library are taking the other ~6.2ms. This
> >> seems excessive, and I'm guessing there's some low-hanging fruit to
> >> optimize.
> >>
> >> I captured the output using strace, and found this:
> >>
> >> [pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444, st_size=91, ...}) = 0
> >> [pid 25361] gettimeofday({1348162571, 917171}, NULL) = 0
> >> [pid 25361] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 135
> >> [pid 25361] fcntl(135, F_GETFL) = 0x2 (flags O_RDWR)
> >> [pid 25361] fcntl(135, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> >> [pid 25361] fcntl(135, F_SETFD, FD_CLOEXEC) = 0
> >> [pid 25361] connect(135, {sa_family=AF_INET, sin_port=htons(55048), sin_addr=inet_addr("10.248.101.245")}, 16) = -1 EINPROGRESS (Operation now in progress)
> >> [pid 25361] write(88, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
> >> [pid 25362] <... epoll_wait resumed> {{EPOLLIN, {u32=88, u64=4294967384}}}, 64, 573) = 1
> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409706897}) = 0
> >> [pid 25362] read(88, "\1\0\0\0\0\0\0\0", 8) = 8
> >> [pid 25362] epoll_ctl(87, EPOLL_CTL_ADD, 135, {EPOLLOUT, {u32=135, u64=519691042951}}) = 0
> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409917492}) = 0
> >> [pid 25362] epoll_wait(87, <unfinished ...>
> >> [pid 25361] <... write resumed> ) = 8
> >> [pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444, st_size=91, ...}) = 0
> >> [pid 25361] gettimeofday({1348162571, 918516}, NULL) = 0
> >> [pid 25361] gettimeofday({1348162571, 918635}, NULL) = 0
> >> [pid 25361] gettimeofday({1348162571, 918731}, NULL) = 0
> >> [pid 25362] <... epoll_wait resumed> {{EPOLLOUT, {u32=135, u64=519691042951}}}, 64, 572) = 1
> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 410856918}) = 0
> >> [pid 25362] getsockopt(135, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> >> [pid 25362] epoll_ctl(87, EPOLL_CTL_MOD, 135, {EPOLLOUT, {u32=135, u64=523986010247}}) = 0
> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411067635}) = 0
> >> [pid 25362] epoll_wait(87, {{EPOLLOUT, {u32=135, u64=523986010247}}}, 64, 571) = 1
> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411206938}) = 0
> >> [pid 25362] sendto(135, "POST /slave(1)/mesos.internal.St"..., 301, MSG_NOSIGNAL, NULL, 0) = 301
> >> [pid 25362] close(135) = 0
> >> [pid 25362] shutdown(135, 0 /* receive */) = -1 EBADF (Bad file descriptor)
> >>
> >> From this there are several possible optimizations:
> >> 1. Don't context switch so often. Opening a socket in one thread just
> >> to toss it to another thread is inefficient, as the kernel must switch
> >> between processes, plus you need extra calls for the eventfd to signal
> >> the other thread.
> >> 2. Skip DNS resolution. It's unnecessary to stat resolv.conf so often.
> >> 3. Don't call shutdown after close. You don't really need to call it
> >> at all.
> >> 4. Do more than one thing with a connection. As far as I can tell, this
> >> is a communication from the client library to the slave; it should be
> >> possible for those connections to be longer-lived.
> >> 5. Don't use gettimeofday so often; it is unreliable since time can
> >> move backwards.
> >> Excessive system calls in general should be avoided, so the 4 calls
> >> to gettimeofday and 5 calls to clock_gettime could be eliminated in
> >> favor of an infrequently updated, coarse value stored in memory.
> >>
> >> I'm not sure how much of this will address my performance issues, but
> >> it seemed like a reasonable start. I'm debating whether to dive into
> >> the code myself, but the few times I took a look I quickly got lost.
> >> Is there a diagram of how the libraries/files fit together, or maybe
> >> a protocol description (so I could just replace it with a native
> >> python library)?
> >>
> >> Or alternatively, are there plans to improve the performance of Mesos?
> >> I know the Spark authors have a mode to bypass Mesos in their next
> >> release; I'm debating whether to follow suit.
> >> --
> >> Scott
>
> --
> Scott
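To reproduce the per-callback timing and the thread attribution Scott describes above without depending on the exact binding API, one option is a small wrapper that logs each callback's wall-clock cost and the kernel thread it ran on. This is just a sketch: MyScheduler is a placeholder for whatever scheduler class the framework defines, and SYS_gettid = 186 assumes x86-64 Linux.

    import ctypes
    import functools
    import time

    _libc = ctypes.CDLL(None, use_errno=True)
    SYS_GETTID = 186  # x86-64 Linux; the syscall number differs on other architectures.

    def gettid():
        """Kernel thread id, comparable to what 'top -H' displays."""
        return _libc.syscall(SYS_GETTID)

    def timed(method):
        """Wrap a scheduler callback to log wall-clock time and thread id."""
        @functools.wraps(method)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return method(*args, **kwargs)
            finally:
                elapsed_ms = (time.time() - start) * 1000.0
                print('%s: %.3f ms on tid %d' % (method.__name__, elapsed_ms, gettid()))
        return wrapper

    # Hypothetical usage; wrap whichever callbacks you want to measure:
    # MyScheduler.resourceOffers = timed(MyScheduler.resourceOffers)
    # MyScheduler.statusUpdate = timed(MyScheduler.statusUpdate)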
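On item 4 (connection churn), the strace above shows a socket/connect/sendto/close cycle per message; the general fix is to keep one connection per peer and write every message to it. A toy Python sketch of the idea only, not how libprocess (which is C++) would actually implement it; the path and headers here are made up:

    import socket

    class Peer(object):
        """Holds one long-lived TCP connection to a peer and reuses it for
        every message, instead of connect/send/close per message."""

        def __init__(self, host, port):
            self.host = host
            self.sock = socket.create_connection((host, port))

        def send(self, path, body):
            # One-way HTTP POST framing, loosely in the spirit of what the
            # strace shows; no response is read.
            request = ('POST %s HTTP/1.1\r\n'
                       'Host: %s\r\n'
                       'Connection: keep-alive\r\n'
                       'Content-Length: %d\r\n'
                       '\r\n' % (path, self.host, len(body)))
            self.sock.sendall(request + body)  # Python 2 str, as used in this thread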
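And on item 5, the usual shape of the fix is a single ticker that refreshes a cached timestamp which hot paths merely read, trading a little precision for far fewer syscalls. Again only a toy Python sketch of the idea; the real change would live inside libprocess:

    import threading
    import time

    class CoarseClock(object):
        """Caches the current time and refreshes it on a fixed interval, so
        hot paths read a variable instead of making a syscall per lookup."""

        def __init__(self, resolution=0.010):  # refresh every 10 ms
            self._now = time.time()
            self._resolution = resolution
            ticker = threading.Thread(target=self._tick)
            ticker.daemon = True
            ticker.start()

        def _tick(self):
            while True:
                self._now = time.time()
                time.sleep(self._resolution)

        def now(self):
            # At most ~resolution seconds stale; no syscall on this path.
            return self._now

    clock = CoarseClock()
    print(clock.now())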
