And by Tom, I mean Scott :)

On Thu, Sep 20, 2012 at 6:31 PM, Benjamin Mahler <[email protected]> wrote:

> Great stuff Tom!
>
> I'll let benh answer your performance questions, but as for the protobuf
> issue, I've created a ticket <https://issues.apache.org/jira/browse/MESOS-283>
> to track this particular performance improvement. We should definitely be
> using PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp and ideally we should
> also be using pre-generated cpp protobuf code.
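>
> For example, a minimal sketch of selecting the C++ implementation; the
> environment variable has to be set before the first protobuf import
> anywhere in the process, or the pure-Python backend gets loaded:
>
>     import os
>     os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'cpp'
>
>     # Any generated module imported from here on uses the C++ backend.
>     from google.protobuf.internal import api_implementation
>     assert api_implementation.Type() == 'cpp'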
>
> P.S. patches <https://reviews.apache.org/groups/mesos/> are always
> welcome and appreciated! :)
>
> On Thu, Sep 20, 2012 at 5:23 PM, Scott Smith <[email protected]> wrote:
>
>> btw I tried calling 'link' for each new slave in src/sched/sched.cpp.
>> It got rid of the excess connections, but only helped ~5-10%.
>>
>> I then noticed that the proxy scheduler converts to serialized protobuf
>> and then back via Python in order to hand objects over the wall.  I
>> tried the C++ protobuf implementation for Python, and that helped
>> another 10%.
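>>
>> (To put a number on that round trip, a micro-benchmark along these
>> lines works; it assumes the generated mesos_pb2 module is importable:)
>>
>>     # Time the serialize -> parse round trip the proxy scheduler pays
>>     # for each object handed across the Python/C++ boundary.
>>     import timeit
>>     import mesos_pb2
>>
>>     offer = mesos_pb2.Offer()
>>     offer.id.value = 'offer-1'         # fill the required fields so
>>     offer.framework_id.value = 'fw-1'  # SerializeToString() doesn't
>>     offer.slave_id.value = 'slave-1'   # raise EncodeError
>>     offer.hostname = 'host-1'
>>     wire = offer.SerializeToString()
>>
>>     def round_trip():
>>         mesos_pb2.Offer().ParseFromString(wire)
>>
>>     print(timeit.timeit(round_trip, number=10000))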
>>
>> The resulting trace from perf (Linux system profiler) looks like this:
>>
>>     33.43%           python  python2.7
>>     12.21%     mesos-master  libmesos-0.9.0.so
>>     11.11%     mesos-master  [kernel.kallsyms]
>>      9.12%           python  [kernel.kallsyms]
>>      8.37%     mesos-master  libc-2.15.so
>>      7.11%           python  libc-2.15.so
>>      5.26%           python  _mesos.so
>>      4.16%     mesos-master  libstdc++.so.6.0.16
>>      2.76%           python  libstdc++.so.6.0.16
>>      2.27%           python  _net_proto2___python.so
>>      0.92%           python  libpthread-2.15.so
>>      0.76%     mesos-master  libpthread-2.15.so
>>
>> I'm surprised how much time is spent in the Python process (python2.7
>> - 33.43%) given what I measured from resourceOffers and statusUpdate.
>> Native Python profilers don't seem to be of any assistance here (I
>> tried cProfile and yappi).  I ran top with the H option to view
>> threads, and verified (with gettid()) that the majority (~80%) of the
>> time is spent in the scheduler driver thread.  The other ~20% was in a
>> thread I couldn't identify, which leads me to believe it's the
>> libprocess event loop thread.
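>>
>> (For reference, a minimal sketch of that gettid() check; glibc has no
>> wrapper for it, and the syscall number below is x86-64 only:)
>>
>>     import ctypes
>>     import threading
>>
>>     SYS_gettid = 186  # x86-64; differs on other architectures
>>     libc = ctypes.CDLL(None)
>>
>>     def gettid():
>>         return libc.syscall(SYS_gettid)
>>
>>     # Log this from each thread, then match the tids against top -H.
>>     print('%s -> tid %d' % (threading.current_thread().name, gettid()))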
>>
>> The amount of time in the kernel is a bit disconcerting; I do wonder
>> if the Mesos model simply requires too much communication (for a given
>> task, how many client<->master, client<->slave, and master<->slave
>> messages are there?), whereas direct scheduling would allow for just
>> two messages per task; Mesos could then be used for higher-level
>> resource allocation and failure detection/handling.  Furthermore,
>> direct scheduling would allow for distributed scheduling.
>>
>> On Thu, Sep 20, 2012 at 12:43 PM, Benjamin Hindman
>> <[email protected]> wrote:
>> > Scott,
>> >
>> > This is amazingly great! We are committed to improving the performance
>> > of Mesos (and the underlying communication library, libprocess), and
>> > getting very detailed information like this is incredibly valuable.
>> > There are definitely a lot of low-hanging performance optimizations
>> > that we just haven't had a chance to get to yet.
>> >
>> > There is not a great document today that describes (a) the structural
>> > relationships between different components or (b) the protocol used
>> > between components (but it's pretty simple: it's HTTP with bodies that
>> > are just binary protocol buffers). I'll try and take it upon myself to
>> > deliver this document within the month. In the meantime, feel free to
>> > create a JIRA ticket about improving the performance of Mesos so that
>> > we can drill down to some real actionable code changes that can get
>> > committed.
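>> >
>> > For the curious, here is a minimal sketch of what that looks like on
>> > the wire, pieced together from the strace output below; the path and
>> > header conventions are inferred, not documented, so treat them as
>> > assumptions:
>> >
>> >     # A libprocess "message" is an HTTP POST whose body is a serialized
>> >     # protobuf.  The path appears to be /<process-id>/<MessageName>,
>> >     # and the sender seems to identify itself via a User-Agent of the
>> >     # form libprocess/<name>@<ip>:<port> (assumption).  No HTTP
>> >     # response comes back, so don't wait for one.
>> >     import httplib  # Python 2
>> >     import mesos_pb2
>> >
>> >     msg = mesos_pb2.FrameworkID()  # stand-in; real traffic uses the
>> >     msg.value = 'my-framework'     # messages from messages.proto
>> >
>> >     conn = httplib.HTTPConnection('10.248.101.245', 55048)
>> >     conn.request('POST', '/slave(1)/mesos.internal.SomeMessage',
>> >                  msg.SerializeToString(),
>> >                  {'User-Agent': 'libprocess/scheduler(1)@10.0.0.1:12345'})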
>> >
>> > Thanks so much!
>> >
>> > Ben.
>> >
>> >
>> >
>> > On Thu, Sep 20, 2012 at 11:24 AM, Scott Smith <[email protected]> wrote:
>> >
>> >> I am working on a project using Dpark, a port of Spark to Python.  We
>> >> are running into issues where the overhead for scheduling a task is
>> >> too high.  On the m1.small nodes I'm using, it takes a minimum of
>> >> about 7ms to schedule and receive the completion for a scheduled task.
>> >> I set up timers around the calls to resourceOffers and statusUpdate,
>> >> and I was able to get Dpark down to about 0.7-0.8ms per task (of that,
>> >> 0.4-0.5ms is taken constructing the protobuf and calling launchTasks,
>> >> so the actual Python CPU time is quite low), so it would seem that the
>> >> Mesos master and client library are taking the other 6.2ms.  This
>> >> seems excessive, and I'm guessing there's some low-hanging fruit to
>> >> optimize.
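>> >>
>> >> (The timing harness was roughly along these lines; the names are
>> >> illustrative, not Dpark's actual code:)
>> >>
>> >>     # Wrap a scheduler callback and accumulate wall-clock time per call.
>> >>     import functools
>> >>     import time
>> >>
>> >>     def timed(fn):
>> >>         @functools.wraps(fn)
>> >>         def wrapper(*args, **kwargs):
>> >>             start = time.time()
>> >>             try:
>> >>                 return fn(*args, **kwargs)
>> >>             finally:
>> >>                 wrapper.total += time.time() - start
>> >>                 wrapper.calls += 1
>> >>         wrapper.total = 0.0
>> >>         wrapper.calls = 0
>> >>         return wrapper
>> >>
>> >>     # e.g. scheduler.resourceOffers = timed(scheduler.resourceOffers)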
>> >>
>> >> I captured the output using strace, and found this:
>> >>
>> >> [pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444,
>> >> st_size=91, ...}) = 0
>> >> [pid 25361] gettimeofday({1348162571, 917171}, NULL) = 0
>> >> [pid 25361] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 135
>> >> [pid 25361] fcntl(135, F_GETFL)         = 0x2 (flags O_RDWR)
>> >> [pid 25361] fcntl(135, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>> >> [pid 25361] fcntl(135, F_SETFD, FD_CLOEXEC) = 0
>> >> [pid 25361] connect(135, {sa_family=AF_INET, sin_port=htons(55048),
>> >> sin_addr=inet_addr("10.248.101.245")}, 16) = -1 EINPROGRESS (Operation
>> >> now in progress)
>> >> [pid 25361] write(88, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
>> >> [pid 25362] <... epoll_wait resumed> {{EPOLLIN, {u32=88,
>> >> u64=4294967384}}}, 64, 573) = 1
>> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409706897}) = 0
>> >> [pid 25362] read(88, "\1\0\0\0\0\0\0\0", 8) = 8
>> >> [pid 25362] epoll_ctl(87, EPOLL_CTL_ADD, 135, {EPOLLOUT, {u32=135,
>> >> u64=519691042951}}) = 0
>> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409917492}) = 0
>> >> [pid 25362] epoll_wait(87,  <unfinished ...>
>> >> [pid 25361] <... write resumed> )       = 8
>> >> [pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444,
>> >> st_size=91, ...}) = 0
>> >> [pid 25361] gettimeofday({1348162571, 918516}, NULL) = 0
>> >> [pid 25361] gettimeofday({1348162571, 918635}, NULL) = 0
>> >> [pid 25361] gettimeofday({1348162571, 918731}, NULL) = 0
>> >> [pid 25362] <... epoll_wait resumed> {{EPOLLOUT, {u32=135,
>> >> u64=519691042951}}}, 64, 572) = 1
>> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 410856918}) = 0
>> >> [pid 25362] getsockopt(135, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
>> >> [pid 25362] epoll_ctl(87, EPOLL_CTL_MOD, 135, {EPOLLOUT, {u32=135,
>> >> u64=523986010247}}) = 0
>> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411067635}) = 0
>> >> [pid 25362] epoll_wait(87, {{EPOLLOUT, {u32=135, u64=523986010247}}},
>> >> 64, 571) = 1
>> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411206938}) = 0
>> >> [pid 25362] sendto(135, "POST /slave(1)/mesos.internal.St"..., 301,
>> >> MSG_NOSIGNAL, NULL, 0) = 301
>> >> [pid 25362] close(135)                  = 0
>> >> [pid 25362] shutdown(135, 0 /* receive */) = -1 EBADF (Bad file descriptor)
>> >>
>> >> From this there are several possible optimizations:
>> >> 1. Don't context-switch so often.  Opening a socket in one thread just
>> >> to toss it to another thread is inefficient, as the kernel must switch
>> >> between threads, plus you need extra calls for the eventfd to signal
>> >> the other thread.
>> >> 2. Skip DNS resolution.  It's unnecessary to stat resolv.conf so often.
>> >> 3. Don't call shutdown after close.  You don't really need to call it
>> >> at all.
>> >> 4. Do more than one thing with a connection.  As far as I can tell,
>> >> this is communication from the client library to the slave; it should
>> >> be possible for those connections to be longer-lived.
>> >> 5. Don't use gettimeofday so often; it is unreliable since time can
>> >> move backwards.  Excessive system calls in general should be avoided,
>> >> so the 4 calls to gettimeofday and 5 calls to clock_gettime could be
>> >> eliminated, relying instead on some infrequently updated, coarse
>> >> value stored in memory (see the sketch after this list).
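>> >>
>> >> (A generic sketch of item 5, in Python for brevity even though the
>> >> real fix belongs in libprocess's C++: a background thread refreshes
>> >> a coarse timestamp so hot paths read memory instead of a syscall.)
>> >>
>> >>     import threading
>> >>     import time
>> >>
>> >>     class CoarseClock(object):
>> >>         """Timestamp refreshed every `resolution` seconds; reading
>> >>         .now costs an attribute load rather than a system call."""
>> >>
>> >>         def __init__(self, resolution=0.01):
>> >>             self.now = time.time()
>> >>             t = threading.Thread(target=self._tick, args=(resolution,))
>> >>             t.daemon = True
>> >>             t.start()
>> >>
>> >>         def _tick(self, resolution):
>> >>             while True:
>> >>                 self.now = time.time()  # atomic store in CPython
>> >>                 time.sleep(resolution)
>> >>
>> >>     clock = CoarseClock()
>> >>     # Hot path: read clock.now instead of calling gettimeofday().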
>> >>
>> >> I'm not sure how much of this will address my performance issues,
>> >> but it seemed like a reasonable start.  I'm debating whether to
>> >> dive into the code myself, but the few times I took a look I quickly
>> >> got lost.  Is there a diagram of how the libraries/files fit together,
>> >> or maybe a protocol description (so I could just replace it with a
>> >> native Python library)?
>> >>
>> >> Or alternatively, are there plans to improve the performance of Mesos?
>> >>  I know the Spark authors have a mode to bypass Mesos in their next
>> >> release; I'm debating whether to follow suit or not.
>> >> --
>> >>         Scott
>> >>
>>
>>
>>
>> --
>>         Scott
>>
>
>
