Great stuff, Tom! I'll let benh answer your performance questions, but as for the protobuf issue, I've created a ticket <https://issues.apache.org/jira/browse/MESOS-283> to track this particular performance improvement. We should definitely be using PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp, and ideally we should also be using pre-generated C++ protobuf code.
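If you want to try this today without waiting for the build changes, a minimal sketch is to force the C++ backend from the framework's entry point, assuming the protobuf C++ extension for Python is installed (the api_implementation module path used for the sanity check may differ across protobuf versions):

    import os

    # Must run before the first import of google.protobuf anywhere in the
    # process; otherwise the pure-Python implementation has already been
    # selected.
    os.environ.setdefault('PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION', 'cpp')

    # Sanity check: report which backend actually got picked up.
    from google.protobuf.internal import api_implementation
    print('protobuf backend: ' + api_implementation.Type())  # expect 'cpp'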
P.S. Patches <https://reviews.apache.org/groups/mesos/> are always welcome and appreciated! :)

On Thu, Sep 20, 2012 at 5:23 PM, Scott Smith <[email protected]> wrote:

> btw I tried calling 'link' for each new slave in src/sched/sched.cpp.
> It got rid of the excess connections, but only helped ~5-10%.
>
> I then noticed that the proxy scheduler converts to serialized pb and
> then back via python in order to hand objects over the wall. I tried
> the CPP protobuf implementation for python, and that helped another 10%.
>
> The resulting trace from perf (the Linux system profiler) looks like this:
>
> 33.43%  python        python2.7
> 12.21%  mesos-master  libmesos-0.9.0.so
> 11.11%  mesos-master  [kernel.kallsyms]
>  9.12%  python        [kernel.kallsyms]
>  8.37%  mesos-master  libc-2.15.so
>  7.11%  python        libc-2.15.so
>  5.26%  python        _mesos.so
>  4.16%  mesos-master  libstdc++.so.6.0.16
>  2.76%  python        libstdc++.so.6.0.16
>  2.27%  python        _net_proto2___python.so
>  0.92%  python        libpthread-2.15.so
>  0.76%  mesos-master  libpthread-2.15.so
>
> I'm surprised how much time is spent in the python process (python2.7,
> 33.43%) given what I measured from resourceOffers and statusUpdate.
> Native python profilers don't seem to be of any assistance here (I
> tried cProfile and yappi). I ran top with the H option to view threads,
> and verified (with gettid()) that the majority (~80%) of the time is
> spent in the scheduler driver thread. The other ~20% was in a thread
> that I couldn't identify, which leads me to believe it's the libprocess
> event loop thread.
>
> The amount of time in the kernel is a bit disconcerting; I do wonder
> whether the Mesos model simply requires too much communication (for a
> given task, how many client<->master, client<->slave, and master<->slave
> messages are there?), while direct scheduling would allow for 2, and
> could then use Mesos for higher-level resource allocation tasks and
> failure detection/handling. Furthermore, direct scheduling would allow
> for distributed scheduling.
>
> On Thu, Sep 20, 2012 at 12:43 PM, Benjamin Hindman <[email protected]> wrote:
> > Scott,
> >
> > This is amazingly great! We are committed to improving the performance
> > of Mesos (and the underlying communication library, libprocess), and
> > getting very detailed information like this is incredibly valuable.
> > There are definitely a lot of low-hanging performance optimizations
> > that we just haven't had a chance to get to yet.
> >
> > There is not a great document today that describes (a) the structural
> > relationships between different components or (b) the protocol used
> > between components (but it's pretty simple: it's HTTP with bodies that
> > are just binary protocol buffers). I'll try and take it upon myself to
> > deliver this document within the month. In the meantime, feel free to
> > create a JIRA ticket about improving the performance of Mesos so that
> > we can drill down to some real actionable code changes that can get
> > committed.
> >
> > Thanks so much!
> >
> > Ben.
> >
> > On Thu, Sep 20, 2012 at 11:24 AM, Scott Smith <[email protected]> wrote:
> >
> >> I am working on a project using Dpark, a port of Spark to python. We
> >> are running into issues where the overhead for scheduling a task is
> >> too high. On the m1.small nodes I'm using, it takes a minimum of
> >> about 7ms to schedule and receive the completion for a scheduled task.
> >> I set up timers around the calls to resourceOffers and statusUpdate,
> >> and I was able to get Dpark down to about 0.7-0.8ms per task (of that,
> >> 0.4-0.5ms is spent constructing the protobuf and calling launchTasks,
> >> so the actual Python CPU time is quite low), so it would seem that the
> >> mesos master and client library are taking the other ~6.2ms. This
> >> seems excessive, and I'm guessing there's some low-hanging fruit to
> >> optimize.
> >>
> >> I captured the output using strace, and found this:
> >>
> >> [pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444, st_size=91, ...}) = 0
> >> [pid 25361] gettimeofday({1348162571, 917171}, NULL) = 0
> >> [pid 25361] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 135
> >> [pid 25361] fcntl(135, F_GETFL) = 0x2 (flags O_RDWR)
> >> [pid 25361] fcntl(135, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> >> [pid 25361] fcntl(135, F_SETFD, FD_CLOEXEC) = 0
> >> [pid 25361] connect(135, {sa_family=AF_INET, sin_port=htons(55048), sin_addr=inet_addr("10.248.101.245")}, 16) = -1 EINPROGRESS (Operation now in progress)
> >> [pid 25361] write(88, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
> >> [pid 25362] <... epoll_wait resumed> {{EPOLLIN, {u32=88, u64=4294967384}}}, 64, 573) = 1
> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409706897}) = 0
> >> [pid 25362] read(88, "\1\0\0\0\0\0\0\0", 8) = 8
> >> [pid 25362] epoll_ctl(87, EPOLL_CTL_ADD, 135, {EPOLLOUT, {u32=135, u64=519691042951}}) = 0
> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409917492}) = 0
> >> [pid 25362] epoll_wait(87, <unfinished ...>
> >> [pid 25361] <... write resumed> ) = 8
> >> [pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444, st_size=91, ...}) = 0
> >> [pid 25361] gettimeofday({1348162571, 918516}, NULL) = 0
> >> [pid 25361] gettimeofday({1348162571, 918635}, NULL) = 0
> >> [pid 25361] gettimeofday({1348162571, 918731}, NULL) = 0
> >> [pid 25362] <... epoll_wait resumed> {{EPOLLOUT, {u32=135, u64=519691042951}}}, 64, 572) = 1
> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 410856918}) = 0
> >> [pid 25362] getsockopt(135, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> >> [pid 25362] epoll_ctl(87, EPOLL_CTL_MOD, 135, {EPOLLOUT, {u32=135, u64=523986010247}}) = 0
> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411067635}) = 0
> >> [pid 25362] epoll_wait(87, {{EPOLLOUT, {u32=135, u64=523986010247}}}, 64, 571) = 1
> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411206938}) = 0
> >> [pid 25362] sendto(135, "POST /slave(1)/mesos.internal.St"..., 301, MSG_NOSIGNAL, NULL, 0) = 301
> >> [pid 25362] close(135) = 0
> >> [pid 25362] shutdown(135, 0 /* receive */) = -1 EBADF (Bad file descriptor)
> >>
> >> From this there are several possible optimizations:
> >> 1. Don't context switch so often. Opening a socket in one thread just
> >> to toss it to another thread is inefficient, as the kernel must switch
> >> between processes, plus you need extra calls for the eventfd to signal
> >> the other thread.
> >> 2. Skip DNS resolution. It's unnecessary to stat resolv.conf so often.
> >> 3. Don't call shutdown after close. You don't really need to call it
> >> at all.
> >> 4. Do more than one thing with a connection. As far as I can tell, this
> >> is a communication from the client library to the slave; it should be
> >> possible for those connections to be longer-lived.
> >> 5. Don't use gettimeofday so often; it is unreliable since time can
> >> move backwards.
> >> Excessive system calls in general should be avoided, so the 4 calls
> >> to gettimeofday and 5 calls to clock_gettime could be eliminated in
> >> favor of an infrequently updated, coarse value stored in memory.
> >>
> >> I'm not sure how much of this will address my performance issues, but
> >> it seemed like a reasonable start. I'm debating whether to dive into
> >> the code myself, but the few times I took a look I quickly got lost.
> >> Is there a diagram of how the libraries/files fit together, or maybe
> >> a protocol description (so I could just replace it with a native
> >> python library)?
> >>
> >> Or alternatively, are there plans to improve the performance of Mesos?
> >> I know the Spark authors have a mode to bypass Mesos in their next
> >> release; I'm debating whether to follow suit.
> >> --
> >> Scott
>
> --
> Scott
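To reproduce the per-callback timing and the thread attribution Scott describes above without depending on the exact binding API, one option is a small wrapper that logs each callback's wall-clock cost and the kernel thread it ran on. This is just a sketch: MyScheduler is a placeholder for whatever scheduler class the framework defines, and SYS_gettid = 186 assumes x86-64 Linux.

    import ctypes
    import functools
    import time

    _libc = ctypes.CDLL(None, use_errno=True)
    SYS_GETTID = 186  # x86-64 Linux; the syscall number differs on other architectures.

    def gettid():
        """Kernel thread id, comparable to what 'top -H' displays."""
        return _libc.syscall(SYS_GETTID)

    def timed(method):
        """Wrap a scheduler callback to log wall-clock time and thread id."""
        @functools.wraps(method)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return method(*args, **kwargs)
            finally:
                elapsed_ms = (time.time() - start) * 1000.0
                print('%s: %.3f ms on tid %d' % (method.__name__, elapsed_ms, gettid()))
        return wrapper

    # Hypothetical usage; wrap whichever callbacks you want to measure:
    # MyScheduler.resourceOffers = timed(MyScheduler.resourceOffers)
    # MyScheduler.statusUpdate = timed(MyScheduler.statusUpdate)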
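On item 4 (connection churn), the strace above shows a socket/connect/sendto/close cycle per message; the general fix is to keep one connection per peer and write every message to it. A toy Python sketch of the idea only, not how libprocess (which is C++) would actually implement it; the path and headers here are made up:

    import socket

    class Peer(object):
        """Holds one long-lived TCP connection to a peer and reuses it for
        every message, instead of connect/send/close per message."""

        def __init__(self, host, port):
            self.host = host
            self.sock = socket.create_connection((host, port))

        def send(self, path, body):
            # One-way HTTP POST framing, loosely in the spirit of what the
            # strace shows; no response is read.
            request = ('POST %s HTTP/1.1\r\n'
                       'Host: %s\r\n'
                       'Connection: keep-alive\r\n'
                       'Content-Length: %d\r\n'
                       '\r\n' % (path, self.host, len(body)))
            self.sock.sendall(request + body)  # Python 2 str, as used in this thread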
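And on item 5, the usual shape of the fix is a single ticker that refreshes a cached timestamp which hot paths merely read, trading a little precision for far fewer syscalls. Again only a toy Python sketch of the idea; the real change would live inside libprocess:

    import threading
    import time

    class CoarseClock(object):
        """Caches the current time and refreshes it on a fixed interval, so
        hot paths read a variable instead of making a syscall per lookup."""

        def __init__(self, resolution=0.010):  # refresh every 10 ms
            self._now = time.time()
            self._resolution = resolution
            ticker = threading.Thread(target=self._tick)
            ticker.daemon = True
            ticker.start()

        def _tick(self):
            while True:
                self._now = time.time()
                time.sleep(self._resolution)

        def now(self):
            # At most ~resolution seconds stale; no syscall on this path.
            return self._now

    clock = CoarseClock()
    print(clock.now())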
