And by Tom, I mean Scott :)

On Thu, Sep 20, 2012 at 6:31 PM, Benjamin Mahler <[email protected]> wrote:
> Great stuff Tom!
>
> I'll let benh answer your performance questions, but as for the protobuf
> issue, I've created a ticket <https://issues.apache.org/jira/browse/MESOS-283>
> to track this particular performance improvement. We should definitely be
> using PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp, and ideally we should
> also be using pre-generated cpp protobuf code.
>
> P.S. patches <https://reviews.apache.org/groups/mesos/> are always
> welcome and appreciated! :)
>
> On Thu, Sep 20, 2012 at 5:23 PM, Scott Smith <[email protected]> wrote:
>
>> btw I tried calling 'link' for each new slave in src/sched/sched.cpp.
>> It got rid of the excess connections, but only helped ~5-10%.
>>
>> I then noticed that the proxy scheduler converts to serialized pb and
>> then back via python in order to hand objects over the wall. I tried
>> the CPP protobuf implementation for python, and that helped another
>> 10%.
>>
>> The resulting trace from perf (Linux system profiler) looks like this:
>>
>>  33.43%  python        python2.7
>>  12.21%  mesos-master  libmesos-0.9.0.so
>>  11.11%  mesos-master  [kernel.kallsyms]
>>   9.12%  python        [kernel.kallsyms]
>>   8.37%  mesos-master  libc-2.15.so
>>   7.11%  python        libc-2.15.so
>>   5.26%  python        _mesos.so
>>   4.16%  mesos-master  libstdc++.so.6.0.16
>>   2.76%  python        libstdc++.so.6.0.16
>>   2.27%  python        _net_proto2___python.so
>>   0.92%  python        libpthread-2.15.so
>>   0.76%  mesos-master  libpthread-2.15.so
>>
>> I'm surprised how much time is spent in the python process (python2.7
>> at 33.43%) given what I measured from resourceOffers and statusUpdate.
>> Native python profilers don't seem to be of any assistance here (I
>> tried cProfile and yappi). I ran top and used the H option to view
>> threads, and verified (with gettid()) that the majority (~80%) of the
>> time is spent in the scheduler driver thread. The other ~20% was in a
>> thread that I couldn't identify, which leads me to believe it's the
>> libprocess event loop thread.
>>
>> The amount of time in the kernel is a bit disconcerting; I do wonder
>> whether the Mesos model simply requires too much communication (for a
>> given task, how many client<->master, client<->slave, and master<->slave
>> messages are there?), whereas direct scheduling would need only 2 and
>> could still use Mesos for higher-level resource allocation and failure
>> detection/handling. Furthermore, direct scheduling would allow for
>> distributed scheduling.
>>
>> On Thu, Sep 20, 2012 at 12:43 PM, Benjamin Hindman <[email protected]> wrote:
>> > Scott,
>> >
>> > This is amazingly great! We are committed to improving the performance
>> > of Mesos (and the underlying communication library, libprocess), and
>> > getting very detailed information like this is incredibly valuable.
>> > There are definitely a lot of low-hanging performance optimizations
>> > that we just haven't had a chance to get to yet.
>> >
>> > There is not a great document today that describes (a) the structural
>> > relationships between different components or (b) the protocol used
>> > between components (but it's pretty simple: it's HTTP with bodies that
>> > are just binary protocol buffers). I'll try and take it upon myself to
>> > deliver this document within the month. In the meantime, feel free to
>> > create a JIRA ticket about improving the performance of Mesos so that
>> > we can drill down to some real actionable code changes that can get
>> > committed.
>> >
>> > Thanks so much!
>> >
>> > Ben.
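A quick illustration of the PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp switch discussed above, as a minimal sketch rather than anything from the Mesos bindings: the variable has to be set before any generated protobuf modules are imported, and the protobuf package's api_implementation module reports which backend actually loaded.

    # Minimal sketch: opt in to the C++ protobuf backend for Python and
    # check which implementation is actually in use. Set the variable
    # before importing any generated protobuf modules.
    import os
    os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "cpp"

    from google.protobuf.internal import api_implementation
    # Prints "cpp" when the C++ extension is available, otherwise "python".
    print(api_implementation.Type())

If this prints "python", serialization and parsing are still happening in pure Python and the speedup described above would not apply.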
>> >
>> > On Thu, Sep 20, 2012 at 11:24 AM, Scott Smith <[email protected]> wrote:
>> >
>> >> I am working on a project using Dpark, a port of Spark to python. We
>> >> are running into issues where the overhead for scheduling a task is
>> >> too high. On the m1.small nodes I'm using, it takes a minimum of
>> >> about 7ms to schedule and receive the completion for a scheduled task.
>> >> I set up timers around the calls to resourceOffers and statusUpdate,
>> >> and I was able to get Dpark down to about 0.7-0.8ms per task (of that,
>> >> 0.4-0.5ms is taken constructing the protobuf and calling launchTasks,
>> >> so the actual Python CPU time is quite low), so it would seem that the
>> >> mesos master and client library are taking the other ~6.2ms. This
>> >> seems excessive, and I'm guessing there's some low-hanging fruit to
>> >> optimize.
>> >>
>> >> I captured the output using strace, and found this:
>> >>
>> >> [pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444, st_size=91, ...}) = 0
>> >> [pid 25361] gettimeofday({1348162571, 917171}, NULL) = 0
>> >> [pid 25361] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 135
>> >> [pid 25361] fcntl(135, F_GETFL) = 0x2 (flags O_RDWR)
>> >> [pid 25361] fcntl(135, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>> >> [pid 25361] fcntl(135, F_SETFD, FD_CLOEXEC) = 0
>> >> [pid 25361] connect(135, {sa_family=AF_INET, sin_port=htons(55048), sin_addr=inet_addr("10.248.101.245")}, 16) = -1 EINPROGRESS (Operation now in progress)
>> >> [pid 25361] write(88, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
>> >> [pid 25362] <... epoll_wait resumed> {{EPOLLIN, {u32=88, u64=4294967384}}}, 64, 573) = 1
>> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409706897}) = 0
>> >> [pid 25362] read(88, "\1\0\0\0\0\0\0\0", 8) = 8
>> >> [pid 25362] epoll_ctl(87, EPOLL_CTL_ADD, 135, {EPOLLOUT, {u32=135, u64=519691042951}}) = 0
>> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409917492}) = 0
>> >> [pid 25362] epoll_wait(87, <unfinished ...>
>> >> [pid 25361] <... write resumed> ) = 8
>> >> [pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444, st_size=91, ...}) = 0
>> >> [pid 25361] gettimeofday({1348162571, 918516}, NULL) = 0
>> >> [pid 25361] gettimeofday({1348162571, 918635}, NULL) = 0
>> >> [pid 25361] gettimeofday({1348162571, 918731}, NULL) = 0
>> >> [pid 25362] <... epoll_wait resumed> {{EPOLLOUT, {u32=135, u64=519691042951}}}, 64, 572) = 1
>> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 410856918}) = 0
>> >> [pid 25362] getsockopt(135, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
>> >> [pid 25362] epoll_ctl(87, EPOLL_CTL_MOD, 135, {EPOLLOUT, {u32=135, u64=523986010247}}) = 0
>> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411067635}) = 0
>> >> [pid 25362] epoll_wait(87, {{EPOLLOUT, {u32=135, u64=523986010247}}}, 64, 571) = 1
>> >> [pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411206938}) = 0
>> >> [pid 25362] sendto(135, "POST /slave(1)/mesos.internal.St"..., 301, MSG_NOSIGNAL, NULL, 0) = 301
>> >> [pid 25362] close(135) = 0
>> >> [pid 25362] shutdown(135, 0 /* receive */) = -1 EBADF (Bad file descriptor)
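To put a rough number on the connect/send/close pattern visible in the trace above (and on the connection-reuse point in the list that follows), here is a purely illustrative sketch, not Mesos or libprocess code: it compares opening a fresh TCP connection per message against sending the same messages over one long-lived connection on loopback. The port, payload size (301 bytes, matching the POST above), and message count are arbitrary assumptions.

    # Illustrative only: one-connection-per-message vs. a persistent
    # connection. The local sink server merely stands in for a slave
    # endpoint; all constants are assumptions for the sketch.
    import socket
    import threading
    import time

    HOST, PORT = "127.0.0.1", 55048   # arbitrary loopback port
    MESSAGES = 200
    PAYLOAD = b"X" * 301              # roughly the size of the POST above

    def drain(conn):
        # Read until the peer closes, then close our side.
        while conn.recv(4096):
            pass
        conn.close()

    def sink():
        # Accept connections and hand each one to a drain thread.
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(128)
        while True:
            conn, _ = srv.accept()
            t = threading.Thread(target=drain, args=(conn,))
            t.daemon = True
            t.start()

    def per_message():
        # New socket, connect, send, close for every message, as in the trace.
        start = time.time()
        for _ in range(MESSAGES):
            s = socket.create_connection((HOST, PORT))
            s.sendall(PAYLOAD)
            s.close()
        return time.time() - start

    def persistent():
        # One long-lived connection carrying all of the messages.
        start = time.time()
        s = socket.create_connection((HOST, PORT))
        for _ in range(MESSAGES):
            s.sendall(PAYLOAD)
        s.close()
        return time.time() - start

    server = threading.Thread(target=sink)
    server.daemon = True
    server.start()
    time.sleep(0.2)                   # give the sink a moment to bind

    print("per-message connections: %.3fs" % per_message())
    print("persistent connection:   %.3fs" % persistent())

Even on loopback the per-connection setup and teardown usually dominate the send itself; over a real network the gap grows, which is what makes longer-lived client<->slave connections attractive.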
>> >>
>> >> From this trace there are several possible optimizations:
>> >>
>> >> 1. Don't context switch so often. Opening a socket in one thread just
>> >> to toss it to another thread is inefficient, as the kernel must switch
>> >> between processes, plus you need extra calls for the eventfd to signal
>> >> the other thread.
>> >> 2. Skip DNS resolution. It is unnecessary to stat /etc/resolv.conf so
>> >> often.
>> >> 3. Don't call shutdown after close. You don't really need to call it
>> >> at all.
>> >> 4. Do more than one thing with a connection. As far as I can tell,
>> >> this is a communication from the client library to the slave; it
>> >> should be possible for those connections to be longer-lived.
>> >> 5. Don't use gettimeofday so often; it is unreliable since time can
>> >> move backwards. Excessive system calls in general should be avoided,
>> >> so the 4 calls to gettimeofday and 5 calls to clock_gettime could be
>> >> eliminated in favor of some infrequently updated, coarse value stored
>> >> in memory.
>> >>
>> >> I'm not sure how much of this will address my performance issues or
>> >> not, but it seemed like a reasonable start. I'm debating whether to
>> >> dive into the code myself, but the few times I took a look I quickly
>> >> got lost. Is there a diagram of how the libraries/files fit together,
>> >> or maybe a protocol description (so I could just replace it with a
>> >> native python library)?
>> >>
>> >> Or alternatively, are there plans to improve the performance of Mesos?
>> >> I know the Spark authors have a mode to bypass Mesos in their next
>> >> release; I'm debating whether to follow suit or not.
>> >>
>> >> --
>> >> Scott
>>
>> --
>> Scott
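On point 5 of the list above, a minimal sketch, not libprocess code, of the "infrequently updated, coarse value stored in memory" idea: one background thread refreshes a cached timestamp, and hot paths read memory instead of making a clock system call per message. The 10 ms refresh interval is an arbitrary assumption.

    # Minimal sketch of a coarse clock: a background thread refreshes a
    # cached timestamp; readers get a plain attribute read instead of a
    # gettimeofday/clock_gettime call on every message.
    import threading
    import time

    class CoarseClock(object):
        def __init__(self, resolution=0.01):
            self._now = time.time()
            self._resolution = resolution
            t = threading.Thread(target=self._refresh)
            t.daemon = True
            t.start()

        def _refresh(self):
            while True:
                self._now = time.time()   # one syscall per interval
                time.sleep(self._resolution)

        def now(self):
            return self._now              # no syscall on the hot path

    clock = CoarseClock()
    print(clock.now())

Timestamps that only need to be roughly current (logging, coarse timeouts) can read this value; anything that needs precise or monotonic time still has to call the real clock.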
