> On April 25, 2013, 8:07 p.m., Benjamin Hindman wrote:
> > Can you elaborate on why AtMost(1) was not sufficient?
>
> Thomas Marshall wrote:
> I honestly have no idea, although I just realized that the JIRA issue I
> linked to is actually referring to a different problem. That issue (timing
> out on waiting for the registered future) is fixed by the recent change that
> extended the amount of time that we wait on futures. The problem being fixed
> by this patch looks something more like:
>
> ...
> I0425 13:37:03.516690 30336 hierarchical_allocator_process.hpp:423]
> Removed slave 201304251337-16842879-37747-30328-0
> I0425 13:37:03.516121 30334 slave.cpp:486] Slave asked to shut down by
> [email protected]:37747
> I0425 13:37:03.517014 30334 slave.cpp:1099] Asked to shut down framework
> 201304251337-16842879-37747-30328-0000 by [email protected]:37747
> W0425 13:37:03.517163 30334 slave.cpp:1120] Ignoring shutdown framework
> 201304251337-16842879-37747-30328-0000 because it is terminating
> I0425 13:37:03.518982 30334 slave.cpp:1867] [email protected]:37747 exited
> [Thread 0x7fffb0cff700 (LWP 30378) exited]
> W0425 13:37:03.519126 30334 slave.cpp:1870] Master disconnected! Waiting
> for a new master to be elected
> [Thread 0x7fffabfff700 (LWP 30379) exited]
> I0425 13:37:03.521286 30328 slave.cpp:441] Slave terminating
> I0425 13:37:03.521467 30328 slave.cpp:1099] Asked to shut down framework
> 201304251337-16842879-37747-30328-0000 by @0.0.0.0:0
> W0425 13:37:03.521649 30328 slave.cpp:1120] Ignoring shutdown framework
> 201304251337-16842879-37747-30328-0000 because it is terminating
> [Thread 0x7fffb1d01700 (LWP 30374) exited]
> [Thread 0x7fffab7fe700 (LWP 30376) exited]
>
> Program received signal SIGPIPE, Broken pipe.
> [Switching to Thread 0x7fffb230e700 (LWP 30370)]
> 0x00007ffff6724ccd in write () from /lib/x86_64-linux-gnu/libpthread.so.0
> (gdb) where
> #0 0x00007ffff6724ccd in write () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #1 0x00007fffb2316ab2 in Java_sun_nio_ch_FileDispatcherImpl_write0 ()
> from /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/libnio.so
> #2 0x00007fffcc39ef90 in ?? ()
> #3 0x0000000000000000 in ?? ()
>
> That error is occurring somewhere deep down inside zookeeper, and I don't
> really know what's causing it, other than obviously something related to the
> slave not being fully shut down yet when we shut down the master. If it seems
> important to you, I can continue investigating, but I suspect that it's a
> quirk of how our test infrastructure interacts with zookeeper, and I don't
> think that it's likely to come up in practice.
>
> Benjamin Hindman wrote:
> So, I don't see an error in the output that you've shown. When running
> tests with gdb and a JVM (which you get with ZooKeeper tests) you need to
> ignore most signals since that's the JVM's normal operation. I did see a
> segfault that the JVM propagated in a previous test run, but it's not clear
> to me why this fix will cover that segfault ...
>
> Thomas Marshall wrote:
> Sorry, I didn't realize that was expected behavior. In that case, the
> problem this patch is solving looks something more like:
>
> ...
> I0426 10:29:39.232862 12073 master.cpp:774] Asked to unregister framework
> 201304261029-16842879-56235-12059-0000
> I0426 10:29:39.233058 12073 master.hpp:300] Removing task with resources
> cpus=1; mem=500 on slave 201304261029-16842879-56235-12059-0
> I0426 10:29:39.233082 12079 hierarchical_allocator_process.hpp:359]
> Deactivated framework 201304261029-16842879-56235-12059-0000
> I0426 10:29:39.233254 12073 master.hpp:318] Removing offer with resources
> cpus=1; mem=524; ports=[31000-32000]; disk=9053 on slave
> 201304261029-16842879-56235-12059-0
> I0426 10:29:39.233487 12079 hierarchical_allocator_process.hpp:544]
> Recovered cpus=1; mem=500 (total allocatable: cpus=1; mem=500; ports=[];
> disk=0) on slave 201304261029-16842879-56235-12059-0 from framework
> 201304261029-16842879-56235-12059-0000
> I0426 10:29:39.233636 12073 master.cpp:477] Master terminating
> I0426 10:29:39.233777 12059 master.cpp:283] Shutting down master
> I0426 10:29:39.233770 12079 hierarchical_allocator_process.hpp:544]
> Recovered cpus=1; mem=524; ports=[31000-32000]; disk=9053 (total allocatable:
> cpus=2; mem=1024; ports=[31000-32000]; disk=9053) on slave
> 201304261029-16842879-56235-12059-0 from framework
> 201304261029-16842879-56235-12059-0000
> I0426 10:29:39.233077 12074 slave.cpp:1099] Asked to shut down framework
> 201304261029-16842879-56235-12059-0000 by [email protected]:56235
> W0426 10:29:39.234302 12074 slave.cpp:1120] Ignoring shutdown framework
> 201304261029-16842879-56235-12059-0000 because it is terminating
> I0426 10:29:39.234310 12078 hierarchical_allocator_process.hpp:423]
> Removed slave 201304261029-16842879-56235-12059-0
> I0426 10:29:39.234422 12074 slave.cpp:486] Slave asked to shut down by
> [email protected]:56235
> I0426 10:29:39.234792 12074 slave.cpp:1099] Asked to shut down framework
> 201304261029-16842879-56235-12059-0000 by [email protected]:56235
> W0426 10:29:39.234848 12074 slave.cpp:1120] Ignoring shutdown framework
> 201304261029-16842879-56235-12059-0000 because it is terminating
> I0426 10:29:39.234915 12074 slave.cpp:441] Slave terminating
> I0426 10:29:39.234956 12074 slave.cpp:1099] Asked to shut down framework
> 201304261029-16842879-56235-12059-0000 by @0.0.0.0:0
> W0426 10:29:39.234998 12074 slave.cpp:1120] Ignoring shutdown framework
> 201304261029-16842879-56235-12059-0000 because it is terminating
> pure virtual method called
> terminate called without an active exception
> Aborted (core dumped)

I've seen this "pure virtual method" error before, but I'm unsure of the cause. Do you know why/how your patch fixes this?


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10786/#review19724
-----------------------------------------------------------


On April 25, 2013, 8:05 p.m., Thomas Marshall wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10786/
> -----------------------------------------------------------
>
> (Updated April 25, 2013, 8:05 p.m.)
>
>
> Review request for mesos, Benjamin Hindman, Vinod Kone, and Ben Mahler.
>
>
> Description
> -------
>
> See summary.
>
>
> This addresses bug MESOS-441.
> https://issues.apache.org/jira/browse/MESOS-441
>
>
> Diffs
> -----
>
> src/tests/allocator_zookeeper_tests.cpp 2c7deb1
>
> Diff: https://reviews.apache.org/r/10786/diff/
>
>
> Testing
> -------
>
> bin/mesos-tests.sh --gtest_break_on_failure --gtest_repeat=1000
> --gtest_filter=*AllocatorZoo*
>
>
> Thanks,
>
> Thomas Marshall
>
>
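
On the "pure virtual method called" abort in the log above: one general C++ mechanism that produces this message (not confirmed as the cause in this test) is a virtual call that reaches an object whose derived part has already been destroyed, for example when one object is torn down while another is still calling into it. Once the derived destructor has run, virtual dispatch resolves to the base class. If the base version is pure virtual, the libstdc++ runtime prints that message and calls terminate. The sketch below shows the dispatch change safely by using a non-pure base method; `Base`, `Derived`, and `nameSeenDuringDestruction` are hypothetical names for illustration and do not come from Mesos.

```cpp
#include <cassert>
#include <string>

// Hypothetical types for illustration only; none of these names are Mesos APIs.
struct Base {
  explicit Base(std::string* seen) : seen_(seen) {}

  virtual ~Base() {
    // By the time ~Base runs, the Derived part is already gone, so this
    // virtual call resolves to Base::name(), not Derived::name().
    *seen_ = name();
  }

  // If this were declared "= 0" and reached indirectly during destruction,
  // the runtime would abort instead of returning a value.
  virtual std::string name() const { return "Base"; }

 private:
  std::string* seen_;  // Where the destructor records what it dispatched to.
};

struct Derived : Base {
  explicit Derived(std::string* seen) : Base(seen) {}
  std::string name() const override { return "Derived"; }
};

// Returns which override was reached from inside the base destructor.
std::string nameSeenDuringDestruction() {
  std::string seen;
  {
    Derived d(&seen);
  }  // d is destroyed here: ~Derived runs first, then ~Base.
  return seen;
}
```

Calling `nameSeenDuringDestruction()` returns "Base" even though the object was constructed as a `Derived`. If shutdown order is what matters here, that would be consistent with a test that only passes when one component is fully terminated before another is destroyed, but that remains an assumption.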
