Cool, do you have the values handy?

On Mon, Jul 1, 2013 at 12:18 PM, Brenden Matthews <[email protected]> wrote:

> I did a quick patch to print out the values. It seems to happen
> frequently on some slaves, and not at all on others.
>
> On Mon, Jul 1, 2013 at 12:07 PM, Benjamin Mahler <[email protected]> wrote:
>
> > Looks likely, would have been nice if we printed it out! ;)
> >
> > I'm curious, is this something rare? Or is it crashing consistently?
> > The former would point to something odd with the values in
> > /proc/<pid>/stat, the latter would point to an issue with our code.
> >
> > I can send out a fix today, but without being able to reproduce it
> > would simply be ignoring the strange utime / stime values.
> >
> > On Mon, Jul 1, 2013 at 11:12 AM, Brenden Matthews <[email protected]> wrote:
> >
> > > Hey guys,
> > >
> > > I'm getting another slave crash with the process usage stuff.
> > > Here's the log:
> > >
> > > > I0701 16:44:51.263236 11682 slave.cpp:528] New master detected at [email protected]:5050
> > > > I0701 16:44:51.263598 11672 gc.cpp:56] Scheduling '/tmp/mesos/slaves/201306291951-3660134922-5050-13580-1538' for removal
> > > > I0701 16:44:51.264078 11676 status_update_manager.cpp:155] New master detected at [email protected]:5050
> > > > I0701 16:44:51.283917 11672 slave.cpp:588] Registered with master [email protected]:5050; given slave ID 201306291951-3660134922-5050-13580-2759
> > > > I0701 16:44:52.194198 11657 slave.cpp:1413] Got registration for executor 'executor_Task_Tracker_37434' of framework 201306291951-3660134922-5050-13580-0002
> > > > W0701 16:44:52.194744 11657 slave.cpp:1438] Shutting down executor 'executor_Task_Tracker_37434' as the framework 201306291951-3660134922-5050-13580-0002 does not exist
> > > > I0701 16:44:56.630949 11666 slave.cpp:738] Got assigned task Task_Tracker_37439 for framework 201306291951-3660134922-5050-13580-0002
> > > > I0701 16:44:56.632294 11666 slave.cpp:836] Launching task Task_Tracker_37439 for framework 201306291951-3660134922-5050-13580-0002
> > > > I0701 16:44:56.634282 11666 paths.hpp:303] Created executor directory '/tmp/mesos/slaves/201306291951-3660134922-5050-13580-2759/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_37439/runs/cf5d7062-b1cd-4da6-8b67-e7d3caa8bc9d'
> > > > I0701 16:44:56.634918 11666 slave.cpp:947] Queuing task 'Task_Tracker_37439' for executor executor_Task_Tracker_37439 of framework '201306291951-3660134922-5050-13580-0002
> > > > I0701 16:44:56.634908 11683 process_isolator.cpp:99] Launching executor_Task_Tracker_37439 (cd hadoop-* && ./bin/mesos-executor) in /tmp/mesos/slaves/201306291951-3660134922-5050-13580-2759/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_37439/runs/cf5d7062-b1cd-4da6-8b67-e7d3caa8bc9d with resources cpus=1; mem=5000' for framework 201306291951-3660134922-5050-13580-0002
> > > > I0701 16:44:56.637537 11666 slave.cpp:510] Successfully attached file '/tmp/mesos/slaves/201306291951-3660134922-5050-13580-2759/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_37439/runs/cf5d7062-b1cd-4da6-8b67-e7d3caa8bc9d'
> > > > I0701 16:44:56.637923 11683 process_isolator.cpp:161] Forked executor at 11809
> > > > Fetching resources into '/tmp/mesos/slaves/201306291951-3660134922-5050-13580-2759/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_37439/runs/cf5d7062-b1cd-4da6-8b67-e7d3caa8bc9d'
> > > > Fetching resource 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > Downloading resource from 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > HDFS command: hadoop fs -copyToLocal 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > Extracting resource: tar xJf './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> > > > Try::get() but state == ERROR: Argument larger than the maximum number of seconds that a Duration can represent due to int64_t's size limit.
> > > > *** Aborted at 1372697101 (unix time) try "date -d @1372697101" if you are using GNU date ***
> > > > PC: @ 0x7f907ac82425 (unknown)
> > > > *** SIGABRT (@0x2d69) received by PID 11625 (TID 0x7f906e4b8700) from PID 11625; stack trace: ***
> > > > @ 0x7f907b01acb0 (unknown)
> > > > @ 0x7f907ac82425 (unknown)
> > > > @ 0x7f907ac85b8b (unknown)
> > > > @ 0x7f907bb274ea os::process()
> > > > @ 0x7f907bb296d2 os::processes()
> > > > @ 0x7f907bb2b78c os::children()
> > > > @ 0x7f907bb1f5d3 mesos::internal::slave::ProcessIsolator::usage()
> > > > @ 0x7f907baaa5b0 std::tr1::_Function_handler<>::_M_invoke()
> > > > @ 0x7f907bab8526 process::internal::pdispatcher<>()
> > > > @ 0x7f907baab808 std::tr1::_Function_handler<>::_M_invoke()
> > > > @ 0x7f907bc9d17c process::ProcessManager::resume()
> > > > @ 0x7f907bc9dddc process::schedule()
> > > > @ 0x7f907b012e9a start_thread
> > > > @ 0x7f907ad3fccd (unknown)
> > > > I0701 16:45:02.274014 11899 main.cpp:119] Creating "process" isolator
> > > > I0701 16:45:02.274749 11899 main.cpp:127] Build: 2013-06-18 01:38:35 by
> > > > I0701 16:45:02.274782 11899 main.cpp:128] Starting Mesos slave
> > >
> > > Here's the gdb backtrace:
> > >
> > > (gdb) bt
> > > > #0 0x00007f8da4577425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> > > > #1 0x00007f8da457ab8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> > > > #2 0x00007f8da541c4ea in get (this=0x7f8d965a91e0) at ../../3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:66
> > > > #3 os::process (pid=<optimized out>) at ../../3rdparty/libprocess/3rdparty/stout/include/stout/os/linux.hpp:57
> > > > #4 0x00007f8da541e6d2 in os::processes () at ../../3rdparty/libprocess/3rdparty/stout/include/stout/os.hpp:984
> > > > #5 0x00007f8da542078c in os::children (pid=12260, recursive=true) at ../../3rdparty/libprocess/3rdparty/stout/include/stout/os.hpp:997
> > > > #6 0x00007f8da54145d3 in mesos::internal::slave::ProcessIsolator::usage (this=<optimized out>, frameworkId=..., executorId=...) at ../../src/slave/process_isolator.cpp:396
> > > > #7 0x00007f8da539f5b0 in operator() (__args#1=..., __args#0=..., this=<optimized out>, __object=<optimized out>) at /usr/include/c++/4.6/tr1/functional:572
> > > > #8 __call<mesos::internal::slave::Isolator*&, 0, 1, 2> (__args=..., this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1153
> > > > #9 operator()<mesos::internal::slave::Isolator*> (this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1207
> > > > #10 std::tr1::_Function_handler<process::Future<mesos::ResourceStatistics> (mesos::internal::slave::Isolator*), std::tr1::_Bind<std::tr1::_Mem_fn<process::Future<mesos::ResourceStatistics> (mesos::internal::slave::Isolator::*)(mesos::FrameworkID const&, mesos::ExecutorID const&)> (std::tr1::_Placeholder<1>, mesos::FrameworkID, mesos::ExecutorID)> >::_M_invoke(std::tr1::_Any_data const&, mesos::internal::slave::Isolator*) (__functor=..., __args#0=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1670
> > > > #11 0x00007f8da53ad526 in operator() (__args#0=<optimized out>, this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:2040
> > > > #12 process::internal::pdispatcher<mesos::ResourceStatistics, mesos::internal::slave::Isolator>(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<process::Future<mesos::ResourceStatistics> (mesos::internal::slave::Isolator*)> >, std::tr1::shared_ptr<process::Promise<mesos::ResourceStatistics> >) (process=<optimized out>, thunk=..., promise=...) at ../../3rdparty/libprocess/include/process/dispatch.hpp:86
> > > > #13 0x00007f8da53a0808 in __call<process::ProcessBase*&, 0, 1, 2> (__args=..., this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1153
> > > > #14 operator()<process::ProcessBase*> (this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1207
> > > > #15 std::tr1::_Function_handler<void (process::ProcessBase*), std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<process::Future<mesos::ResourceStatistics> (mesos::internal::slave::Isolator*)> >, std::tr1::shared_ptr<process::Promise<mesos::ResourceStatistics> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<process::Future<mesos::ResourceStatistics> (mesos::internal::slave::Isolator*)> >, std::tr1::shared_ptr<process::Promise<mesos::ResourceStatistics> >)> >::_M_invoke(std::tr1::_Any_data const&, process::ProcessBase*) (__functor=..., __args#0=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1684
> > > > #16 0x00007f8da559217c in process::ProcessManager::resume (this=0xeedf20, process=0xf03df8) at ../../../3rdparty/libprocess/src/process.cpp:2446
> > > > #17 0x00007f8da5592ddc in process::schedule (arg=<optimized out>) at ../../../3rdparty/libprocess/src/process.cpp:1175
> > > > #18 0x00007f8da4907e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
> > > > #19 0x00007f8da4634ccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > > #20 0x0000000000000000 in ?? ()
> > > > (gdb)
> > >
> > > It looks like either `ticks` or `utime`/`stime` has an erroneous value.
> > > Thoughts?
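
For reference, here is a minimal Python sketch of the kind of guard discussed in the thread: read `utime` and `stime` from a `/proc/<pid>/stat` line and refuse to convert values that would not fit the duration type. This is not the stout or Mesos code. The function names `parse_cpu_ticks` and `ticks_to_seconds` are hypothetical, and the limit assumes the `Duration` in the error message stores nanoseconds in an `int64_t`, which the log suggests but does not state.

```python
INT64_MAX = 2**63 - 1
NANOS_PER_SECOND = 10**9
# Assumption: the Duration type holds nanoseconds in an int64_t, so this is
# the largest whole number of seconds it could represent.
MAX_SECONDS = INT64_MAX // NANOS_PER_SECOND


def parse_cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from one /proc/<pid>/stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')', so split only what follows the last ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    # utime is field 14 and stime is field 15.
    return int(rest[11]), int(rest[12])


def ticks_to_seconds(ticks, ticks_per_second):
    """Convert ticks to seconds, or return None for implausible input."""
    if ticks_per_second <= 0 or ticks < 0:
        return None
    seconds = ticks / ticks_per_second
    if seconds > MAX_SECONDS:
        # Would overflow the assumed int64 nanosecond representation.
        return None
    return seconds
```

On Linux the tick rate would normally come from `os.sysconf("SC_CLK_TCK")`. Returning `None` instead of aborting lets the caller log the raw values and skip that process, which is roughly the "ignore the strange utime / stime values" option mentioned above.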
