It looks like stime is wacky:

I0701 20:16:45.686691 16005 process_isolator.cpp:99] Launching
> executor_Task_Tracker_39837 (cd hadoop-* && ./bin/mesos-executor) in
> /tmp/mesos/slaves/201307011931-3660134922-5050-27773-98/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_39837/runs/b04e3e21-27df-49c6-92ff-e2d8c445cde5
> with resources cpus=1; mem=5000' for framework
> 201306291951-3660134922-5050-13580-0002
> I0701 20:16:45.687299 15993 slave.cpp:510] Successfully attached file
> '/tmp/mesos/slaves/201307011931-3660134922-5050-27773-98/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_39837/runs/b04e3e21-27df-49c6-92ff-e2d8c445cde5'
> I0701 20:16:45.689007 16005 process_isolator.cpp:161] Forked executor at
> 16141
> Fetching resources into
> '/tmp/mesos/slaves/201307011931-3660134922-5050-27773-98/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_39837/runs/b04e3e21-27df-49c6-92ff-e2d8c445cde5'
> Fetching resource 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> Downloading resource from
> 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> HDFS command: hadoop fs -copyToLocal
> 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> Extracting resource: tar xJf './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
> utime=0 stime=0 ticks=100
> utime=0 stime=1 ticks=100
> utime=0 stime=1 ticks=100
> utime=0 stime=1 ticks=100
> utime=0 stime=1 ticks=100
> utime=0 stime=1 ticks=100
> utime=0 stime=1 ticks=100
> utime=80 stime=537 ticks=100
> utime=9 stime=0 ticks=100
> utime=0 stime=2432 ticks=100
> utime=0 stime=4344 ticks=100
> utime=1589 stime=0 ticks=100
> utime=0 stime=1608 ticks=100
> utime=0 stime=0 ticks=100
> utime=2 stime=4286 ticks=100
> utime=1139 stime=0 ticks=100
> utime=0 stime=1571 ticks=100
> utime=0 stime=0 ticks=100
> utime=0 stime=1740 ticks=100
> utime=158 stime=958 ticks=100
> utime=0 stime=1403 ticks=100
> utime=35 stime=1527 ticks=100
> utime=1033 stime=0 ticks=100
> utime=0 stime=1514 ticks=100
> utime=8 stime=1454 ticks=100
> utime=1046 stime=0 ticks=100
> utime=0 stime=1389 ticks=100
> utime=1 stime=1309 ticks=100
> utime=984 stime=0 ticks=100
> utime=0 stime=1554 ticks=100
> utime=22 stime=1209 ticks=100
> utime=1015 stime=0 ticks=100
> utime=0 stime=1316 ticks=100
> utime=2 stime=1299 ticks=100
> utime=637 stime=328 ticks=100
> utime=0 stime=1736 ticks=100
> utime=53 stime=1091 ticks=100
> utime=1080 stime=0 ticks=100
> utime=0 stime=1503 ticks=100
> utime=49 stime=909 ticks=100
> utime=864 stime=0 ticks=100
> utime=0 stime=1447 ticks=100
> utime=2 stime=1123 ticks=100
> utime=887 stime=0 ticks=100
> utime=0 stime=1583 ticks=100
> utime=20 stime=1104 ticks=100
> utime=568 stime=319 ticks=100
> utime=0 stime=1582 ticks=100
> utime=20 stime=1041 ticks=100
> utime=928 stime=0 ticks=100
> utime=0 stime=1380 ticks=100
> utime=27 stime=1061 ticks=100
> utime=911 stime=0 ticks=100
> utime=0 stime=1500 ticks=100
> utime=57 stime=1031 ticks=100
> utime=996 stime=0 ticks=100
> utime=0 stime=1678 ticks=100
> utime=3 stime=1018 ticks=100
> utime=976 stime=0 ticks=100
> utime=0 stime=1164 ticks=100
> utime=0 stime=0 ticks=100
> utime=39 stime=617 ticks=100
> utime=938 stime=0 ticks=100
> utime=0 stime=1276 ticks=100
> utime=0 stime=0 ticks=100
> utime=42 stime=667 ticks=100
> utime=919 stime=0 ticks=100
> utime=0 stime=1308 ticks=100
> utime=0 stime=0 ticks=100
> utime=24 stime=747 ticks=100
> utime=1156 stime=0 ticks=100
> utime=0 stime=1863 ticks=100
> utime=0 stime=0 ticks=100
> utime=8 stime=616 ticks=100
> utime=894 stime=0 ticks=100
> utime=0 stime=1036 ticks=100
> utime=0 stime=0 ticks=100
> utime=6 stime=740 ticks=100
> utime=933 stime=0 ticks=100
> utime=0 stime=975 ticks=100
> utime=0 stime=0 ticks=100
> utime=140 stime=491 ticks=100
> utime=942 stime=0 ticks=100
> utime=0 stime=1117 ticks=100
> utime=0 stime=0 ticks=100
> utime=116 stime=578 ticks=100
> utime=869 stime=0 ticks=100
> utime=0 stime=1121 ticks=100
> utime=0 stime=0 ticks=100
> utime=52 stime=1844674407370 ticks=100
> Try::get() but state == ERROR: Argument larger than the maximum number of
> seconds that a Duration can represent due to int64_t's size limit.
> *** Aborted at 1372709810 (unix time) try "date -d @1372709810" if you are
> using GNU date ***
> PC: @     0x7f80329a7425 (unknown)
> *** SIGABRT (@0x3e4c) received by PID 15948 (TID 0x7f802a1e5700) from PID
> 15948; stack trace: ***
>     @     0x7f8032d3fcb0 (unknown)
>     @     0x7f80329a7425 (unknown)
>     @     0x7f80329aab8b (unknown)
>     @     0x7f803384c63a os::process()
>     @     0x7f803384e702 os::processes()
>     @     0x7f80338507bc os::children()
>     @     0x7f80338446b3 mesos::internal::slave::ProcessIsolator::usage()
>     @     0x7f80337cf690 std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f80337dd606 process::internal::pdispatcher<>()
>     @     0x7f80337d08e8 std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f80339c215c process::ProcessManager::resume()
>     @     0x7f80339c2dbc process::schedule()
>     @     0x7f8032d37e9a start_thread
>     @     0x7f8032a64ccd (unknown)
> I0701 20:16:54.926120 16220 main.cpp:119] Creating "process" isolator
> I0701 20:16:54.926836 16220 main.cpp:127] Build: 2013-06-15 01:24:12 by
> root
> I0701 20:16:54.926872 16220 main.cpp:128] Starting Mesos slave




On Mon, Jul 1, 2013 at 12:30 PM, Brenden Matthews <
[email protected]> wrote:

> Not yet!  As soon as I have some samples I'll pass them along.
>
>
> On Mon, Jul 1, 2013 at 12:25 PM, Benjamin Mahler <
> [email protected]> wrote:
>
>> Cool, do you have the values handy?
>>
>>
>> On Mon, Jul 1, 2013 at 12:18 PM, Brenden Matthews <
>> [email protected]> wrote:
>>
>> > I did a quick patch to print out the values.  It seems to happen
>> frequently
>> > on some slaves, and not at all on others.
>> >
>> >
>> > On Mon, Jul 1, 2013 at 12:07 PM, Benjamin Mahler
>> > <[email protected]>wrote:
>> >
>> > > Looks likely, would have been nice if we printed it out! ;)
>> > >
>> > > I'm curious, is this something rare? Or is it crashing consistently?
>> The
>> > > former would point to something odd with the values in
>> /proc/<pid>/stat,
>> > > the latter would point to an issue with our code.
>> > >
>> > > I can send out a fix today, but without being able to reproduce it
>> would
>> > > simply be ignoring the strange utime / stime values.
>> > >
>> > >
>> > > On Mon, Jul 1, 2013 at 11:12 AM, Brenden Matthews <
>> > > [email protected]> wrote:
>> > >
>> > > > Hey guys,
>> > > >
>> > > > I'm getting another slave crash with the process usage stuff.
>>  Here's
>> > the
>> > > > log:
>> > > >
>> > > > I0701 16:44:51.263236 11682 slave.cpp:528] New master detected at
>> > > > > [email protected]:5050
>> > > > > I0701 16:44:51.263598 11672 gc.cpp:56] Scheduling
>> > > > > '/tmp/mesos/slaves/201306291951-3660134922-5050-13580-1538' for
>> > removal
>> > > > > I0701 16:44:51.264078 11676 status_update_manager.cpp:155] New
>> master
>> > > > > detected at [email protected]:5050
>> > > > > I0701 16:44:51.283917 11672 slave.cpp:588] Registered with master
>> > > > > [email protected]:5050; given slave ID
>> > > > > 201306291951-3660134922-5050-13580-2759
>> > > > > I0701 16:44:52.194198 11657 slave.cpp:1413] Got registration for
>> > > executor
>> > > > > 'executor_Task_Tracker_37434' of framework
>> > > > > 201306291951-3660134922-5050-13580-0002
>> > > > > W0701 16:44:52.194744 11657 slave.cpp:1438]  Shutting down
>> executor
>> > > > > 'executor_Task_Tracker_37434' as the framework
>> > > > > 201306291951-3660134922-5050-13580-0002 does not
>> > > > > exist
>> > > > > I0701 16:44:56.630949 11666 slave.cpp:738] Got assigned task
>> > > > > Task_Tracker_37439 for framework
>> > > 201306291951-3660134922-5050-13580-0002
>> > > > > I0701 16:44:56.632294 11666 slave.cpp:836] Launching task
>> > > > > Task_Tracker_37439 for framework
>> > > 201306291951-3660134922-5050-13580-0002
>> > > > > I0701 16:44:56.634282 11666 paths.hpp:303] Created executor
>> directory
>> > > > >
>> > > >
>> > >
>> >
>> '/tmp/mesos/slaves/201306291951-3660134922-5050-13580-2759/frameworks/201306291951-3660134922-
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> 5050-13580-0002/executors/executor_Task_Tracker_37439/runs/cf5d7062-b1cd-4da6-8b67-e7d3caa8bc9d'
>> > > > > I0701 16:44:56.634918 11666 slave.cpp:947] Queuing task
>> > > > > 'Task_Tracker_37439' for executor executor_Task_Tracker_37439 of
>> > > > framework
>> > > > > '201306291951-3660134922-5050-135
>> > > > > 80-0002
>> > > > > I0701 16:44:56.634908 11683 process_isolator.cpp:99] Launching
>> > > > > executor_Task_Tracker_37439 (cd hadoop-* && ./bin/mesos-executor)
>> in
>> > > > > /tmp/mesos/slaves/201306291951-3
>> > > > >
>> > > >
>> > >
>> >
>> 660134922-5050-13580-2759/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_37439/runs/cf5d7062-b1cd-4da6-8b67-e7d3caa8bc9d
>> > > > > with re
>> > > > > sources cpus=1; mem=5000' for framework
>> > > > > 201306291951-3660134922-5050-13580-0002
>> > > > > I0701 16:44:56.637537 11666 slave.cpp:510] Successfully attached
>> file
>> > > > >
>> > > >
>> > >
>> >
>> '/tmp/mesos/slaves/201306291951-3660134922-5050-13580-2759/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_37439/runs/cf5d7062-b1cd-4da6-8b67-e7d3caa8bc9d'
>> > > > > I0701 16:44:56.637923 11683 process_isolator.cpp:161] Forked
>> executor
>> > > at
>> > > > > 11809
>> > > > > Fetching resources into
>> > > > >
>> > > >
>> > >
>> >
>> '/tmp/mesos/slaves/201306291951-3660134922-5050-13580-2759/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_37439/runs/cf5d7062-b1cd-4da6-8b67-e7d3caa8bc9d'
>> > > > > Fetching resource
>> > > > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
>> > > > > Downloading resource from
>> > > > > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
>> > > > > HDFS command: hadoop fs -copyToLocal
>> > > > > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
>> > > > > './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
>> > > > > Extracting resource: tar xJf
>> > './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz'
>> > > > > Try::get() but state == ERROR: Argument larger than the maximum
>> > number
>> > > of
>> > > > > seconds that a Duration can represent due to int64_t's size limit.
>> > > > > *** Aborted at 1372697101 (unix time) try "date -d @1372697101" if
>> > you
>> > > > are
>> > > > > using GNU date ***
>> > > > > PC: @     0x7f907ac82425 (unknown)
>> > > > > *** SIGABRT (@0x2d69) received by PID 11625 (TID 0x7f906e4b8700)
>> from
>> > > PID
>> > > > > 11625; stack trace: ***
>> > > > >     @     0x7f907b01acb0 (unknown)
>> > > > >     @     0x7f907ac82425 (unknown)
>> > > > >     @     0x7f907ac85b8b (unknown)
>> > > > >     @     0x7f907bb274ea os::process()
>> > > > >     @     0x7f907bb296d2 os::processes()
>> > > > >     @     0x7f907bb2b78c os::children()
>> > > > >     @     0x7f907bb1f5d3
>> > > mesos::internal::slave::ProcessIsolator::usage()
>> > > > >     @     0x7f907baaa5b0
>> std::tr1::_Function_handler<>::_M_invoke()
>> > > > >     @     0x7f907bab8526 process::internal::pdispatcher<>()
>> > > > >     @     0x7f907baab808
>> std::tr1::_Function_handler<>::_M_invoke()
>> > > > >     @     0x7f907bc9d17c process::ProcessManager::resume()
>> > > > >     @     0x7f907bc9dddc process::schedule()
>> > > > >     @     0x7f907b012e9a start_thread
>> > > > >     @     0x7f907ad3fccd (unknown)
>> > > > > I0701 16:45:02.274014 11899 main.cpp:119] Creating "process"
>> isolator
>> > > > > I0701 16:45:02.274749 11899 main.cpp:127] Build: 2013-06-18
>> 01:38:35
>> > by
>> > > > > I0701 16:45:02.274782 11899 main.cpp:128] Starting Mesos slave
>> > > >
>> > > >
>> > > >
>> > > > Here's the gdb backtrace:
>> > > >
>> > > > (gdb) bt
>> > > > > #0  0x00007f8da4577425 in raise () from
>> > /lib/x86_64-linux-gnu/libc.so.6
>> > > > > #1  0x00007f8da457ab8b in abort () from
>> > /lib/x86_64-linux-gnu/libc.so.6
>> > > > > #2  0x00007f8da541c4ea in get (this=0x7f8d965a91e0) at
>> > > > > ../../3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:66
>> > > > > #3  os::process (pid=<optimized out>) at
>> > > > >
>> > ../../3rdparty/libprocess/3rdparty/stout/include/stout/os/linux.hpp:57
>> > > > > #4  0x00007f8da541e6d2 in os::processes () at
>> > > > > ../../3rdparty/libprocess/3rdparty/stout/include/stout/os.hpp:984
>> > > > > #5  0x00007f8da542078c in os::children (pid=12260,
>> recursive=true) at
>> > > > > ../../3rdparty/libprocess/3rdparty/stout/include/stout/os.hpp:997
>> > > > > #6  0x00007f8da54145d3 in
>> > > mesos::internal::slave::ProcessIsolator::usage
>> > > > > (this=<optimized out>, frameworkId=..., executorId=...)
>> > > > >     at ../../src/slave/process_isolator.cpp:396
>> > > > > #7  0x00007f8da539f5b0 in operator() (__args#1=..., __args#0=...,
>> > > > > this=<optimized out>, __object=<optimized out>) at
>> > > > > /usr/include/c++/4.6/tr1/functional:572
>> > > > > #8  __call<mesos::internal::slave::Isolator*&, 0, 1, 2>
>> (__args=...,
>> > > > > this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1153
>> > > > > #9  operator()<mesos::internal::slave::Isolator*> (this=<optimized
>> > > out>)
>> > > > > at /usr/include/c++/4.6/tr1/functional:1207
>> > > > > #10
>> > > >
>> std::tr1::_Function_handler<process::Future<mesos::ResourceStatistics>
>> > > > > (mesos::internal::slave::Isolator*),
>> > > > >
>> > > >
>> > >
>> >
>> std::tr1::_Bind<std::tr1::_Mem_fn<process::Future<mesos::ResourceStatistics>
>> > > > > (mesos::internal::slave::Isolator::*)(mesos::FrameworkID const&,
>> > > > > mesos::ExecutorID const&)> (std::tr1::_Placeholder<1>,
>> > > > mesos::FrameworkID,
>> > > > > mesos::ExecutorID)> >::_M_invoke(std::tr1::_Any_data const&,
>> > > > > mesos::internal::slave::Isolator*) (__functor=...,
>> > __args#0=<optimized
>> > > > out>)
>> > > > >     at /usr/include/c++/4.6/tr1/functional:1670
>> > > > > #11 0x00007f8da53ad526 in operator() (__args#0=<optimized out>,
>> > > > > this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:2040
>> > > > > #12 process::internal::pdispatcher<mesos::ResourceStatistics,
>> > > > > mesos::internal::slave::Isolator>(process::ProcessBase*,
>> > > > >
>> > > >
>> > >
>> >
>> std::tr1::shared_ptr<std::tr1::function<process::Future<mesos::ResourceStatistics>
>> > > > > (mesos::internal::slave::Isolator*)> >,
>> > > > > std::tr1::shared_ptr<process::Promise<mesos::ResourceStatistics>
>> >) (
>> > > > >     process=<optimized out>, thunk=..., promise=...) at
>> > > > > ../../3rdparty/libprocess/include/process/dispatch.hpp:86
>> > > > > #13 0x00007f8da53a0808 in __call<process::ProcessBase*&, 0, 1, 2>
>> > > > > (__args=..., this=<optimized out>) at
>> > > > > /usr/include/c++/4.6/tr1/functional:1153
>> > > > > #14 operator()<process::ProcessBase*> (this=<optimized out>) at
>> > > > > /usr/include/c++/4.6/tr1/functional:1207
>> > > > > #15 std::tr1::_Function_handler<void (process::ProcessBase*),
>> > > > > std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>,
>> > > > >
>> > > >
>> > >
>> >
>> std::tr1::shared_ptr<std::tr1::function<process::Future<mesos::ResourceStatistics>
>> > > > > (mesos::internal::slave::Isolator*)> >,
>> > > > > std::tr1::shared_ptr<process::Promise<mesos::ResourceStatistics>
>> > > > > >))(process::ProcessBase*,
>> > > > >
>> > > >
>> > >
>> >
>> std::tr1::shared_ptr<std::tr1::function<process::Future<mesos::ResourceStatistics>
>> > > > > (mesos::internal::slave::Isolator*)> >,
>> > > > > std::tr1::shared_ptr<process::Promise<mesos::ResourceStatistics>
>> >)>
>> > > > > >::_M_invoke(std::tr1::_Any_data const&, process::ProcessBase*)
>> > > > > (__functor=..., __args#0=<optimized out>)
>> > > > >     at /usr/include/c++/4.6/tr1/functional:1684
>> > > > > #16 0x00007f8da559217c in process::ProcessManager::resume
>> > > (this=0xeedf20,
>> > > > > process=0xf03df8) at
>> > ../../../3rdparty/libprocess/src/process.cpp:2446
>> > > > > #17 0x00007f8da5592ddc in process::schedule (arg=<optimized out>)
>> at
>> > > > > ../../../3rdparty/libprocess/src/process.cpp:1175
>> > > > > #18 0x00007f8da4907e9a in start_thread () from
>> > > > > /lib/x86_64-linux-gnu/libpthread.so.0
>> > > > > #19 0x00007f8da4634ccd in clone () from
>> > /lib/x86_64-linux-gnu/libc.so.6
>> > > > > #20 0x0000000000000000 in ?? ()
>> > > > > (gdb)
>> > > >
>> > > >
>> > > >
>> > > > It looks like either `ticks` or `utime`/`stime` has an erroneous
>> value.
>> > > >  Thoughts?
>> > > >
>> > >
>> >
>>
>
>

Reply via email to