[ 
https://issues.apache.org/jira/browse/MESOS-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035631#comment-16035631
 ] 

Neil Conway commented on MESOS-1606:
------------------------------------

Perhaps a disk I/O error, e.g., due to a flaky disk?

> Slave failed to checkpoint on Mac OS X
> --------------------------------------
>
>                 Key: MESOS-1606
>                 URL: https://issues.apache.org/jira/browse/MESOS-1606
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>         Environment: Mac OS X, Darwin Kernel Version 13.3.0
>            Reporter: Zuyu Zhang
>
> The same failure also occurs with test_framework and LowLevelSchedulerLibprocess.
> {noformat}
> [ RUN      ] ExamplesTest.LowLevelSchedulerPthread
> Using temporary directory '/tmp/ExamplesTest_LowLevelSchedulerPthread_SCL6Al'
> Enabling authentication for the scheduler
> I0715 19:03:59.296200 2019271440 scheduler.cpp:132] Version: 0.20.0
> I0715 19:03:59.300429 2019271440 leveldb.cpp:176] Opened db in 1982us
> I0715 19:03:59.300900 2019271440 leveldb.cpp:183] Compacted db in 447us
> I0715 19:03:59.300946 2019271440 leveldb.cpp:198] Created db iterator in 27us
> I0715 19:03:59.300978 2019271440 leveldb.cpp:204] Seeked to beginning of db 
> in 16us
> I0715 19:03:59.301007 2019271440 leveldb.cpp:273] Iterated through 0 keys in 
> the db in 20us
> I0715 19:03:59.301053 2019271440 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0715 19:03:59.301713 222965760 recover.cpp:425] Starting replica recovery
> I0715 19:03:59.301914 222965760 recover.cpp:451] Replica is in EMPTY status
> I0715 19:03:59.302671 221892608 replica.cpp:638] Replica in EMPTY status 
> received a broadcasted recover request
> I0715 19:03:59.302781 224575488 recover.cpp:188] Received a recover response 
> from a replica in EMPTY status
> I0715 19:03:59.303050 225112064 recover.cpp:542] Updating replica status to 
> STARTING
> I0715 19:03:59.303432 222965760 leveldb.cpp:306] Persisting metadata (8 
> bytes) to leveldb took 298us
> I0715 19:03:59.303475 222965760 replica.cpp:320] Persisted replica status to 
> STARTING
> I0715 19:03:59.303540 221356032 recover.cpp:451] Replica is in STARTING status
> I0715 19:03:59.303797 224575488 master.cpp:288] Master 
> 20140715-190359-16777343-64313-60122 (localhost) started on 127.0.0.1:64313
> I0715 19:03:59.303848 224575488 master.cpp:325] Master only allowing 
> authenticated frameworks to register
> I0715 19:03:59.303865 224575488 master.cpp:332] Master allowing 
> unauthenticated slaves to register
> I0715 19:03:59.303884 224575488 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/ExamplesTest_LowLevelSchedulerPthread_SCL6Al/credentials'
> W0715 19:03:59.303961 224575488 credentials.hpp:51] Permissions on 
> credentials file 
> '/tmp/ExamplesTest_LowLevelSchedulerPthread_SCL6Al/credentials' are too open. 
> It is recommended that your credentials file is NOT accessible by others.
> I0715 19:03:59.304028 224575488 master.cpp:359] Authorization enabled
> I0715 19:03:59.304379 223502336 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0715 19:03:59.304505 2019271440 containerizer.cpp:124] Using isolation: 
> posix/cpu,posix/mem
> I0715 19:03:59.304666 223502336 recover.cpp:188] Received a recover response 
> from a replica in STARTING status
> I0715 19:03:59.304805 223502336 recover.cpp:542] Updating replica status to 
> VOTING
> I0715 19:03:59.305186 223502336 leveldb.cpp:306] Persisting metadata (8 
> bytes) to leveldb took 214us
> I0715 19:03:59.305219 223502336 replica.cpp:320] Persisted replica status to 
> VOTING
> I0715 19:03:59.305250 223502336 recover.cpp:556] Successfully joined the 
> Paxos group
> I0715 19:03:59.305361 223502336 recover.cpp:440] Recover process terminated
> I0715 19:03:59.305927 224038912 slave.cpp:168] Slave started on 
> 1)@127.0.0.1:64313
> I0715 19:03:59.306221 224038912 slave.cpp:279] Slave resources: cpus(*):4; 
> mem(*):7168; disk(*):470714; ports(*):[31000-32000]
> I0715 19:03:59.306234 2019271440 containerizer.cpp:124] Using isolation: 
> posix/cpu,posix/mem
> I0715 19:03:59.306248 223502336 master.cpp:1128] The newly elected leader is 
> master@127.0.0.1:64313 with id 20140715-190359-16777343-64313-60122
> I0715 19:03:59.306269 223502336 master.cpp:1141] Elected as the leading 
> master!
> I0715 19:03:59.306293 223502336 master.cpp:959] Recovering from registrar
> I0715 19:03:59.306395 225112064 registrar.cpp:313] Recovering registrar
> I0715 19:03:59.306617 221892608 log.cpp:656] Attempting to start the writer
> I0715 19:03:59.306952 224575488 slave.cpp:168] Slave started on 
> 2)@127.0.0.1:64313
> I0715 19:03:59.307158 224575488 slave.cpp:279] Slave resources: cpus(*):4; 
> mem(*):7168; disk(*):470714; ports(*):[31000-32000]
> I0715 19:03:59.307207 222965760 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0715 19:03:59.307401 224038912 slave.cpp:324] Slave hostname: localhost
> I0715 19:03:59.307459 224038912 slave.cpp:325] Slave checkpoint: true
> I0715 19:03:59.307446 222965760 leveldb.cpp:306] Persisting metadata (8 
> bytes) to leveldb took 232us
> I0715 19:03:59.307512 222965760 replica.cpp:342] Persisted promised to 1
> I0715 19:03:59.307615 224575488 slave.cpp:324] Slave hostname: localhost
> I0715 19:03:59.307631 224575488 slave.cpp:325] Slave checkpoint: true
> I0715 19:03:59.307802 222965760 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0715 19:03:59.307924 223502336 state.cpp:33] Recovering state from 
> '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/0/meta'
> I0715 19:03:59.308027 2019271440 containerizer.cpp:124] Using isolation: 
> posix/cpu,posix/mem
> I0715 19:03:59.308171 222429184 status_update_manager.cpp:193] Recovering 
> status update manager
> I0715 19:03:59.308205 225112064 state.cpp:33] Recovering state from 
> '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/1/meta'
> I0715 19:03:59.308316 221892608 containerizer.cpp:287] Recovering 
> containerizer
> I0715 19:03:59.308384 221356032 status_update_manager.cpp:193] Recovering 
> status update manager
> I0715 19:03:59.308575 225112064 containerizer.cpp:287] Recovering 
> containerizer
> I0715 19:03:59.309072 222429184 slave.cpp:3130] Finished recovery
> I0715 19:03:59.309079 223502336 slave.cpp:3130] Finished recovery
> F0715 19:03:59.309267 222429184 slave.cpp:3141] CHECK_SOME(state::checkpoint(path, bootId.get())): Failed to checkpoint '1405473915' to '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/0/meta/boot_id': Failed to open file '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/0/meta/boot_id': No such file or directory
> *** Check failure stack trace: ***
> I0715 19:03:59.309270 221892608 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0715 19:03:59.309516 221892608 leveldb.cpp:343] Persisting action (8 bytes) 
> to leveldb took 219us
> I0715 19:03:59.309502 223502336 slave.cpp:168] Slave started on 
> 3)@127.0.0.1:64313
> I0715 19:03:59.309582 222965760 slave.cpp:603] New master detected at 
> master@127.0.0.1:64313
> I0715 19:03:59.309588 221892608 replica.cpp:676] Persisted action at 0
> I0715 19:03:59.309665 222965760 slave.cpp:639] No credentials provided. 
> Attempting to register without authentication
> I0715 19:03:59.309685 225112064 status_update_manager.cpp:167] New master 
> detected at master@127.0.0.1:64313
> I0715 19:03:59.309798 223502336 slave.cpp:279] Slave resources: cpus(*):4; 
> mem(*):7168; disk(*):470714; ports(*):[31000-32000]
> I0715 19:03:59.310104 224038912 replica.cpp:508] Replica received write 
> request for position 0
> I0715 19:03:59.310331 222965760 slave.cpp:652] Detecting new master
> I0715 19:03:59.310395 224038912 leveldb.cpp:438] Reading position from 
> leveldb took 30us
> I0715 19:03:59.310642 223502336 slave.cpp:324] Slave hostname: localhost
> I0715 19:03:59.310657 223502336 slave.cpp:325] Slave checkpoint: true
> I0715 19:03:59.310689 224038912 leveldb.cpp:343] Persisting action (14 bytes) 
> to leveldb took 227us
> I0715 19:03:59.310722 224038912 replica.cpp:676] Persisted action at 0
> I0715 19:03:59.310936 222965760 replica.cpp:655] Replica received learned 
> notice for position 0
> I0715 19:03:59.311103 222965760 leveldb.cpp:343] Persisting action (16 bytes) 
> to leveldb took 160us
>     @        0x10b3d54f9  google::LogMessage::SendToLog()
> I0715 19:03:59.311158 221892608 state.cpp:33] Recovering state from 
> '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/2/meta'
> I0715 19:03:59.311436 222965760 replica.cpp:676] Persisted action at 0
> I0715 19:03:59.311514 222965760 replica.cpp:661] Replica learned NOP action 
> at position 0
> I0715 19:03:59.311544 221892608 status_update_manager.cpp:193] Recovering 
> status update manager
> I0715 19:03:59.311612 221892608 containerizer.cpp:287] Recovering 
> containerizer
> I0715 19:03:59.311643 222965760 log.cpp:672] Writer started with ending 
> position 0
>     @        0x10b3d5a24  google::LogMessage::Flush()
> I0715 19:03:59.311983 225112064 slave.cpp:3130] Finished recovery
>     @        0x10b3d8b0f  google::LogMessageFatal::~LogMessageFatal()
> I0715 19:03:59.312419 224038912 leveldb.cpp:438] Reading position from 
> leveldb took 43us
> I0715 19:03:59.312515 222965760 slave.cpp:603] New master detected at 
> master@127.0.0.1:64313
> I0715 19:03:59.312854 222965760 slave.cpp:639] No credentials provided. 
> Attempting to register without authentication
> I0715 19:03:59.312891 222965760 slave.cpp:652] Detecting new master
> I0715 19:03:59.312924 222965760 status_update_manager.cpp:167] New master 
> detected at master@127.0.0.1:64313
>     @        0x10b3d60f9  google::LogMessageFatal::~LogMessageFatal()
>     @        0x10ad381b3  _CheckFatal::~_CheckFatal()
>     @        0x10ad37a29  _CheckFatal::~_CheckFatal()
>     @        0x10af8371f  mesos::internal::slave::Slave::__recover()
>     @        0x10b30df43  process::ProcessBase::visit()
>     @        0x10b304d44  process::ProcessManager::resume()
>     @        0x10b30488f  process::schedule()
>     @     0x7fff907b0899  _pthread_body
>     @     0x7fff907b072a  _pthread_start
>     @     0x7fff907b4fc9  thread_start
> ../../src/tests/script.cpp:85: Failure
> Failed
> low_level_scheduler_pthread_test.sh terminated with signal Abort trap: 6
> make[3]: *** [check-local] Segmentation fault: 11
> make[2]: *** [check-am] Error 2
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
