[ 
https://issues.apache.org/jira/browse/MESOS-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276116#comment-15276116
 ] 

Bernd Mathiske commented on MESOS-3235:
---------------------------------------

It seems doubtful that lengthening the wait time for task completion solves 
much, since successful runs are way shorter than the default 15 seconds, 
typically in the low single digit second range. Machines aren't that slow, are 
they? And we get these failures on machines that are known to be fast 
occasionally as well. I suspect something else is wrong here.

What I have seen in failure logs is that one task somehow has not produced 
status updates all the way up to the AWAIT statement in question - although it 
must have reached the contention barrier which asserts that all tasks have been 
launched as the fetcher has been observed downloading every script. So one 
guess is that something is blocking/eating/delaying status updates at some 
stage - occasionally. In all the cases I have seen the tasks are not launched 
in serial order. And that's exactly why I wrote this test! So we can see if we 
are dealing with concurrency correctly. Too bad we don't know what's failing 
yet. 

If we had a way to reproduce this behavior more often, we could switch on more 
logging and just repeat the test often enough to find something. But repeating 
the test tends to make the problem go away.

Ideas?

> FetcherCacheHttpTest.HttpCachedSerialized and 
> FetcherCacheHttpTest.HttpCachedConcurrent are flaky
> -------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-3235
>                 URL: https://issues.apache.org/jira/browse/MESOS-3235
>             Project: Mesos
>          Issue Type: Bug
>          Components: fetcher, tests
>    Affects Versions: 0.23.0
>            Reporter: Joseph Wu
>            Assignee: haosdent
>              Labels: mesosphere
>             Fix For: 0.27.0
>
>         Attachments: fetchercache_log_centos_6.txt
>
>
> On OSX, {{make clean && make -j8 V=0 check}}:
> {code}
> [----------] 3 tests from FetcherCacheHttpTest
> [ RUN      ] FetcherCacheHttpTest.HttpCachedSerialized
> HTTP/1.1 200 OK
> Date: Fri, 07 Aug 2015 17:23:05 GMT
> Content-Length: 30
> I0807 10:23:05.673596 2085372672 exec.cpp:133] Version: 0.24.0
> E0807 10:23:05.675884 184373248 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> I0807 10:23:05.675897 182226944 exec.cpp:207] Executor registered on slave 
> 20150807-102305-139395082-52338-52313-S0
> E0807 10:23:05.683980 184373248 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> Registered executor on 10.0.79.8
> Starting task 0
> Forked command at 54363
> sh -c './mesos-fetcher-test-cmd 0'
> E0807 10:23:05.694953 184373248 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> Command exited with status 0 (pid: 54363)
> E0807 10:23:05.793927 184373248 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> I0807 10:23:06.590008 2085372672 exec.cpp:133] Version: 0.24.0
> E0807 10:23:06.592244 355938304 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> I0807 10:23:06.592243 353255424 exec.cpp:207] Executor registered on slave 
> 20150807-102305-139395082-52338-52313-S0
> E0807 10:23:06.597995 355938304 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> Registered executor on 10.0.79.8
> Starting task 1
> Forked command at 54411
> sh -c './mesos-fetcher-test-cmd 1'
> E0807 10:23:06.608708 355938304 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> Command exited with status 0 (pid: 54411)
> E0807 10:23:06.707649 355938304 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> ../../src/tests/fetcher_cache_tests.cpp:860: Failure
> Failed to wait 15secs for awaitFinished(task.get())
> *** Aborted at 1438968214 (unix time) try "date -d @1438968214" if you are 
> using GNU date ***
> [  FAILED  ] FetcherCacheHttpTest.HttpCachedSerialized (28685 ms)
> [ RUN      ] FetcherCacheHttpTest.HttpCachedConcurrent
> PC: @        0x113723618 process::Owned<>::get()
> *** SIGSEGV (@0x0) received by PID 52313 (TID 0x118d59000) stack trace: ***
>     @     0x7fff8fcacf1a _sigtramp
>     @     0x7f9bc3109710 (unknown)
>     @        0x1136f07e2 mesos::internal::slave::Fetcher::fetch()
>     @        0x113862f9d 
> mesos::internal::slave::MesosContainerizerProcess::fetch()
>     @        0x1138f1b5d 
> _ZZN7process8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKNS2_11ContainerIDERKNS2_11CommandInfoERKNSt3__112basic_stringIcNSC_11char_traitsIcEENSC_9allocatorIcEEEERK6OptionISI_ERKNS2_7SlaveIDES6_S9_SI_SM_SP_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSW_FSU_T1_T2_T3_T4_T5_ET6_T7_T8_T9_T10_ENKUlPNS_11ProcessBaseEE_clES1D_
>     @        0x1138f18cf 
> _ZNSt3__110__function6__funcIZN7process8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKNS5_11ContainerIDERKNS5_11CommandInfoERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEERK6OptionISK_ERKNS5_7SlaveIDES9_SC_SK_SO_SR_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSY_FSW_T1_T2_T3_T4_T5_ET6_T7_T8_T9_T10_EUlPNS2_11ProcessBaseEE_NSI_IS1G_EEFvS1F_EEclEOS1F_
>     @        0x1143768cf std::__1::function<>::operator()()
>     @        0x11435ca7f process::ProcessBase::visit()
>     @        0x1143ed6fe process::DispatchEvent::visit()
>     @        0x1127aaaa1 process::ProcessBase::serve()
>     @        0x114343b4e process::ProcessManager::resume()
>     @        0x1143431ca process::internal::schedule()
>     @        0x1143da646 _ZNSt3__114__thread_proxyINS_5tupleIJPFvvEEEEEEPvS5_
>     @     0x7fff95090268 _pthread_body
>     @     0x7fff950901e5 _pthread_start
>     @     0x7fff9508e41d thread_start
> Failed to synchronize with slave (it's probably exited)
> make[3]: *** [check-local] Segmentation fault: 11
> make[2]: *** [check-am] Error 2
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1
> {code}
> This was encountered just once out of 3+ {{make check}}s.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to