[ 
https://issues.apache.org/jira/browse/MESOS-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138336#comment-14138336
 ] 

Chengwei Yang commented on MESOS-1804:
--------------------------------------

# cat /etc/issue
Red Hat Enterprise Linux Server release 6.4 (Santiago)
Kernel \r on an \m

# rpm -qa | grep gcc
libgcc-4.4.7-3.el6.x86_64
gcc-4.4.7-3.el6.x86_64
gcc-c++-4.4.7-3.el6.x86_64

# cat /proc/version 
Linux version 2.6.32-431.el6.x86_64 (root@localhost) (gcc version 4.4.7 
20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri Mar 14 11:51:38 CST 2014

This bug is easy to reproduce with the following steps.

1. Continuously submit jobs to Chronos, each with the attribute 
"epsilon":"PT10M".
2. Chronos crashes after about three thousand jobs have been submitted.


> the "store" component cause on-top framework (chronos) crash
> ------------------------------------------------------------
>
>                 Key: MESOS-1804
>                 URL: https://issues.apache.org/jira/browse/MESOS-1804
>             Project: Mesos
>          Issue Type: Bug
>         Environment: mesos-0.19.0
>            Reporter: Chengwei Yang
>            Assignee: Chengwei Yang
>
> Chronos running with mesos-0.19.0 may crash as shown below.
> {code}
> [2014-09-05 15:21:36,095] INFO State J_chronos_job_34 does not exist yet. Adding to state (com.airbnb.scheduler.state.MesosStatePersistenceStore:146)
> F0905 15:21:36.175230 27727 org_apache_mesos_state_AbstractState.cpp:319] Check failed: future->isReady()
> *** Check failure stack trace: ***
> @ 0x7f4f1ecb199d google::LogMessage::Fail()
> @ 0x7f4f1ecb59b7 google::LogMessage::SendToLog()
> @ 0x7f4f1ecb3839 google::LogMessage::Flush()
> @ 0x7f4f1ecb3b3d google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f4f1ec2ef90 Java_org_apache_mesos_state_AbstractState__1_1store_1get
> @ 0x7f4f18293d45 (unknown)
> Aborted (core dumped)
> {code}
> The related code snippet is shown below:
> {code}
> $ sed -ne '311,334p' src/java/jni/org_apache_mesos_state_AbstractState.cpp
> JNIEXPORT jobject JNICALL 
> Java_org_apache_mesos_state_AbstractState__1_1store_1get
>   (JNIEnv* env, jobject thiz, jlong jfuture)
> {
>   Future<Option<Variable> >* future = (Future<Option<Variable> >*) jfuture;
>   future->await();
>   if (future->isFailed()) {
>     jclass clazz = env->FindClass("java/util/concurrent/ExecutionException");
>     env->ThrowNew(clazz, future->failure().c_str());
>     return NULL;
>   } else if (future->isDiscarded()) {
>     // TODO(benh): Consider throwing an ExecutionException since we
>     // never return true for 'isCancelled'.
>     jclass clazz =
>       env->FindClass("java/util/concurrent/CancellationException");
>     env->ThrowNew(clazz, "Future was discarded");
>     return NULL;
>   }
>   CHECK_READY(*future);
>   if (future->get().isSome()) {
>     Variable* variable = new Variable(future->get().get());
> {code}
> The root cause seems to be that CHECK_READY(*future) fails, which aborts the
> whole Chronos process instead of raising a Java exception.
> See chronos issue: https://github.com/airbnb/chronos/issues/253
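
To make the control flow of the quoted snippet easier to follow, here is a
minimal standalone sketch of its branch order. It is not the Mesos code: the
State and Outcome enums and the function names are hypothetical stand-ins, and
it assumes a future has exactly four states (pending, ready, failed,
discarded). Under that assumption, the only state that reaches CHECK_READY
without being ready is "pending", and that is the case that aborts. The sketch
does not explain why the future would still be pending after await().

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the states of a future (an assumption modeled on
// the isFailed()/isDiscarded()/isReady() calls in the quoted snippet).
enum class State { PENDING, READY, FAILED, DISCARDED };

// What the JNI wrapper ends up doing for a given state.
enum class Outcome {
  RETURN_VALUE,        // Falls through to future->get().
  THROW_EXECUTION,     // java.util.concurrent.ExecutionException.
  THROW_CANCELLATION,  // java.util.concurrent.CancellationException.
  ABORT                // CHECK_READY fails and the process aborts.
};

// Mirrors the branch order of the snippet: failed first, then discarded,
// then the CHECK_READY assertion.
Outcome outcomeFor(State state) {
  if (state == State::FAILED) {
    return Outcome::THROW_EXECUTION;
  } else if (state == State::DISCARDED) {
    return Outcome::THROW_CANCELLATION;
  }
  if (state != State::READY) {
    return Outcome::ABORT;  // The fatal check, not a Java exception.
  }
  return Outcome::RETURN_VALUE;
}

// Human-readable label for an outcome, for logging or printing.
std::string describe(Outcome outcome) {
  switch (outcome) {
    case Outcome::RETURN_VALUE:       return "return value";
    case Outcome::THROW_EXECUTION:    return "throw ExecutionException";
    case Outcome::THROW_CANCELLATION: return "throw CancellationException";
    case Outcome::ABORT:              return "abort process";
  }
  return "unknown";
}

// Small self-check: a pending future is the case that aborts.
void selfCheck() {
  assert(outcomeFor(State::PENDING) == Outcome::ABORT);
}
```

Calling describe(outcomeFor(State::PENDING)) yields "abort process", while the
failed and discarded states map to Java exceptions that the framework could
catch.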



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
