[ 
https://issues.apache.org/jira/browse/MESOS-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364360#comment-14364360
 ] 

Joris Van Remoortere commented on MESOS-2161:
---------------------------------------------

We have figured out what the issue is. [~nnielsen] has a quick patch that 
almost entirely removes the problem, and we will have a follow-up patch that 
fully fixes this issue (the delay is to ensure we maintain version upgrade 
compatibility).

The pattern used to manage memory allocated in JNI is to map the Java 
'finalize' function to a clean-up function in C++ that calls the object 
destructor. This pattern relies on proper reference counting of the Java 
object, in this case the 'Future'.
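
A minimal sketch of that pattern, without any real JNI calls. All names here 
('State', 'Future', 'liveStates', 'createHandle', 'finalizeHandle') are 
hypothetical stand-ins and not the Mesos sources; the assumption is that the 
Java object stores the address of a heap-allocated C++ object and that 
'finalize' ends up deleting it.

```cpp
#include <cassert>  // for the checks described in the usage note
#include <cstdint>
#include <memory>

// Hypothetical counter so the effect of the clean-up is observable.
static int liveStates = 0;

// Stand-in for the shared state behind a C++ future.
struct State {
  State() { ++liveStates; }
  ~State() { --liveStates; }
};

// Stand-in for the C++ 'Future': every copy shares one reference-counted State.
struct Future {
  std::shared_ptr<State> state = std::make_shared<State>();
};

// Would run when the Java object is created; the returned value is what the
// Java side would keep in a 'long' field.
std::intptr_t createHandle() {
  return reinterpret_cast<std::intptr_t>(new Future());
}

// Would run from the Java 'finalize' method: invokes the C++ destructor.
void finalizeHandle(std::intptr_t handle) {
  delete reinterpret_cast<Future*>(handle);
}
```

After 'createHandle()' there is one live 'State'; 'finalizeHandle()' on that 
handle destroys it. Any native code still holding the raw pointer at that 
moment is left with a dangling pointer, which is why the pattern depends on 
the Java object staying reachable for as long as native code uses it.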

*In 0.21 and before, there was a subtle mistake:*
+_What we thought was happening:_+ The 'Future' was being kept alive during the 
invocation of the JNI calls, because the JNI calls took a 'thiz' object.
+_What was actually happening:_+ The 'thiz' jobject referred to the outer 
class 'AbstractState', because that is where the member functions are defined.
+_Why this was around for so long / hard to catch:_+ The interplay (race) 
between the garbage collector and the completion time of the 
'__fetch_get_timeout()' routine meant that only under sufficient pressure would 
a garbage collection trigger a 'finalize()' on a Future that was still being 
'waited' on.

*The quick fix:*
Take a copy (_as opposed to keeping a pointer_) of the future upon entering the 
JNI function, so that even if the garbage collector finalizes the 'Future' 
during the execution of the JNI function, there is still a valid C++ reference 
count on the C++ 'Future' object. The strategy is to minimize the surface area 
of the bug, such that only a garbage collection between function invocation and 
the copy would cause a problem. Prior to this fix, due to the blocking nature 
of the 'Future::get()' functions, there was ample surface area.
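
The effect of copying can be sketched as follows. This is a single-threaded 
simulation and not the actual patch: 'State', 'Future', 'finalizeHandle' and 
'waitWithCopy' are hypothetical names, the garbage collector is replaced by an 
explicit call, and it assumes that copies of the C++ future share one 
reference-counted state.

```cpp
#include <cassert>  // for the checks described in the usage note
#include <memory>

// Stand-in for the shared state behind a C++ future.
struct State {
  bool ready = false;
};

// Stand-in for the C++ 'Future': copies share one reference-counted State.
struct Future {
  std::shared_ptr<State> state = std::make_shared<State>();
  bool isReady() const { return state->ready; }
};

// Simulates the Java finalizer: destroys the heap-allocated Future that the
// Java object owns and clears the handle.
void finalizeHandle(Future*& handle) {
  delete handle;
  handle = nullptr;
}

// Simulates a JNI function written in the style of the quick fix.
bool waitWithCopy(Future*& handle) {
  // Copy on entry. The only remaining window for the race is before this line.
  Future copy = *handle;

  // Stand-in for the blocking wait, during which the future completes ...
  copy.state->ready = true;
  // ... and, in the bad case, the garbage collector finalizes the Java object.
  finalizeHandle(handle);

  // Still valid: 'copy' holds its own reference to the shared state.
  return copy.isReady();
}
```

With 'Future* h = new Future();', calling 'waitWithCopy(h)' returns true even 
though the handle is finalized in the middle of the call. Had the function kept 
only the pointer, the final 'isReady()' would have dereferenced freed memory.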

*The long term fix:*
Pass the correct jobject to the JNI function such that the object is not 
finalized prematurely. This requires some massaging of the API, and is more 
complicated due to our upgrade compatibility requirements.

I will post a later review for the long term fix.

> AbstractState JNI check fails for Marathon framework
> ----------------------------------------------------
>
>                 Key: MESOS-2161
>                 URL: https://issues.apache.org/jira/browse/MESOS-2161
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.21.0
>         Environment: Mesos 0.21.0
> Marathon 0.7.5
> Fedora 20
>            Reporter: Matthew Sanders
>            Assignee: Joris Van Remoortere
>         Attachments: mesos_core_dump_gdb.txt.bz2
>
>
> I've recently upgraded to mesos 0.21.0 and now it seems that every few 
> minutes or so I see the following error, which kills marathon. 
> Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 
> 18:12:42,064] INFO 10.133.128.26 -  -  [26/Nov/2014:00:12:41 +0000] "GET 
> /v2/apps HTTP/1.1" 200 2321 "http://marathon:8080/" "Mozilla/5.0 (X11; Linux 
> x86_64; rv:31.0) Gecko/20100101 Firefox/31.0" 
> (mesosphere.chaos.http.ChaosRequestLog:15)
> Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 
> 18:12:42,238] INFO 10.133.128.26 -  -  [26/Nov/2014:00:12:42 +0000] "GET 
> /v2/deployments HTTP/1.1" 200 2 "http://marathon:8080/" "Mozilla/5.0 (X11; 
> Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0" 
> (mesosphere.chaos.http.ChaosRequestLog:15)
> Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 
> 18:12:42,961] INFO 10.192.221.95 -  -  [26/Nov/2014:00:12:42 +0000] "GET 
> /v2/apps HTTP/1.1" 200 2321 "http://marathon:8080/" "Mozilla/5.0 (Macintosh; 
> Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) 
> Chrome/39.0.2171.65 Safari/537...
> Nov 25 18:12:43 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 
> 18:12:43,032] INFO 10.192.221.95 -  -  [26/Nov/2014:00:12:42 +0000] "GET 
> /v2/deployments HTTP/1.1" 200 2 "http://marathon:8080/" "Mozilla/5.0 
> (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) 
> Chrome/39.0.2171.65 Safari...
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: F1125 
> 18:12:44.146260  5897 check.hpp:79] Check failed: f.isReady()
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: *** Check 
> failure stack trace: ***
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @     
> 0x7f8176a2b17c  google::LogMessage::Fail()
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @     
> 0x7f8176a2b0d5  google::LogMessage::SendToLog()
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @     
> 0x7f8176a2aab3  google::LogMessage::Flush()
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @     
> 0x7f8176a2da3b  google::LogMessageFatal::~LogMessageFatal()
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @     
> 0x7f8176a1ea64  _checkReady<>()
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @     
> 0x7f8176a1d43b  Java_org_apache_mesos_state_AbstractState__1_1names_1get
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @     
> 0x7f81f644ca70  (unknown)
> Nov 25 18:12:44 gianthornet.trading.imc.intra systemd[1]: marathon.service: 
> main process exited, code=killed, status=6/ABRT
> Here's the command that mesos-master is being run with
> /usr/local/sbin/mesos-master 
> --zk=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos
>  --port=5050 --log_dir=/var/log/mesos --quorum=1 --work_dir=/var/lib/mesos
> Here's the command that the slave is running with:
> /usr/local/sbin/mesos-slave 
> --master=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos
>  --log_dir=/var/log/mesos --containerizers=docker,mesos 
> --executor_registration_timeout=5mins 
> --attributes=country:us;datacenter:njl3;environment:dev;region:amer;timezone:America/Chicago
> I realize this could also be filed to marathon, but it sort of looks like a 
> c++ issue to me, which is why I came here to post this. Any help would be 
> greatly appreciated. 


