[
https://issues.apache.org/jira/browse/MESOS-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364360#comment-14364360
]
Joris Van Remoortere commented on MESOS-2161:
---------------------------------------------
We have figured out what the issue is. [~nnielsen] has a quick patch that
almost entirely removes the problem, and we will have a follow up patch that
fully fixes this issue (the delay is to ensure we maintain version upgrade
compatibility).
The pattern that is being used to manage the memory allocated in JNI is to map
the Java 'finalize' function to a clean-up function in C++ that calls the
object destructor. This pattern relies on proper reference counting of the Java
object, in this case the 'Future'.
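To make the pattern concrete, here is a minimal sketch with no real JNI in it. The 'Future' struct, the 'Handle' alias, and the 'create' / 'finalize' functions are hypothetical stand-ins, not the Mesos sources: 'Handle' plays the role of the address the Java object would hold, and 'finalize' plays the role of the native clean-up that the Java 'finalize' maps to.

```cpp
#include <cassert>
#include <cstdint>

// Counts C++ objects that are currently alive (for illustration only).
static int liveFutures = 0;

// Hypothetical stand-in for the C++ object owned by the Java 'Future'.
struct Future {
  Future() { ++liveFutures; }
  ~Future() { --liveFutures; }
};

// Stand-in for the value the Java object would store: the address of
// the heap-allocated C++ object.
using Handle = std::intptr_t;

// Called when the Java object is constructed.
Handle create() { return reinterpret_cast<Handle>(new Future()); }

// Called from the Java 'finalize': runs the C++ destructor.
// After this, any raw pointer derived from 'handle' is dangling.
void finalize(Handle handle) { delete reinterpret_cast<Future*>(handle); }

// Walks through one create/finalize cycle and reports how many
// objects remain alive afterwards.
int liveAfterCreateThenFinalize() {
  Handle handle = create();
  assert(liveFutures == 1);
  finalize(handle);
  return liveFutures;
}
```

The point of the sketch is that the C++ object lives exactly as long as the garbage collector considers the Java object reachable, so the scheme is only safe if the right Java object is kept reachable for the whole native call.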
*In 0.21 and before, there was a subtle mistake:*
+_What we thought was happening:_+ The 'Future' was being kept alive during the
invocation of the JNI calls, because the JNI calls took a 'thiz' object that we
assumed was the 'Future'.
+_What was actually happening:_+ The 'thiz' jobject was actually referring to
the outer class 'AbstractState' because that is where the member functions are
defined.
+_Why this was around for so long / hard to catch:_+ The interplay (race)
between the garbage collector and completion time of the
'__fetch_get_timeout()' routine meant that only under sufficient pressure would
there be a garbage collection that triggered a 'finalize()' on a Future that
was still being 'waited' on.
*The quick fix:*
Take a copy (_as opposed to keeping a pointer_) of the future upon entering the
JNI function, so that even if the garbage collector finalizes the 'Future'
during the execution of the JNI function, there is still a valid C++ reference
count on the C++ 'Future' object. The strategy here is to minimize the
surface area of the bug, such that only a garbage collection between function
invocation and the copy would cause a problem. Prior to this fix, due to the
blocking nature of the 'Future::get()' functions, there was ample surface area.
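A rough sketch of the copy-on-entry idea, again without real JNI. The 'Future' type below is a hypothetical stand-in whose copies share one reference-counted state. 'fetchGet' and its 'whileBlocked' callback are illustrative names, with the callback standing in for everything that can happen during the blocking wait, including a simulated 'finalize'.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in: copies of a Future share one counted state.
struct Future {
  struct State {
    bool ready = false;
    int value = 0;
  };
  std::shared_ptr<State> state = std::make_shared<State>();

  void set(int v) { state->value = v; state->ready = true; }
  bool isReady() const { return state->ready; }
  int get() const { return state->value; }
};

// Sketch of a native entry point after the quick fix. 'handle' points
// at the heap-allocated Future owned by the Java object.
int fetchGet(Future* handle, const std::function<void()>& whileBlocked) {
  Future future = *handle;   // copy first: takes its own reference
  whileBlocked();            // may delete *handle (simulated finalize)
  assert(future.isReady());  // only the copy is touched from here on
  return future.get();
}

// Simulates a finalize that lands in the middle of the blocking call.
int fetchAfterFinalize() {
  Future* handle = new Future();
  Future producer = *handle;  // the side that completes the future
  return fetchGet(handle, [&] {
    delete handle;            // garbage collector finalizes the object
    producer.set(42);         // result arrives afterwards
  });
}
```

Had 'fetchGet' kept using 'handle' after the callback, it would be reading freed memory. The remaining window is the line that performs the copy itself, which is the small surface area described above.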
*The long term fix:*
Pass the correct jobject to the JNI function such that the object is not
finalized prematurely. This requires some massaging of the API, and is more
complicated due to our upgrade compatibility requirements.
I will post a later review for the long term fix.
> AbstractState JNI check fails for Marathon framework
> ----------------------------------------------------
>
> Key: MESOS-2161
> URL: https://issues.apache.org/jira/browse/MESOS-2161
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.21.0
> Environment: Mesos 0.21.0
> Marathon 0.7.5
> Fedora 20
> Reporter: Matthew Sanders
> Assignee: Joris Van Remoortere
> Attachments: mesos_core_dump_gdb.txt.bz2
>
>
> I've recently upgraded to mesos 0.21.0 and now it seems that every few
> minutes or so I see the following error, which kills marathon.
> Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25
> 18:12:42,064] INFO 10.133.128.26 - - [26/Nov/2014:00:12:41 +0000] "GET
> /v2/apps HTTP/1.1" 200 2321 "http://marathon:8080/" "Mozilla/5.0 (X11; Linux
> x86_64; rv:31.0) Gecko/20100101 Firefox/31.0"
> (mesosphere.chaos.http.ChaosRequestLog:15)
> Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25
> 18:12:42,238] INFO 10.133.128.26 - - [26/Nov/2014:00:12:42 +0000] "GET
> /v2/deployments HTTP/1.1" 200 2 "http://marathon:8080/" "Mozilla/5.0 (X11;
> Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0"
> (mesosphere.chaos.http.ChaosRequestLog:15)
> Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25
> 18:12:42,961] INFO 10.192.221.95 - - [26/Nov/2014:00:12:42 +0000] "GET
> /v2/apps HTTP/1.1" 200 2321 "http://marathon:8080/" "Mozilla/5.0 (Macintosh;
> Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko)
> Chrome/39.0.2171.65 Safari/537...
> Nov 25 18:12:43 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25
> 18:12:43,032] INFO 10.192.221.95 - - [26/Nov/2014:00:12:42 +0000] "GET
> /v2/deployments HTTP/1.1" 200 2 "http://marathon:8080/" "Mozilla/5.0
> (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko)
> Chrome/39.0.2171.65 Safari...
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: F1125
> 18:12:44.146260 5897 check.hpp:79] Check failed: f.isReady()
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: *** Check
> failure stack trace: ***
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @
> 0x7f8176a2b17c google::LogMessage::Fail()
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @
> 0x7f8176a2b0d5 google::LogMessage::SendToLog()
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @
> 0x7f8176a2aab3 google::LogMessage::Flush()
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @
> 0x7f8176a2da3b google::LogMessageFatal::~LogMessageFatal()
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @
> 0x7f8176a1ea64 _checkReady<>()
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @
> 0x7f8176a1d43b Java_org_apache_mesos_state_AbstractState__1_1names_1get
> Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @
> 0x7f81f644ca70 (unknown)
> Nov 25 18:12:44 gianthornet.trading.imc.intra systemd[1]: marathon.service:
> main process exited, code=killed, status=6/ABRT
> Here's the command that mesos-master is being run with
> /usr/local/sbin/mesos-master
> --zk=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos
> --port=5050 --log_dir=/var/log/mesos --quorum=1 --work_dir=/var/lib/mesos
> Here's the command that the slave is running with:
> /usr/local/sbin/mesos-slave
> --master=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos
> --log_dir=/var/log/mesos --containerizers=docker,mesos
> --executor_registration_timeout=5mins
> --attributes=country:us;datacenter:njl3;environment:dev;region:amer;timezone:America/Chicago
> I realize this could also be filed to marathon, but it sort of looks like a
> c++ issue to me, which is why I came here to post this. Any help would be
> greatly appreciated.