Ernestas Vaiciukevičius created STORM-1043:
----------------------------------------------
Summary: Concurrent access to state on local FS by multiple supervisors
Key: STORM-1043
URL: https://issues.apache.org/jira/browse/STORM-1043
Project: Apache Storm
Issue Type: Bug
Affects Versions: 0.9.5
Reporter: Ernestas Vaiciukevičius
Hi,
We are running a storm-mesos cluster, and occasionally workers die or are "lost"
in Mesos. When this happens, it often coincides with errors in the logs related to
the supervisor's local state.
Looking at the Storm code, it seems this might be caused by the way multiple
supervisor processes access the local state in the same directory via
VersionedStore.
For example:
https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/supervisor.clj#L434
Here every supervisor does this concurrently:
1. reads latest state from FS
2. possibly updates the state
3. writes the new version of the state
Some updates could be lost if there are two or more supervisors and they execute
the above steps concurrently: only the updates from the last supervisor to write
would remain in the latest state version on disk.
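To make the interleaving concrete, here is a minimal sketch of the three steps above. It is not the Storm code: the class LostUpdateDemo, its methods, and the key names are hypothetical, and an in-memory map stands in for the versioned directory on disk. The two "supervisors" are interleaved by hand rather than with threads, so the result is deterministic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class LostUpdateDemo {
    // Hypothetical stand-in for the versioned directory on disk:
    // version number -> full snapshot of the state map.
    static final TreeMap<Long, Map<String, String>> versions = new TreeMap<>();

    // Step 1: read the latest snapshot (empty if nothing was written yet).
    static Map<String, String> readLatest() {
        return versions.isEmpty()
                ? new HashMap<>()
                : new HashMap<>(versions.lastEntry().getValue());
    }

    // Step 3: write the whole snapshot as a new version.
    static void writeNewVersion(Map<String, String> snapshot) {
        long next = versions.isEmpty() ? 1L : versions.lastKey() + 1;
        versions.put(next, new HashMap<>(snapshot));
    }

    // Two supervisors both read before either one writes.
    static Map<String, String> simulateInterleaving() {
        versions.clear();
        Map<String, String> a = readLatest();   // supervisor A, step 1
        Map<String, String> b = readLatest();   // supervisor B, step 1
        a.put("supervisor-a", "assignment-1");  // supervisor A, step 2
        b.put("supervisor-b", "assignment-2");  // supervisor B, step 2
        writeNewVersion(a);                     // supervisor A, step 3
        writeNewVersion(b);                     // supervisor B, step 3 (overwrites A's view)
        return readLatest();
    }

    public static void main(String[] args) {
        // Only supervisor B's key is present in the latest version.
        System.out.println(simulateInterleaving());
    }
}
```

In this sketch the latest version contains only the key written by supervisor B, because B's snapshot was taken before A's write and never saw A's update. The version holding A's update still exists, but it is no longer the latest one.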
We observed that the local state changes quite often (on the order of seconds),
so the likelihood of this concurrency issue occurring is high.
Some examples of exceptions:
------------------------------------------
java.lang.RuntimeException: Version already exists or data already exists
    at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:85) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:79) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.utils.LocalState.persist(LocalState.java:101) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.utils.LocalState.put(LocalState.java:82) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.utils.LocalState.put(LocalState.java:76) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this7400.invoke(supervisor.clj:382) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
---------------------------------------
java.io.FileNotFoundException: File '/var/lib/storm/supervisor/localstate/1441034838231' does not exist
    at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299) ~[commons-io-2.4.jar:2.4]
    at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763) ~[commons-io-2.4.jar:2.4]
    at backtype.storm.utils.LocalState.deserializeLatestVersion(LocalState.java:61) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.utils.LocalState.snapshot(LocalState.java:47) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.utils.LocalState.get(LocalState.java:72) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:234) ~[storm-core-0.9.5.jar:0.9.5]
    at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.5.1.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
    at clojure.core$apply.invoke(core.clj:619) ~[clojure-1.5.1.jar:na]
    at clojure.core$partial$fn4190.doInvoke(core.clj:2396) ~[clojure-1.5.1.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.5.1.jar:na]
    at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
-----------------------------------------