Robert Joseph Evans created STORM-411:
-----------------------------------------
Summary: Extend file uploads to support more distributed cache
like semantics
Key: STORM-411
URL: https://issues.apache.org/jira/browse/STORM-411
Project: Apache Storm (Incubating)
Issue Type: Improvement
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
One of the big features that we are asked about for a hosted storm instance is
how to distribute and update large shared data sets with topologies. These
could be things like ip to geolocation tables, machine learned models or just
about anything else.
Currently with storm you either have to package it as part of your topology
jar, install in on the machine already, or access an external service to pull
the data down. Packaging it in the jar does not allow users to update the
dataset without restarting their topologies, installing it on the machine will
not work for a hosted storm solution, and pulling it form an external service
without the supervisors being aware of it would mean it would be downloaded
multiple times, and may not be cleaned up properly afterwards.
I propose that instead we setup something similar to the distributed cache on
Hadoop, but with a pluggable backend. The APIs would be for a simple blobstore
so they could be backed by local disk on nimbus, HDFS, swift, or even
bittorrent.
Adding new "files" to the blob store or downloading them would by default go
through nimbus, but if an external store is properly configured direct access
into the store could be used.
The worker process would access the files through symlinks in the current
working directory of the worker. For posix systems when a new version of the
file is made available the symlink would atomically be replaced by a new one
pointing to the new version. Windows does not support atomic replacement of a
symlink so we should provide a simple library that will return resolved paths
to be used, and can detect when the links have changed, but have some retry
logic built in, if the symlink disappears in the middle.
We are in the early stages of implementing this functionality and would like
some feedback on the concepts before getting too far along.
--
This message was sent by Atlassian JIRA
(v6.2#6252)