Robert Joseph Evans created STORM-411:
-----------------------------------------

             Summary: Extend file uploads to support more distributed cache 
like semantics
                 Key: STORM-411
                 URL: https://issues.apache.org/jira/browse/STORM-411
             Project: Apache Storm (Incubating)
          Issue Type: Improvement
            Reporter: Robert Joseph Evans
            Assignee: Robert Joseph Evans


One of the big features that we are asked about for a hosted Storm instance is 
how to distribute and update large shared data sets with topologies.  These 
could be things like IP-to-geolocation tables, machine-learned models, or just 
about anything else.

Currently with Storm you either have to package the data as part of your 
topology jar, install it on the machine ahead of time, or access an external 
service to pull the data down.  Packaging it in the jar does not allow users to 
update the dataset without restarting their topologies, installing it on the 
machine will not work for a hosted Storm solution, and pulling it from an 
external service without the supervisors being aware of it means it would be 
downloaded multiple times and may not be cleaned up properly afterwards.

I propose that instead we set up something similar to the distributed cache on 
Hadoop, but with a pluggable backend.  The APIs would be for a simple blobstore 
so they could be backed by local disk on nimbus, HDFS, Swift, or even 
BitTorrent.
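
To make the idea concrete, here is a rough sketch of what a pluggable blobstore 
interface with a local-disk backend might look like.  It is written in Python 
only for brevity.  Every name in it (BlobStore, LocalDiskBlobStore, put, get, 
latest_version, delete) is hypothetical and not an existing Storm API, and the 
on-disk layout of one directory per key with one file per version is an 
assumption made for illustration.

```python
import os
import shutil
from abc import ABC, abstractmethod
from typing import Optional


class BlobStore(ABC):
    """Hypothetical pluggable blob store interface (not an actual Storm API)."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> int:
        """Store a new version of key and return its version number."""

    @abstractmethod
    def get(self, key: str, version: Optional[int] = None) -> bytes:
        """Return the given version of key, or the latest if version is None."""

    @abstractmethod
    def latest_version(self, key: str) -> int:
        """Return the newest version number of key, or 0 if it has none."""

    @abstractmethod
    def delete(self, key: str) -> None:
        """Remove every version of key."""


class LocalDiskBlobStore(BlobStore):
    """Assumed backend: one directory per key, one file per version."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _key_dir(self, key: str) -> str:
        # Assumption: keys are plain names, so reject anything path-like.
        if not key or key in (".", "..") or os.sep in key or "/" in key:
            raise ValueError(f"invalid blob key: {key!r}")
        return os.path.join(self.root, key)

    def latest_version(self, key: str) -> int:
        key_dir = self._key_dir(key)
        if not os.path.isdir(key_dir):
            return 0
        # Only purely numeric file names are versions; temp files are skipped.
        versions = [int(n) for n in os.listdir(key_dir) if n.isdigit()]
        return max(versions, default=0)

    def put(self, key: str, data: bytes) -> int:
        key_dir = self._key_dir(key)
        os.makedirs(key_dir, exist_ok=True)
        version = self.latest_version(key) + 1
        tmp_path = os.path.join(key_dir, f".tmp-{version}")
        with open(tmp_path, "wb") as f:
            f.write(data)
        # Rename into place so readers never see a partially written version.
        os.replace(tmp_path, os.path.join(key_dir, str(version)))
        return version

    def get(self, key: str, version: Optional[int] = None) -> bytes:
        if version is None:
            version = self.latest_version(key)
        path = os.path.join(self._key_dir(key), str(version))
        if version < 1 or not os.path.isfile(path):
            raise KeyError(f"{key} version {version}")
        with open(path, "rb") as f:
            return f.read()

    def delete(self, key: str) -> None:
        shutil.rmtree(self._key_dir(key), ignore_errors=True)
```

An HDFS or Swift backend would implement the same four methods.  This sketch 
assumes a single writer per key; a real implementation would need to handle 
concurrent uploads and access control.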

Adding new "files" to the blob store or downloading them would by default go 
through nimbus, but if an external store is properly configured, direct access 
to the store could be used.

The worker process would access the files through symlinks in the current 
working directory of the worker.  On POSIX systems, when a new version of the 
file is made available the symlink would be atomically replaced by a new one 
pointing to the new version.  Windows does not support atomic replacement of a 
symlink, so we should provide a simple library that returns resolved paths to 
be used, can detect when the links have changed, and has retry logic built in 
in case the symlink disappears in the middle of a swap.
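
As an illustration of the symlink handling, the sketch below shows one way the 
swap and the retrying lookup could work.  It is in Python for brevity, and the 
function names (swap_symlink, resolve_with_retry) and the retry count and delay 
are assumptions chosen for the example, not a proposed API.  The swap relies on 
rename being atomic on POSIX, which Python exposes as os.replace.

```python
import os
import time
import uuid


def swap_symlink(link_path: str, new_target: str) -> None:
    """Point link_path at new_target by renaming a temporary link over it."""
    tmp_link = f"{link_path}.tmp-{uuid.uuid4().hex}"
    os.symlink(new_target, tmp_link)
    try:
        # On POSIX this is rename(2), which replaces the old link atomically.
        os.replace(tmp_link, link_path)
    except OSError:
        os.unlink(tmp_link)
        raise


def resolve_with_retry(link_path: str, retries: int = 5,
                       delay: float = 0.05) -> str:
    """Return the real path behind link_path, retrying if it is missing."""
    for _ in range(retries):
        resolved = os.path.realpath(link_path)
        # If the link vanished mid-swap, the resolved path will not exist yet.
        if os.path.exists(resolved):
            return resolved
        time.sleep(delay)
    raise FileNotFoundError(link_path)
```

A worker could remember the path returned by resolve_with_retry and compare it 
with a later call to detect that a new version has been linked in.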

We are in the early stages of implementing this functionality and would like 
some feedback on the concepts before getting too far along. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
