Github user zhuoliu commented on a diff in the pull request:
https://github.com/apache/storm/pull/945#discussion_r47450346
--- Diff: docs/documentation/distcache-blobstore.md ---
@@ -0,0 +1,734 @@
+# Storm Distributed Cache API
+
+The distributed cache feature in storm is used to efficiently distribute files
+(or blobs, which is the equivalent terminology for a file in the distributed
+cache and is used interchangeably in this document) that are large and can
+change during the lifetime of a topology, such as geo-location data,
+dictionaries, etc. Typical use cases include phrase recognition, entity
+extraction, document classification, URL re-writing, location/address detection
+and so forth. Such files may be several KB to several GB in size. For small
+datasets that don't need dynamic updates, including them in the topology jar
+may be fine, but for large files the startup times can become very long.
+In these cases, the distributed cache feature provides fast topology startup,
+especially if the files were previously downloaded for the same submitter and
+are still in the cache. This is useful with frequent deployments, sometimes a
+few a day with updated jars, because the large cached blobs that do not change
+frequently remain available in the distributed cache without being
+re-downloaded.
+
+When a topology is submitted, the user specifies the set of files the
+topology needs. Once a topology is running, the user can at any time request
+that any file in the distributed cache be updated with a newer version. The
+updating of blobs follows an eventual consistency model. If the topology
+needs to know which version of a file it has access to, it is the user's
+responsibility to find this out. The files are stored in a cache with a
+Least-Recently Used (LRU) eviction policy: the supervisor decides which
+cached files are no longer needed and deletes them to free disk space. The
+blobs can be compressed, and the user can request that the blobs be
+uncompressed before accessing them.
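
The supervisor-side cleanup described above is essentially LRU eviction. As a minimal illustrative sketch (not Storm's actual code; the `BlobCache` class, capacities, and sizes here are all hypothetical), the bookkeeping can be modeled with an ordered map:

```python
from collections import OrderedDict

class BlobCache:
    """Toy model of supervisor-side LRU blob eviction (illustrative only)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.blobs = OrderedDict()  # key -> size, least recently used first

    def access(self, key, size):
        """Record a download or re-use of a blob, evicting LRU entries if needed."""
        if key in self.blobs:
            self.used -= self.blobs.pop(key)  # re-insert below to mark as most recent
        while self.blobs and self.used + size > self.capacity:
            _, evicted_size = self.blobs.popitem(last=False)  # drop least recent
            self.used -= evicted_size
        self.blobs[key] = size
        self.used += size

cache = BlobCache(capacity_bytes=100)
cache.access("key1", 60)
cache.access("key2", 30)
cache.access("key1", 60)   # key1 becomes most recently used
cache.access("key3", 40)   # evicts key2, the least recently used
print(list(cache.blobs))   # ['key1', 'key3']
```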
+
+## Motivation for Distributed Cache
+* Allows sharing blobs among topologies.
+* Allows updating the blobs from the command line.
+
+## Distributed Cache Implementations
+The current BlobStore interface has the following two implementations
+* LocalFsBlobStore
+* HdfsBlobStore
+
+Appendix A contains the interface for blob store implementation.
+
+## LocalFsBlobStore
+
+
+The local file system implementation of the blob store is depicted in the
+timeline diagram above.
+
+There are several stages from blob creation to blob download and the
+corresponding execution of a topology. The main stages are as follows.
+
+### Blob Creation Command
+Blobs in the blobstore can be created through command line using the
following command.
+storm blobstore create --file README.txt --acl o::rwa --repl-fctr 4 key1
+The above command creates a blob with a key name âkey1â corresponding
to the file README.txt.
+The access given to all users being read, write and admin with a
replication factor of 4.
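
The ACL string in the command above follows a `<type>:<name>:<permissions>` pattern: `o::rwa` grants other (i.e. all) users read, write and admin. A small parser for that format (a hypothetical helper for illustration, not part of Storm's API):

```python
def parse_acl(acl):
    """Parse a blobstore ACL string such as 'o::rwa' or 'u:bob:rwa'.

    Format: <type>:<name>:<perms>, where type is 'u' (a named user)
    or 'o' (other, i.e. all users), and perms draws from r/w/a.
    """
    kind, name, perms = acl.split(":")
    who = {"u": name, "o": "other"}[kind]
    grants = [p for p in ("r", "w", "a") if p in perms]
    return who, grants

print(parse_acl("o::rwa"))   # ('other', ['r', 'w', 'a'])
```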
+
+### Topology Submission and Blob Mapping
+Users can submit their topology with the following command. The command
includes the
+topology map configuration. The configuration holds two keys âkey1â
and âkey2â with the
+key âkey1â having a local file name mapping named âblob_fileâ and
it is not compressed.
+
+```
+storm jar
/home/y/lib/storm-starter/current/storm-starter-jar-with-dependencies.jar
+storm.starter.clj.word_count test_topo -c
topology.blobstore.map='{"key1":{"localname":"blob_file",
"uncompress":"false"},"key2":{}}'
+```
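
The value passed to `topology.blobstore.map` is plain JSON mapping blob keys to per-blob options, so its resolution into local names can be sketched as follows (an illustrative model, not Storm's code; falling back to the key as the local name is an assumption):

```python
import json

blobstore_map = json.loads(
    '{"key1":{"localname":"blob_file","uncompress":"false"},"key2":{}}'
)

def resolve(key, opts):
    """Derive the local file name and uncompress flag for one blob entry."""
    localname = opts.get("localname", key)          # assumed default: the key itself
    uncompress = opts.get("uncompress", "false") == "true"
    return localname, uncompress

for key, opts in blobstore_map.items():
    print(key, resolve(key, opts))
# key1 ('blob_file', False)
# key2 ('key2', False)
```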
+
+### Blob Creation Process
+The creation of the blob takes place through the interface
âClientBlobStoreâ. Appendix B contains the âClientBlobStoreâ interface.
+The concrete implementation of this interface is the
âNimbusBlobStoreâ. In the case of local file system the client makes a
+call to the nimbus to create the blobs within the local file system. The
nimbus uses the local file system implementation to create these blobs.
+When a user submits a topology, the jar, configuration and code files are
uploaded as blobs with the help of blob store.
+Also, all the other blobs specified by the topology are mapped to it with
the help of topology.blobstore.map configuration.
+
+### Blob Download by the Supervisor
+Finally, the blobs corresponding to a topology are downloaded by the
supervisor once it receives the assignments from the nimbus through
+the same âNimbusBlobStoreâ thrift client that uploaded the blobs. The
supervisor downloads the code, jar and conf blobs by calling the
+âNimbusBlobStoreâ client directly while the blobs specified in the
topology.blobstore.map are downloaded and mapped locally with the help
+of the Localizer. The Localizer talks to the âNimbusBlobStoreâ thrift
client to download the blobs and adds the blob compression and local
+blob name mapping logic to suit the implementation of a topology. Once all
the blobs have been downloaded the workers are launched to run
+the topologies.
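
The download path just described (fetch the blob, optionally uncompress it, then expose it under its mapped local name) can be sketched with a stand-in for the thrift call; everything here is illustrative, not the Localizer's real code:

```python
import gzip
import os
import shutil
import tempfile

def localize_blob(fetch, key, localname, uncompress, dest_dir):
    """Toy model of the Localizer: fetch a blob's bytes, optionally
    gunzip them, and place the result under the mapped local name."""
    raw = os.path.join(dest_dir, key + ".blob")
    with open(raw, "wb") as f:
        f.write(fetch(key))              # stand-in for the NimbusBlobStore thrift call
    final = os.path.join(dest_dir, localname)
    if uncompress:
        with gzip.open(raw, "rb") as src, open(final, "wb") as dst:
            shutil.copyfileobj(src, dst)
    else:
        shutil.copyfile(raw, final)
    return final

# Usage with a fake in-memory blob store:
store = {"key1": gzip.compress(b"geo data")}
with tempfile.TemporaryDirectory() as d:
    path = localize_blob(store.__getitem__, "key1", "blob_file", True, d)
    print(open(path, "rb").read())   # b'geo data'
```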
+
+## HdfsBlobStore
+
+
+The HdfsBlobStore functionality has a similar implementation and blob creation and download procedure barring how the replication
--- End diff --
barring?