[
https://issues.apache.org/jira/browse/STORM-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055163#comment-15055163
]
ASF GitHub Bot commented on STORM-1372:
---------------------------------------
Github user zhuoliu commented on a diff in the pull request:
https://github.com/apache/storm/pull/945#discussion_r47450390
--- Diff: docs/documentation/distcache-blobstore.md ---
@@ -0,0 +1,734 @@
+# Storm Distributed Cache API
+
+The distributed cache feature in Storm is used to efficiently distribute files
+(or blobs, which is the equivalent terminology for a file in the distributed
+cache; the two terms are used interchangeably in this document) that are large
+and can change during the lifetime of a topology, such as geo-location data,
+dictionaries, etc. Typical use cases include phrase recognition, entity
+extraction, document classification, URL re-writing, location/address
+detection and so forth. Such files may be several KB to several GB in size.
+For small datasets that don't need dynamic updates, including them in the
+topology jar is fine. But for large files, the startup times could become
+very long. In these cases, the distributed cache feature can provide fast
+topology startup, especially if the files were previously downloaded for the
+same submitter and are still in the cache. This is useful for frequent
+deployments, sometimes a few times a day with updated jars, because the large
+cached blobs that do not change frequently remain available in the
+distributed cache.
+
+At topology submission time, the user specifies the set of files the topology
+needs. Once a topology is running, the user can at any time request that any
+file in the distributed cache be updated with a newer version. The updating
+of blobs happens in an eventual consistency model. If the topology needs to
+know what version of a file it has access to, it is the responsibility of the
+user to find this information out. The files are stored in a cache with a
+Least-Recently-Used (LRU) eviction policy, where the supervisor decides which
+cached files are no longer needed and can delete them to free disk space. The
+blobs can be compressed, and the user can request that the blobs be
+uncompressed before they are accessed.
+
+## Motivation for Distributed Cache
+* Allows sharing blobs among topologies.
+* Allows updating the blobs from the command line.
+
+## Distributed Cache Implementations
+The current BlobStore interface has the following two implementations:
+* LocalFsBlobStore
+* HdfsBlobStore
+
+Appendix A contains the interface for blob store implementation.
+
+## LocalFsBlobStore
+
+The local file system implementation of the blob store is depicted in the
+timeline diagram above.
+
+There are several stages from blob creation to blob download and the
+corresponding execution of a topology. The main stages are as follows.
+
+### Blob Creation Command
+Blobs in the blobstore can be created from the command line using the
+following command:
+
+```
+storm blobstore create --file README.txt --acl o::rwa --repl-fctr 4 key1
+```
+
+The above command creates a blob with a key named “key1” corresponding to the
+file README.txt. All users are given read, write and admin access to it, and
+it is stored with a replication factor of 4.
+
+### Topology Submission and Blob Mapping
+Users can submit their topology with the following command. The command
+includes the topology blobstore map configuration. The configuration holds
+two keys, “key1” and “key2”; the key “key1” has a local file name mapping
+named “blob_file” and is marked not to be uncompressed.
+
+```
+storm jar /home/y/lib/storm-starter/current/storm-starter-jar-with-dependencies.jar storm.starter.clj.word_count test_topo -c topology.blobstore.map='{"key1":{"localname":"blob_file","uncompress":"false"},"key2":{}}'
+```
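+
+The same mapping can also be set programmatically on the topology
+configuration before submission. The following is a minimal sketch, assuming
+`Config` behaves as a plain configuration map and that
+`topology.blobstore.map` accepts the equivalent nested-map structure; the
+class name below is illustrative only.
+
+```java
+import java.util.HashMap;
+import java.util.Map;
+
+import backtype.storm.Config;
+
+public class BlobMapConfigExample {
+    public static void main(String[] args) {
+        Config conf = new Config();
+
+        // Per-blob options: map "key1" to the local name "blob_file" and
+        // leave it as uploaded ("uncompress" stays false).
+        Map<String, Object> key1Options = new HashMap<String, Object>();
+        key1Options.put("localname", "blob_file");
+        key1Options.put("uncompress", false);
+
+        Map<String, Map<String, Object>> blobMap =
+                new HashMap<String, Map<String, Object>>();
+        blobMap.put("key1", key1Options);
+        blobMap.put("key2", new HashMap<String, Object>());  // no options
+
+        conf.put("topology.blobstore.map", blobMap);
+        // ... build the topology and submit it with this conf ...
+    }
+}
+```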
+
+### Blob Creation Process
+The creation of the blob takes place through the interface “ClientBlobStore”.
+Appendix B contains the “ClientBlobStore” interface. The concrete
+implementation of this interface is the “NimbusBlobStore”. In the case of the
+local file system, the client makes a call to nimbus to create the blobs
+within the local file system, and nimbus uses the local file system
+implementation to create these blobs. When a user submits a topology, the
+jar, configuration and code files are uploaded as blobs with the help of the
+blob store. In addition, all the other blobs specified by the topology are
+mapped to it with the help of the topology.blobstore.map configuration.
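+
+As a concrete sketch of what this looks like through the “ClientBlobStore”
+interface (Appendix B), the following assumes the helper
+`Utils.getClientBlobStore` and the classes `SettableBlobMeta`,
+`AccessControl` and `BlobStoreAclHandler` from the Storm codebase; treat the
+exact signatures as assumptions rather than a definitive reference.
+
+```java
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+
+import backtype.storm.blobstore.AtomicOutputStream;
+import backtype.storm.blobstore.BlobStoreAclHandler;
+import backtype.storm.blobstore.ClientBlobStore;
+import backtype.storm.generated.AccessControl;
+import backtype.storm.generated.SettableBlobMeta;
+import backtype.storm.utils.Utils;
+
+public class BlobCreationExample {
+    public static void createAndUpdate(Map conf) throws Exception {
+        // "o::rwa" gives all users read, write and admin access, matching
+        // the CLI example above.
+        AccessControl acl = BlobStoreAclHandler.parseAccessControl("o::rwa");
+        List<AccessControl> acls = new LinkedList<AccessControl>();
+        acls.add(acl);
+        SettableBlobMeta meta = new SettableBlobMeta(acls);
+
+        // For the local file system blob store this resolves to the
+        // NimbusBlobStore thrift client described above.
+        ClientBlobStore clientBlobStore = Utils.getClientBlobStore(conf);
+
+        // Create the blob "key1" and write its initial contents.
+        AtomicOutputStream out = clientBlobStore.createBlob("key1", meta);
+        out.write("initial blob contents".getBytes());
+        out.close();
+
+        // A newer version can later be pushed with updateBlob; running
+        // topologies see it under the eventual consistency model.
+        AtomicOutputStream update = clientBlobStore.updateBlob("key1");
+        update.write("updated blob contents".getBytes());
+        update.close();
+    }
+}
+```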
+
+### Blob Download by the Supervisor
+Finally, the blobs corresponding to a topology are downloaded by the
+supervisor once it receives the assignments from nimbus, through the same
+“NimbusBlobStore” thrift client that uploaded the blobs. The supervisor
+downloads the code, jar and conf blobs by calling the “NimbusBlobStore”
+client directly, while the blobs specified in the topology.blobstore.map are
+downloaded and mapped locally with the help of the Localizer. The Localizer
+talks to the “NimbusBlobStore” thrift client to download the blobs and adds
+the blob compression and local blob name mapping logic to suit the
+implementation of a topology. Once all the blobs have been downloaded, the
+workers are launched to run the topologies.
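+
+A blob can likewise be fetched and read back through the same client
+interface. The sketch below assumes `getBlob` returns an
+`InputStreamWithMeta`, as in the interface listed in Appendix B.
+
+```java
+import java.io.BufferedReader;
+import java.io.InputStreamReader;
+import java.util.Map;
+
+import backtype.storm.blobstore.ClientBlobStore;
+import backtype.storm.blobstore.InputStreamWithMeta;
+import backtype.storm.utils.Utils;
+
+public class BlobReadExample {
+    public static String readFirstLine(Map conf, String blobKey) throws Exception {
+        ClientBlobStore clientBlobStore = Utils.getClientBlobStore(conf);
+        // getBlob returns the blob contents along with its metadata.
+        InputStreamWithMeta in = clientBlobStore.getBlob(blobKey);
+        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
+        try {
+            return reader.readLine();  // first line of the blob's contents
+        } finally {
+            reader.close();
+        }
+    }
+}
+```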
+
+## HdfsBlobStore
+
+The HdfsBlobStore functionality has a similar implementation and blob
+creation and download procedure, except for how replication is handled in
+the two blob store implementations. Replication in the HDFS blob store is
+straightforward, as HDFS is equipped to handle replication, and no state
+needs to be stored inside ZooKeeper. On the other hand, the local file
+system blob store requires state to be stored in ZooKeeper in order for it
+to work with nimbus HA. Nimbus HA allows the local file system to implement
+the replication feature seamlessly by storing state in ZooKeeper about the
+running topologies and syncing the blobs across the various nimbus nodes. On
+the supervisor’s end, the supervisor and the Localizer talk to the
+HdfsBlobStore through the “HdfsClientBlobStore” implementation.
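+
+A sketch of the configuration that selects the HDFS implementation follows;
+the config keys `nimbus.blobstore.class`, `client.blobstore.class` and
+`blobstore.dir`, and the class names used, are assumptions based on the
+implementations named in this document, not a definitive reference.
+
+```java
+import backtype.storm.Config;
+
+public class HdfsBlobStoreConfigExample {
+    public static void main(String[] args) {
+        // These settings would normally live in storm.yaml on the nimbus
+        // and supervisor nodes; they are shown here as a plain config map.
+        Config conf = new Config();
+
+        // Select the HDFS-backed blob store implementation on nimbus.
+        conf.put("nimbus.blobstore.class",
+                 "org.apache.storm.hdfs.blobstore.HdfsBlobStore");
+
+        // Clients (CLI, supervisors) reach HDFS through the matching
+        // client-side implementation.
+        conf.put("client.blobstore.class",
+                 "org.apache.storm.hdfs.blobstore.HdfsClientBlobStore");
+
+        // Directory under which blobs are stored in HDFS.
+        conf.put("blobstore.dir", "/storm/blobstore");
+    }
+}
+```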
+
+## Additional Features and Documentation
+The topology submission command shown earlier, including its
+topology.blobstore.map configuration, is repeated below for reference:
+
+```
+storm jar /home/y/lib/storm-starter/current/storm-starter-jar-with-dependencies.jar storm.starter.clj.word_count test_topo -c topology.blobstore.map='{"key1":{"localname":"blob_file","uncompress":"false"},"key2":{}}'
+```
+
+### Compression
+The blob store allows the user to set the “uncompress” configuration to true
+or false. This configuration can be specified in the topology.blobstore.map
+mentioned in the above command, and it allows the user to upload a
+compressed file like a tarball or zip. In the local file system blob store,
+the compressed blobs are stored on the nimbus node. The Localizer takes the
+responsibility of uncompressing the blob and storing it on the supervisor
+node. Symbolic links to the blobs on the supervisor node are created within
+the worker before the execution starts.
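+
+Assuming the symbolic link is created inside the worker’s current working
+directory, worker code could read the blob mapped above as “blob_file” with
+something like the following minimal sketch.
+
+```java
+import java.io.File;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.util.List;
+
+public class LocalBlobAccessExample {
+    public static List<String> readMappedBlob() throws Exception {
+        // "blob_file" is the local name given in topology.blobstore.map; the
+        // symbolic link is assumed to live in the worker's working directory.
+        File blob = new File("blob_file");
+        return Files.readAllLines(blob.toPath(), StandardCharsets.UTF_8);
+    }
+}
+```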
+
+### Local File Name Mapping
+Apart from compression, the blob store helps give the blob a name that can
+be used by the workers. The Localizer takes the responsibility of mapping
+the blob to a local name on the supervisor node.
+
+## Additional Blob Store Implementation Details
+The blob store uses a hashing function to create the blobs based on the key.
+The blobs are generally stored inside the directory specified by the
+blobstore.dir configuration. By default, they are stored under
+“storm.local.dir/nimbus/blobs” for the local file system, and under a
+similar path on the HDFS file system.
+
+Once a file is submitted, the blob store reads the configs and creates
+metadata for the blob with all the access control details. The metadata is
+generally used for authorization while accessing the blobs. The blob key and
+version contribute to the hash code, and thereby to the directory under
+“storm.local.dir/nimbus/blobs/data” where the data is placed. The blobs are
+generally placed in a positive-number directory such as 193,822.
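+
+As an illustration only (this is not Storm’s actual hashing code), the
+layout described above can be pictured as:
+
+```java
+import java.nio.file.Path;
+import java.nio.file.Paths;
+
+public class BlobDirLayoutExample {
+    // Hypothetical sketch: the blob key and version feed a hash code, and
+    // the data lands in a positive-number directory below
+    // storm.local.dir/nimbus/blobs/data.
+    public static Path blobDataDir(String stormLocalDir, String key, int version) {
+        int hash = (key + version).hashCode() & Integer.MAX_VALUE;  // keep it positive
+        return Paths.get(stormLocalDir, "nimbus", "blobs", "data",
+                Integer.toString(hash));
+    }
+}
+```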
--- End diff ---
May remove the last ","
> Update BlobStore Documentation - Follow up STORM-876
> ----------------------------------------------------
>
> Key: STORM-1372
> URL: https://issues.apache.org/jira/browse/STORM-1372
> Project: Apache Storm
> Issue Type: Story
> Reporter: Sanket Reddy
> Priority: Minor
>