[ https://issues.apache.org/jira/browse/STORM-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056095#comment-15056095 ]

ASF GitHub Bot commented on STORM-1372:
---------------------------------------

Github user unsleepy22 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/945#discussion_r47508932
  
    --- Diff: docs/documentation/distcache-blobstore.md ---
    @@ -0,0 +1,733 @@
    +# Storm Distributed Cache API
    +
    +The distributed cache feature in storm is used to efficiently distribute files
    +(or blobs, which is the equivalent terminology for a file in the distributed
    +cache and is used interchangeably in this document) that are large and can
    +change during the lifetime of a topology, such as geo-location data,
    +dictionaries, etc. Typical use cases include phrase recognition, entity
    +extraction, document classification, URL re-writing, location/address detection
    +and so forth. Such files may be several KB to several GB in size. For small
    +datasets that don't need dynamic updates, including them in the topology jar
    +could be fine. But for large files, the startup times could become very long.
    +In these cases, the distributed cache feature can provide fast topology startup,
    +especially if the files were previously downloaded for the same submitter and
    +are still in the cache. This is useful with frequent deployments, sometimes a
    +few times a day with updated jars, because the large cached blobs that do not
    +change frequently remain available in the distributed cache without being
    +distributed again.
    +
    +At the start time of a topology, the user specifies the set of files the
    +topology needs. Once a topology is running, the user can at any time request
    +that any file in the distributed cache be updated with a newer version. The
    +updating of blobs follows an eventual consistency model. If the topology needs
    +to know which version of a file it has access to, it is the responsibility of
    +the user to find this information out. The files are stored in a cache with a
    +Least-Recently Used (LRU) eviction policy, where the supervisor decides which
    +cached files are no longer needed and can delete them to free disk space. The
    +blobs can be compressed, and the user can request that the blobs be
    +uncompressed before the topology accesses them.
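    +
    +For example, requesting an update of an existing blob through the client API
    +(see Appendix B) could look roughly like the sketch below. The
    +“Utils.getClientBlobStore” helper and the “updateBlob” signature are taken
    +from the codebase as of this change and may differ in other versions.
    +
    +```
    +import java.util.Map;
    +
    +import backtype.storm.blobstore.AtomicOutputStream;
    +import backtype.storm.blobstore.ClientBlobStore;
    +import backtype.storm.utils.Utils;
    +
    +public class BlobUpdateExample {
    +    public static void main(String[] args) throws Exception {
    +        Map conf = Utils.readStormConfig();
    +        ClientBlobStore blobStore = Utils.getClientBlobStore(conf);
    +        // Replace the contents stored under "key1"; supervisors pick up the
    +        // new version eventually, per the consistency model described above.
    +        AtomicOutputStream out = blobStore.updateBlob("key1");
    +        out.write("updated contents".getBytes());
    +        out.close();
    +        blobStore.shutdown();
    +    }
    +}
    +```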
    +
    +## Motivation for Distributed Cache
    +* Allows sharing blobs among topologies.
    +* Allows updating the blobs from the command line.
    +
    +## Distributed Cache Implementations
    +The current BlobStore interface has the following two implementations:
    +* LocalFsBlobStore
    +* HdfsBlobStore
    +
    +Appendix A contains the interface for blob store implementation.
    +
    +## LocalFsBlobStore
    +![LocalFsBlobStore](images/local_blobstore.png)
    +
    +The local file system implementation of the blobstore is depicted in the
    +timeline diagram above.
    +
    +There are several stages from blob creation to blob download and the
    +corresponding execution of a topology. The main stages are as follows.
    +
    +### Blob Creation Command
    +Blobs in the blobstore can be created from the command line using the
    +following command:
    +
    +```
    +storm blobstore create --file README.txt --acl o::rwa --repl-fctr 4 key1
    +```
    +
    +The above command creates a blob with a key named “key1” corresponding to the
    +file README.txt, granting all users read, write and admin access with a
    +replication factor of 4.
    +
    +### Topology Submission and Blob Mapping
    +Users can submit their topology with the following command. The command
    +includes the topology.blobstore.map configuration. The configuration holds two
    +keys, “key1” and “key2”, with the key “key1” mapped to the local file name
    +“blob_file” and marked not to be uncompressed.
    +
    +```
    +storm jar /home/y/lib/storm-starter/current/storm-starter-jar-with-dependencies.jar storm.starter.clj.word_count test_topo -c topology.blobstore.map='{"key1":{"localname":"blob_file", "uncompress":"false"},"key2":{}}'
    +```
    +
    +### Blob Creation Process
    +The creation of the blob takes place through the interface “ClientBlobStore”.
    +Appendix B contains the “ClientBlobStore” interface. The concrete
    +implementation of this interface is the “NimbusBlobStore”. In the case of the
    +local file system, the client makes a call to the nimbus to create the blobs
    +within the local file system. The nimbus uses the local file system
    +implementation to create these blobs. When a user submits a topology, the jar,
    +configuration and code files are uploaded as blobs with the help of the blob
    +store. Also, all the other blobs specified by the topology are mapped to it
    +with the help of the topology.blobstore.map configuration.
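    +
    +As a rough illustration, creating a blob programmatically through this
    +interface could look like the sketch below. The helper methods used here
    +(“Utils.getClientBlobStore”, “BlobStoreAclHandler.parseAccessControls”) are
    +assumed from the codebase as of this change; the ACL matches the command line
    +example above.
    +
    +```
    +import java.util.Map;
    +
    +import backtype.storm.blobstore.AtomicOutputStream;
    +import backtype.storm.blobstore.BlobStoreAclHandler;
    +import backtype.storm.blobstore.ClientBlobStore;
    +import backtype.storm.generated.SettableBlobMeta;
    +import backtype.storm.utils.Utils;
    +
    +public class BlobCreateExample {
    +    public static void main(String[] args) throws Exception {
    +        Map conf = Utils.readStormConfig();
    +        // The concrete client is the "NimbusBlobStore" chosen via the config.
    +        ClientBlobStore blobStore = Utils.getClientBlobStore(conf);
    +        // o::rwa - all users get read, write and admin access.
    +        SettableBlobMeta meta = new SettableBlobMeta(
    +            BlobStoreAclHandler.parseAccessControls("o::rwa"));
    +        // The blob becomes visible to readers only once the stream is closed.
    +        AtomicOutputStream out = blobStore.createBlob("key1", meta);
    +        out.write("some blob content".getBytes());
    +        out.close();
    +        blobStore.shutdown();
    +    }
    +}
    +```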
    +
    +### Blob Download by the Supervisor
    +Finally, the blobs corresponding to a topology are downloaded by the
    +supervisor once it receives the assignments from the nimbus, through the same
    +“NimbusBlobStore” thrift client that uploaded the blobs. The supervisor
    +downloads the code, jar and conf blobs by calling the “NimbusBlobStore” client
    +directly, while the blobs specified in the topology.blobstore.map are
    +downloaded and mapped locally with the help of the Localizer. The Localizer
    +talks to the “NimbusBlobStore” thrift client to download the blobs and adds
    +the blob decompression and local blob name mapping logic to suit the
    +implementation of a topology. Once all the blobs have been downloaded, the
    +workers are launched to run the topologies.
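    +
    +As an illustration of how a topology might then consume a mapped blob, the
    +sketch below assumes the Localizer has materialized the blob under its local
    +name (“blob_file” from the submission example above) in the worker's current
    +working directory.
    +
    +```
    +import java.io.File;
    +import java.nio.charset.StandardCharsets;
    +import java.nio.file.Files;
    +import java.util.List;
    +import java.util.Map;
    +
    +import backtype.storm.task.TopologyContext;
    +import backtype.storm.topology.BasicOutputCollector;
    +import backtype.storm.topology.OutputFieldsDeclarer;
    +import backtype.storm.topology.base.BaseBasicBolt;
    +import backtype.storm.tuple.Tuple;
    +
    +public class BlobReadingBolt extends BaseBasicBolt {
    +    private List<String> dictionary;
    +
    +    @Override
    +    public void prepare(Map stormConf, TopologyContext context) {
    +        try {
    +            // The Localizer links the cached blob into the worker directory
    +            // under the localname given in topology.blobstore.map.
    +            dictionary = Files.readAllLines(
    +                new File("blob_file").toPath(), StandardCharsets.UTF_8);
    +        } catch (Exception e) {
    +            throw new RuntimeException("blob not available", e);
    +        }
    +    }
    +
    +    @Override
    +    public void execute(Tuple input, BasicOutputCollector collector) {
    +        // Use the dictionary loaded from the distributed cache per tuple.
    +    }
    +
    +    @Override
    +    public void declareOutputFields(OutputFieldsDeclarer declarer) {
    +    }
    +}
    +```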
    +
    +## HdfsBlobStore
    +![HdfsBlobStore](images/hdfs_blobstore.png)
    +
    +The HdfsBlobStore has a similar implementation and blob creation and download
    +procedure, differing only in how replication is handled in the two blob store
    +implementations. Replication in the HDFS blob store is straightforward, as
    +HDFS is equipped to handle replication, and it requires no state to be stored
    +inside the zookeeper. On the other hand, the local file system blobstore
    +requires the state to be stored on the zookeeper in order for it to work with
    +nimbus HA. Nimbus HA allows the local filesystem to implement the replication
    +feature seamlessly by storing the state in the zookeeper about the running
    +topologies and syncing the blobs on various nimbodes. On the supervisor’s
    --- End diff --
    
    nimbode?


> Update BlobStore Documentation - Follow up STORM-876
> ----------------------------------------------------
>
>                 Key: STORM-1372
>                 URL: https://issues.apache.org/jira/browse/STORM-1372
>             Project: Apache Storm
>          Issue Type: Story
>            Reporter: Sanket Reddy
>            Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
