[
https://issues.apache.org/jira/browse/SAMZA-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391490#comment-14391490
]
Sriram Subramanian commented on SAMZA-622:
------------------------------------------
Is writing to HDFS through RocksDB the best way to get the state into HDFS?
Consuming the changelog stream from Kafka and dumping it into HDFS would let
us keep good state performance locally. No?
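A minimal sketch of the changelog route, under stated assumptions and not
using Samza's or Kafka's actual APIs: the changelog is modeled as an in-order
sequence of (key, value) pairs, a value of None stands in for a delete marker,
and the names `Record` and `compact` are hypothetical. Replaying the stream
and keeping the last value per key gives the snapshot that would then be
written out to HDFS.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical record shape: (key, value). A value of None is treated as a
# delete marker for that key (an assumption made for this sketch).
Record = Tuple[str, Optional[str]]


def compact(changelog: Iterable[Record]) -> Dict[str, str]:
    """Replay a changelog in order and keep only the latest value per key."""
    snapshot: Dict[str, str] = {}
    for key, value in changelog:
        if value is None:
            # Delete marker: drop the key if it is present.
            snapshot.pop(key, None)
        else:
            # Later writes overwrite earlier ones.
            snapshot[key] = value
    return snapshot


if __name__ == "__main__":
    log = [("a", "1"), ("b", "2"), ("a", "3"), ("b", None), ("c", "4")]
    for key, value in sorted(compact(log).items()):
        print(f"{key}\t{value}")
```

The local RocksDB store keeps serving reads and writes in this setup; only the
replayed snapshot goes to HDFS.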
> Persisting Samza State on HDFS
> ------------------------------
>
> Key: SAMZA-622
> URL: https://issues.apache.org/jira/browse/SAMZA-622
> Project: Samza
> Issue Type: Improvement
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
>
> Samza's state currently lives in Kafka as a (compacted) changelog and in a
> local RocksDB KV store.
> It would be nice to save this onto HDFS directly, for the following reasons:
> - HDFS is a fault-tolerant FS. Thus, a Samza task can be restarted by
> placing it where the other copies of its state are.
> - HDFS virtualizes storage and thus, one would not have to worry explicitly
> about balancing disk usage across different tiers (I don't know what the
> right word is) in a data flow graph
> - Storing the state in HDFS makes it easier to share it with other
> processing systems in Hadoop land.
> RocksDB seems to have an option to store files on HDFS:
> https://github.com/facebook/rocksdb/tree/master/hdfs (Has anyone played with
> this?)
> Context: I am working on producing compacted DB snapshots on HDFS for
> Spark/MR jobs to use, and am thus super interested in this.