[
https://issues.apache.org/jira/browse/SAMZA-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487591#comment-14487591
]
Vinoth Chandar commented on SAMZA-622:
--------------------------------------
Sorry for the delayed response. While I empathize with not shoehorning things
into a framework, the practical reality of building systems is that sometimes
they offer different guarantees when set up differently.
>> This approach is fine for a short-term hacky approach.
I feel these discussions are premature. Let me do more groundwork and put up a
proposal. Then we can discuss in detail which guarantees are met and which
aren't, and then make a call. I think most of what you mentioned is captured
in https://issues.apache.org/jira/browse/SAMZA-72 (and its outgoing links). I
will take these into account.
> Persisting Samza State on HDFS
> ------------------------------
>
> Key: SAMZA-622
> URL: https://issues.apache.org/jira/browse/SAMZA-622
> Project: Samza
> Issue Type: Improvement
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
>
> Samza's state currently lives in Kafka as a (compacted) changelog and in a
> local RocksDB KV store.
> It would be nice to save this onto HDFS directly, for the following reasons:
> - HDFS is a fault-tolerant FS. Thus, Samza tasks can be restarted by placing
> the task where the other copies of the state are.
> - HDFS virtualizes storage and thus, one would not have to worry explicitly
> about balancing disk usage across different tiers (I don't know what the
> right word is) in a data flow graph
> - Storing the state in HDFS makes it easier to share it with other
> processing systems in Hadoop land.
> RocksDB seems to have an option to store files on HDFS:
> https://github.com/facebook/rocksdb/tree/master/hdfs (Has anyone played with
> this?)
> Context: I am working on producing compacted DB snapshots on HDFS for
> Spark/MR jobs to use, and am thus super interested in this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)