[
https://issues.apache.org/jira/browse/SAMZA-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392056#comment-14392056
]
Sriram Subramanian commented on SAMZA-622:
------------------------------------------
I think that is where Samza differs. Apart from great raw throughput, you
also need to consider whether you will need exactly-once semantics in the
future. Even if you don't in your case, we need to think about this from the
framework's perspective in the long term (if we are adding this support to
the framework). Say we have a config in Samza that specifies whether the state
is written to HDFS or not. The guarantees provided by the framework should
not change because of that setting. The exactly-once guarantee depends on
writing the input offsets, state, and output in one transaction, and this would
be achieved with transaction support in Kafka. Keeping just the state in HDFS
makes this impossible, because the state write would fall outside that
transaction. My concern is that, if this is going to be something we support
in Samza, it needs to provide the same guarantees.
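
To make the concern concrete, here is a toy sketch in Python. It is not Samza
or Kafka code: `run_task`, `run_with_one_crash`, `crash_at`, and the `durable`
dict are hypothetical names used only for illustration, and the "transaction"
is simulated by updating three fields together. It assumes a task that counts
messages and fails once, right after its state write.

```python
class Crash(Exception):
    """Simulated task failure (hypothetical fault injection)."""


def run_task(messages, durable, atomic, crash_at=None):
    """Count messages starting from the last committed offset.

    durable  -- dict with 'offset', 'count' (state), 'output'; survives restarts
    atomic   -- True: offset, state, and output are committed together
                False: state is written first, offset and output afterwards
    crash_at -- index of the message during which the task fails, if any
    """
    while durable["offset"] < len(messages):
        i = durable["offset"]
        new_count = durable["count"] + 1
        new_output = durable["output"] + [(messages[i], new_count)]
        if atomic:
            if crash_at == i:
                raise Crash()  # nothing was committed for this message
            durable.update(offset=i + 1, count=new_count, output=new_output)
        else:
            durable["count"] = new_count  # state lands in a separate store
            if crash_at == i:
                raise Crash()  # offset and output were never committed
            durable["offset"] = i + 1
            durable["output"] = new_output


def run_with_one_crash(messages, atomic, crash_at):
    """Run the task, let it fail once, then restart it from durable data."""
    durable = {"offset": 0, "count": 0, "output": []}
    try:
        run_task(messages, durable, atomic, crash_at)
    except Crash:
        pass
    run_task(messages, durable, atomic)  # restart with no further failure
    return durable


messages = ["a", "b", "c"]
together = run_with_one_crash(messages, atomic=True, crash_at=1)
separate = run_with_one_crash(messages, atomic=False, crash_at=1)
print(together["count"], separate["count"])
```

With the three writes committed together, the restarted task ends with a count
equal to the number of messages. With the state written on its own, the replayed
message is counted twice, which is the change in guarantees described above.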
> Persisting Samza State on HDFS
> ------------------------------
>
> Key: SAMZA-622
> URL: https://issues.apache.org/jira/browse/SAMZA-622
> Project: Samza
> Issue Type: Improvement
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
>
> Samza's state currently lives in Kafka as a (compacted) changelog and in a
> local RocksDB KV store.
> It would be nice to save this onto HDFS directly, for the following reasons:
> - HDFS is a fault-tolerant FS. Thus, a Samza task can be restarted by
> placing it where the other copies of its state are.
> - HDFS virtualizes storage and thus, one would not have to worry explicitly
> about balancing disk usage across different tiers (I don't know what the
> right word is) in a data flow graph
> - Storing the state in HDFS makes it easier to share it with other
> processing systems in Hadoop land.
> RocksDB seems to have an option to store files on HDFS:
> https://github.com/facebook/rocksdb/tree/master/hdfs (Has anyone played with
> this?)
> Context: I am working on producing compacted DB snapshots on HDFS for
> Spark/MR jobs to use, and am thus very interested in this.