[ 
https://issues.apache.org/jira/browse/SAMZA-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392056#comment-14392056
 ] 

Sriram Subramanian commented on SAMZA-622:
------------------------------------------

I think that is where Samza differs. Beyond raw throughput, you would need to 
consider whether you will need exactly-once semantics in the future. Even if 
you don't in your case, we need to think about this from the framework's 
perspective in the long term (if we are adding this support to the framework). 
Say we have a config in Samza that specifies whether the state is written to 
HDFS or not. The guarantees provided by the framework should not change because 
of that setting. The exactly-once guarantee depends on writing the input 
offsets, state, and output in one transaction, which would be achieved with 
transaction support in Kafka. Keeping just the state in HDFS would make this 
impossible. My concern is that, if this is going to be a feature we add to 
Samza, it needs to provide the same guarantees. 
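
To illustrate the concern, here is a minimal sketch of why the offsets, state, 
and output have to become durable together. It is a toy in-memory model, not 
Samza or Kafka code: the names run, run_with_one_crash, Crash, and the store 
dict are all hypothetical, and the "transaction" is simulated by a single 
update of the dict. 

```python
class Crash(Exception):
    """Simulated task failure (hypothetical, for illustration only)."""


def run(messages, store, atomic, crash_at=None):
    """Count messages per key, resuming from the checkpointed offset.

    store models durable data: 'offset' (next input offset to read),
    'state' (per-key counts), and 'output' (emitted records).
    """
    for offset in range(store["offset"], len(messages)):
        key = messages[offset]
        new_state = dict(store["state"])
        new_state[key] = new_state.get(key, 0) + 1
        record = (key, new_state[key])
        if atomic:
            if crash_at == offset:
                raise Crash()  # nothing from this message is durable yet
            # Offset, state, and output become durable in one step.
            store.update(offset=offset + 1, state=new_state,
                         output=store["output"] + [record])
        else:
            # State and output are flushed first ...
            store["state"] = new_state
            store["output"] = store["output"] + [record]
            if crash_at == offset:
                raise Crash()  # ... but the offset checkpoint is lost
            store["offset"] = offset + 1


def run_with_one_crash(messages, atomic, crash_at):
    """Run once with an injected crash, then restart from durable data."""
    store = {"offset": 0, "state": {}, "output": []}
    try:
        run(messages, store, atomic, crash_at)
    except Crash:
        pass
    run(messages, store, atomic)
    return store
```

In this model, when the state is persisted separately from the offset 
checkpoint, the message in flight at the time of the crash is applied twice 
after the restart. With the single atomic update it is applied exactly once. 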


> Persisting Samza State on HDFS
> ------------------------------
>
>                 Key: SAMZA-622
>                 URL: https://issues.apache.org/jira/browse/SAMZA-622
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>
> Samza's state currently lives in Kafka as a (compacted) change log and in a 
> local RocksDB KV store. 
> It would be nice to save this onto HDFS directly for the following reasons: 
> - HDFS is a fault-tolerant FS. Thus, restarting Samza tasks can be achieved 
> by relocating the task to where the other copies are.
> - HDFS virtualizes storage and thus, one would not have to worry explicitly 
> about balancing disk usage across different tiers (I don't know what the 
> right word is) in a data flow graph
> - Storing the state in HDFS makes it easier to share it with other 
> processing systems in Hadoop land. 
> RocksDB seems to have an option to store files on HDFS: 
> https://github.com/facebook/rocksdb/tree/master/hdfs (Has anyone played with 
> this?) 
> Context: I am working on producing compacted DB snapshots on HDFS for 
> Spark/MR jobs to use, and am thus super interested in this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
