[ 
https://issues.apache.org/jira/browse/SAMZA-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393090#comment-14393090
 ] 

Chinmay Soman commented on SAMZA-622:
-------------------------------------

I think you're right that "getting exactly-once is difficult", period. What I 
think you're really saying is that getting it to work is far more difficult 
(if not impossible) with HDFS than with an embedded RocksDB store. Even with 
an embedded store, there are some issues that we have to deal with.

That said, the ability to write to HDFS is fantastic, given that a lot of 
companies need to do this anyway (mostly for ETL). In addition, it solves a 
lot of the global state / state sharing problems. We could add a disclaimer 
saying "this approach does not guarantee blah blah".


> Persisting Samza State on HDFS
> ------------------------------
>
>                 Key: SAMZA-622
>                 URL: https://issues.apache.org/jira/browse/SAMZA-622
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>
> Samza's state currently lives in Kafka as a (compacted) change log and in a 
> local RocksDB KV store. 
> It would be nice to save this onto HDFS directly for the following reasons 
> - HDFS is a fault-tolerant FS. Thus, a Samza task can be restarted by 
> relocating it to where the other copies are.
> - HDFS virtualizes storage and thus, one would not have to worry explicitly 
> about balancing disk usage across different tiers (I don't know what the 
> right word is) in a data flow graph
> - Storing the state in HDFS makes it easier to share it with other 
> processing systems in Hadoop land. 
> RocksDB seems to have an option to store files on HDFS: 
> https://github.com/facebook/rocksdb/tree/master/hdfs (has anyone played with 
> this?). 
> Context: I am working on producing compacted DB snapshots on HDFS for 
> spark/MR jobs to use and thus super interested in this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
