[ 
https://issues.apache.org/jira/browse/SAMZA-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123095#comment-14123095
 ] 

Chris Riccomini commented on SAMZA-402:
---------------------------------------

bq. If we want to support use cases where a batch job pushes a new version of 
the state that completely replaces the old version, then we would probably need 
atomic swaps and handling of deletions. For that reason, I'm inclined to not 
support such batch updates of shared state. Batch-updated state can continue to 
use Voldemort.

Yeah, I was thinking along these lines as well. If you want atomic swaps, you'll 
have to use a remote store.

bq. In my opinion, the key-value interface for a shared store should not permit 
writing (calling put() should raise an exception), to avoid setting false 
expectations of synchronous updates and magical distributed consistency.

Agreed. I think this was the main difference between my latest proposal and 
yours. I was saying that put() should write to the DB immediately. I've come to 
realize that this is not easy to implement, and pretty confusing, so I agree 
with what you're saying here.
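
To make the read-only idea concrete, here is a minimal sketch of what such a store could look like. It is not Samza's actual KeyValueStore API: the SharedStore interface, the ReadOnlySharedStore class, and the applyChange() method are all hypothetical names, and treating a null value as a deletion is an assumption of this sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical interface for illustration; not Samza's actual KeyValueStore.
interface SharedStore<K, V> {
  V get(K key);
  void put(K key, V value);
}

// Read-only view handed to a StreamTask: reads are served from local state,
// writes are rejected so nobody expects synchronous, consistent updates.
final class ReadOnlySharedStore<K, V> implements SharedStore<K, V> {
  private final Map<K, V> local = new ConcurrentHashMap<>();

  @Override
  public V get(K key) {
    return local.get(key);
  }

  @Override
  public void put(K key, V value) {
    throw new UnsupportedOperationException("shared store is read-only for tasks");
  }

  // Assumed to be called only by whatever consumes the change log,
  // one key at a time. A null value is treated as a deletion (assumption).
  void applyChange(K key, V value) {
    if (value == null) {
      local.remove(key);
    } else {
      local.put(key, value);
    }
  }
}
```

The only write path is applyChange(), which keeps single-key updates separate from the interface that tasks see.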

bq. On SAMZA-353 we discussed whether the StreamTask should be notified about 
changes in the store. I now think that probably isn't necessary, at least for a 
first version.

I was planning to punt on this as well. We could always add some callback or 
something like that later.

bq. In summary: just because certain use cases can't easily be satisfied, we 
shouldn't throw the baby out with the bathwater. I think we should implement a 
simple version of shared state which is read-only and which only supports 
single-key updates (no batch updates, no atomic switching), like you describe 
in the implementation section of the design doc. That would already be very 
useful, and leave open our options to support more use cases in future.

I guess the question is: does this implementation provide enough of a benefit 
to be worth doing, vs. just a remote store with a local cache? The two 
arguments that I can come up with are:

# Operational complexity of running a remote store.
# Performance will be better if there are no remote queries.

The deciding factor for me on which approach is actually "better" for a Samza 
job is whether the state that it needs is *already* in a DB. If it is, and it 
has to remain there for other reasons, then setting up a change log and having 
the Samza job consume the state adds complexity (vs. just querying it). If it's 
not, then the global state solution seems preferable (since the data is 
probably coming from a Hadoop push).

It kind of feels like there are two use cases here:

# Primary data exists on a remote DB and is being used by other stuff (e.g. 
front ends) in addition to the Samza job.
# Derived data is computed offline, and needs to be pushed somewhere for the 
Samza job to use.

For (1), remote DB with cache seems better. For (2), I think the global store 
is better.
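
For (1), a rough sketch of the "remote DB with cache" option could look like the following. Everything here is hypothetical: CachedRemoteStore is an illustrative name, remoteLookup stands in for whatever client call queries the remote DB, and the LRU eviction policy is just one possible choice. There is no expiry, so cached values can go stale, and the class is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical read-through cache in front of a remote store.
// Not thread-safe; assumes a single thread calls get().
final class CachedRemoteStore<K, V> {
  private final Function<K, V> remoteLookup; // stand-in for a remote DB query
  private final Map<K, V> cache;

  CachedRemoteStore(Function<K, V> remoteLookup, final int maxEntries) {
    this.remoteLookup = remoteLookup;
    // An access-ordered LinkedHashMap evicts the least recently used entry
    // once the cache grows past maxEntries.
    this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
      }
    };
  }

  V get(K key) {
    V value = cache.get(key);
    if (value == null) {
      // Cache miss: this is the remote query that the global store avoids.
      value = remoteLookup.apply(key);
      if (value != null) {
        cache.put(key, value);
      }
    }
    return value;
  }
}
```

Every cache miss costs a remote round trip, which is the performance argument for the global store in (2).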

> Provide a "shared state" store among StreamTasks
> ------------------------------------------------
>
>                 Key: SAMZA-402
>                 URL: https://issues.apache.org/jira/browse/SAMZA-402
>             Project: Samza
>          Issue Type: Bug
>          Components: container, kv
>    Affects Versions: 0.8.0
>            Reporter: Chris Riccomini
>         Attachments: DESIGN-SAMZA-402-0.md, DESIGN-SAMZA-402-0.pdf
>
>
> There has been a lot of discussion about shared state stores in SAMZA-353. 
> Initially, it seemed as though we might implement them through SAMZA-353, but 
> now it seems preferable to implement them separately. As such, this 
> ticket is to discuss global state/shared state (terms that are being used 
> interchangeably) between StreamTasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
