Well written doc. I think I can understand every point even with limited 
knowledge of Samza :)

A few comments:

1) stateful_job.png

Could you make the changelog streams into the box of tasks to demo they are 
collocated? And could we also add the KV stores into the box? At first glance I 
thought the changelog is actually the store. Also the name "changelog stream" 
is a little confusing. At first I was wondering why this is called a stream and 
would it be just a "log". I only get this while reading the fault tolerance 
section.

2) "This does not impact consistency—a task always reads what it wrote (since 
it checks the cache first)"

I think another main reason is that a task only read/write its own partition.

3) Do we need guarantee consistency between the incoming stream and the change 
log stream. For example, the change log should indicate that this change entry 
is for the state processed until offset X of the incoming stream so that on 
failover we know where we should set the offset of the incoming stream?

4) Would reading directly from LevelDB a better approach than replaying the 
whole (although compacted) changelog? Writes to LevelDB should be persistent, 
and we only need to make sure we have checkpoints, for example, the current 
state (excluding writes that are still in cache) represent all the changes till 
the offset of the changelog?

Guozhang



________________________________________
From: Jay Kreps [[email protected]]
Sent: Thursday, September 05, 2013 5:14 PM
To: [email protected]
Subject: state management docs

I took a pass at improving the state management documentation (talking to
people, I don't think anyone understood what we were saying):
http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html

I would love to get some feedback on this, especially from anyone who
doesn't already know Samza. Does this make any sense? Does it tell you what
you need to know in the right order?

-Jay

Reply via email to