Hi

I'm using Samza to aggregate data within sliding windows. I'm especially
interested in the fault tolerance aspect of Samza and it's ability to
restore from the latest committed offset. I've looked into the
WindowableTask interface, but this implementation relies on the system
time, rather then the timestamp of the messages received. The main problem
is to make sure all data for one window is available and in case of node
failure the complete window is restored. With the current implementation of
offset committing (committed by time interval, window size or upon
`taskCoordinator.commit()`) this is quite cumbersome. I have two different
options in mind to overcome this problem and would like to hear your advice
on this.

1. Using the Key-Value store the data of a the current window is cached and
after a restore one could figure out what data to include from the cache
and from Kafka. Unfortunately, this sounds like reimplementing the offset
management of Kafka.

2. Set the time based commit interval to Long.MAX_VALUE and handle all
commit intervals through `taskCoordinator.commit()`. In this case I could
commit the offset after each sliding window and in case of failure Samza
would restore from the beginning of the new sliding window. The only
problem with this approach is `taskCoordinator.commit()` will commit the
offsets of all partitions. But the data may be different between partitions
and therefore sliding windows will not match with other partitions. I've
looked into SAMZA-23 [1] and found a proposal to coordinate commits for
each partition individually. Since this would solve my problem, I'd like to
provide a patch for this issue. Is there any interest in this?

Best
Nicolas


[1]: https://issues.apache.org/jira/browse/SAMZA-23

Reply via email to