[ 
https://issues.apache.org/jira/browse/SAMZA-255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985539#comment-13985539
 ] 

Martin Kleppmann commented on SAMZA-255:
----------------------------------------

Thanks for this idea. Trying to dig into it a bit further:

For use case 1, it sounds like what you really need is for the checkpoint to 
happen at a point in time determined by the task (namely the start of your 
window). Is that right? I think you can already achieve that by setting 
task.commit.ms to a very large value, and calling coordinator.commit() when you 
want to write a checkpoint.

For use case 2, are you trying to synchronize several input streams/partitions 
based on some timestamp within the messages themselves? You might be able to 
achieve some of what you want by using a custom MessageChooser; if one stream 
is getting ahead of the others, you can hold off consuming it until the other 
streams have got further along.

I'm trying to understand whether you really need this feature, because so far 
we have been keen to keep Samza always consuming messages in the order they 
appear in streams. Allowing tasks to seek within streams would break that 
assumption. It's very nice to be able to assume a strict order of messages per 
partition, because it makes the processing model much easier to reason about, 
makes fault tolerance semantics much more tractable, and allows us to make 
performance optimizations.

For those reasons, we'd have to think very carefully about whether we want to 
allow rewinding streams. It's clearly a powerful feature, perhaps _too_ 
powerful.

> Rewinding Streams within a StreamTask
> -------------------------------------
>
>                 Key: SAMZA-255
>                 URL: https://issues.apache.org/jira/browse/SAMZA-255
>             Project: Samza
>          Issue Type: Wish
>            Reporter: Nicolas Bär
>            Priority: Minor
>
> The many benefits of Kafka include persistent storage and its resulting 
> possibility to rewind streams to a specific offset. Samza does currently not 
> support rewinding of streams within a StreamTask. I'd like to place this 
> functionality as a feature request and provide two use cases to further 
> describe the benefits of such a feature. Let's consider a general use case to 
> aggregate values within sliding windows.
> 1. Offline-Processing
> In case of offline-processing the sliding window does not correlate to the 
> system time. In this case any node failure will result in samza restoring 
> from a checkpointed offset that most probably does not match the beginning of 
> the most recent sliding window. But in order to gain precise results, one 
> could rewind to the specific offset and process the missing events of the 
> sliding window. The same holds for any use case where the data has to be 
> processed in small batches and these batches do not correspond to the system 
> time.
> 2. Late Arrival
> Messages might get delayed before they are stored into Kafka. In this case 
> one could rewind the offset in order to process older messages corresponding 
> to the same sliding window.
> I'd be happy to further discuss these cases and the proposed feature request.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to