[jira] [Commented] (SAMZA-252) Document stream reprocessing

Jay Kreps (JIRA) Tue, 10 Jun 2014 10:15:36 -0700

    [ https://issues.apache.org/jira/browse/SAMZA-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026693#comment-14026693 ]


Jay Kreps commented on SAMZA-252:
---------------------------------

Here is what I was thinking. I think there is a lot of value in documenting
the different options for re-processing and helping people understand them.

But after thinking about it, I think the simplest approach is a variation on
Martin's proposal. Consider the simplest case, where the input is a Kafka topic
and the output is a database that is doing live serving. The simplest pattern
is to have the current job write to a table named output_n. When you change the
job, deploy a second version of the job starting at offset 0 and writing its
results to a table named output_n+1. Give the application a switch that
controls which table it queries (or, in an RDBMS, use a view that points at the
current table). This lets you swap the app back and forth and even A/B test the
two versions against one another. I think this is the same as what Martin
described. The only nuance is carrying the versioning through to the output
table(s) rather than trying to dedupe on write within the same table, which has
more gotchas.
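
To make the pattern concrete, here is a rough sketch of the versioned-table
idea, not a Samza implementation. It uses Python's built-in sqlite3 module as a
stand-in for the serving database and a plain list as a stand-in for a Kafka
topic replayed from offset 0. The names run_job, point_view_at, lookup,
output_current, and the column names are all made up for illustration.

```python
import sqlite3


def run_job(conn, version, messages, classify):
    """Replay every message from the start into a fresh table output_<version>.

    `messages` stands in for a Kafka topic read from offset 0 (an assumption
    for this sketch); `classify` is the job logic for this version.
    """
    table = f"output_{int(version)}"
    conn.execute(f"CREATE TABLE {table} (msg_key TEXT PRIMARY KEY, label TEXT)")
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?)",
        [(key, classify(value)) for key, value in messages],
    )
    conn.commit()
    return table


def point_view_at(conn, table):
    """Repoint the serving view (hypothetical name output_current) at a table."""
    conn.execute("DROP VIEW IF EXISTS output_current")
    conn.execute(f"CREATE VIEW output_current AS SELECT * FROM {table}")
    conn.commit()


def lookup(conn, key):
    """What the serving application does: it only ever queries the view."""
    row = conn.execute(
        "SELECT label FROM output_current WHERE msg_key = ?", (key,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    topic = [("m1", "hello"), ("m2", "buy now")]

    # Version n of the job writes to output_1 and serves traffic.
    old_table = run_job(conn, 1, topic, lambda text: "ok")
    point_view_at(conn, old_table)

    # Changed job: reprocess from the start into output_2, then swap the view.
    new_table = run_job(conn, 2, topic, lambda text: "spam" if "buy" in text else "ok")
    point_view_at(conn, new_table)
    print(lookup(conn, "m2"))

    # Rolling back is just pointing the view at the old table again.
    point_view_at(conn, old_table)
    print(lookup(conn, "m2"))
```

Because each job version owns its own table, neither version has to dedupe
against the other's writes, and both tables stay available for comparison.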

I think this also generalizes to the case where you have chained jobs, in the
same way Martin described.

> Document stream reprocessing
> ----------------------------
>
>                 Key: SAMZA-252
>                 URL: https://issues.apache.org/jira/browse/SAMZA-252
>             Project: Samza
>          Issue Type: Improvement
>          Components: docs
>    Affects Versions: 0.6.0
>            Reporter: Chris Riccomini
>            Assignee: Martin Kleppmann
>             Fix For: 0.7.0, 0.8.0
>
>         Attachments: SAMZA-252.1.patch
>
>
> A common need in stream processing is to re-process prior messages at some
> later date. For example, a stream processing job might classify messages
> using a machine learning algorithm. At some point, the algorithm will be
> updated with a more accurate vector of weights. When this happens, you
> usually want to re-process past messages to get more accurate results. This
> is usually solved by running a parallel pipeline from Hadoop.
> We have thought extensively about this use case, and should document how to
> use Samza for re-processing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
