[
https://issues.apache.org/jira/browse/SAMZA-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026693#comment-14026693
]
Jay Kreps commented on SAMZA-252:
---------------------------------
Here is what I was thinking. There is a lot of value in documenting the
different options for re-processing and helping people understand them.
But after thinking about it, I think the simplest approach is a variation on
Martin's proposal. Consider the simplest case, where you have a Kafka topic as
input and a database as output that is doing live serving. The simplest
pattern is to have the current job write to a table named output_n and, when
you change the job, deploy a second version of the job starting at offset 0 and
writing its results to a table named output_n+1. Give the application a switch
controlling which table it queries (or in an RDBMS you can just use a view that
points at the current table). This allows you to swap the app back and forth and
even A/B test the two against one another. I think this is the same as what
Martin described; the only nuance is carrying the versioning through to the
output table(s) rather than trying to dedupe on write in the same table, which
has more gotchas.
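To make the table-per-version idea concrete, here is a minimal sketch of the
view-based switch. It uses Python with an in-memory SQLite database purely for
illustration. The view name output_current, the helper functions, and the
key/label columns are all assumptions made up for this example; they are not
part of Samza or of Martin's proposal, and the two write_results calls only
stand in for two deployed versions of a job.

```python
import sqlite3


def write_results(conn, version, rows):
    """Stand-in for one job version writing to its own table output_<version>."""
    table = f"output_{int(version)}"  # int() keeps the table name well-formed
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (key TEXT PRIMARY KEY, label TEXT)"
    )
    conn.executemany(
        f"INSERT OR REPLACE INTO {table} (key, label) VALUES (?, ?)", rows
    )
    conn.commit()


def point_view_at(conn, version):
    """Repoint the (hypothetical) serving view at the table for `version`."""
    conn.execute("DROP VIEW IF EXISTS output_current")
    conn.execute(
        f"CREATE VIEW output_current AS "
        f"SELECT key, label FROM output_{int(version)}"
    )
    conn.commit()


def lookup(conn, key):
    """What the serving application runs; it only ever sees the view."""
    row = conn.execute(
        "SELECT label FROM output_current WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")

# Version 1 of the job has been running and serving traffic.
write_results(conn, 1, [("msg-1", "spam"), ("msg-2", "ham")])
point_view_at(conn, 1)
before = lookup(conn, "msg-1")

# Version 2 re-processes the input from offset 0 into its own table.
write_results(conn, 2, [("msg-1", "ham"), ("msg-2", "ham")])

# Swap the application over, then back again; neither table is modified.
point_view_at(conn, 2)
after = lookup(conn, "msg-1")
point_view_at(conn, 1)
rolled_back = lookup(conn, "msg-1")
```

Because each version owns its table, nothing is deduplicated on write, and
switching between versions (or comparing them) only touches the view.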
This can be generalized to the case where you have chained jobs as well, I
think, in the same way Martin described.
> Document stream reprocessing
> ----------------------------
>
> Key: SAMZA-252
> URL: https://issues.apache.org/jira/browse/SAMZA-252
> Project: Samza
> Issue Type: Improvement
> Components: docs
> Affects Versions: 0.6.0
> Reporter: Chris Riccomini
> Assignee: Martin Kleppmann
> Fix For: 0.7.0, 0.8.0
>
> Attachments: SAMZA-252.1.patch
>
>
> A common need in stream processing is to re-process prior messages at some
> later date. An example of this is having a stream processing job that is
> classifying messages in some way using a machine learning algorithm. At some
> point, the algorithm will be updated with a more accurate vector of weights.
> When this happens, usually you wish to re-process past messages to get more
> accurate results. Usually this is solved by running a parallel pipeline from
> Hadoop.
> We have thought extensively about this use case, and should document how to
> use Samza in a re-processing use case.