The conversation was mostly getting TD up to speed on this thread since he
had just gotten back from his trip and hadn't seen it.
The jira has a summary of the requirements we discussed; I'm sure TD or
Patrick can add to the ticket if I missed something.
On Dec 25, 2014 1:54 AM, Hari Shreedharan hshreedha...@cloudera.com
wrote:
In general such discussions happen on, or are posted to, the dev lists.
Could you please post a summary? Thanks.
Thanks,
Hari
On Wed, Dec 24, 2014 at 11:46 PM, Cody Koeninger c...@koeninger.org
wrote:
After a long talk with Patrick and TD (thanks guys), I opened the
following jira
https://issues.apache.org/jira/browse/SPARK-4964
The sample PR has an implementation for both the batch and the dstream case,
and a link to a project with example usage.
On Fri, Dec 19, 2014 at 4:36 PM, Koert Kuipers ko...@tresata.com wrote:
yup, we at tresata do the idempotent store the same way. very simple
approach.
On Fri, Dec 19, 2014 at 5:32 PM, Cody Koeninger c...@koeninger.org
wrote:
That KafkaRDD code is dead simple.
Given a user-specified map
(topic1, partition0) -> (startingOffset, endingOffset)
(topic1, partition1) -> (startingOffset, endingOffset)
...
turn each one of those entries into a partition of an rdd, using the
simple consumer.
That's it. No recovery logic, no state, nothing - for any failures, bail
on the rdd and let it retry.
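A minimal sketch of that idea in Scala (OffsetRange, KafkaRDDSketch, and
the consumer wiring are illustrative stand-ins, not the actual code from
the linked PR):

    // Sketch only: one RDD partition per user-specified offset range.
    // OffsetRange and KafkaRDDSketch are hypothetical stand-ins.
    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    case class OffsetRange(topic: String, partition: Int,
                           fromOffset: Long, untilOffset: Long)

    class KafkaRDDSketch(sc: SparkContext, ranges: Array[OffsetRange])
      extends RDD[String](sc, Nil) {

      // Each (topic, partition) -> (startingOffset, endingOffset) entry
      // becomes exactly one partition of the RDD.
      override def getPartitions: Array[Partition] =
        ranges.indices.map(i => new Partition { val index = i }).toArray

      override def compute(split: Partition, ctx: TaskContext): Iterator[String] = {
        val r = ranges(split.index)
        // A kafka.consumer.SimpleConsumer would fetch messages for
        // r.topic / r.partition from r.fromOffset until r.untilOffset here.
        // No recovery logic or state: on any failure, the task just
        // retries the same deterministic range.
        Iterator.empty // placeholder for the fetch loop
      }
    }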
Spark stays out of the business of being a distributed database.
The client code does any transformation it wants, then stores the data
and offsets. There are two ways of doing this, based either on idempotence
or on a transactional data store.
For idempotent stores:
1. manipulate data
2. save data to store
3. save ending offsets to the same store
If you fail between 2 and 3, the offsets haven't been stored; you start
again at the same beginning offsets, do the same calculations in the same
order, overwrite the same data, and all is good.
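A sketch of that pattern, assuming a hypothetical key-value store API
(KeyValueStore, Result, and transform are placeholders, and OffsetRange
is from the sketch above):

    // Sketch of the idempotent-store pattern; KeyValueStore, Result, and
    // transform are hypothetical placeholders, not a real API.
    object IdempotentSketch {
      trait KeyValueStore { def put(key: String, value: Any): Unit }
      case class Result(key: String, value: String)
      def transform(msg: String): Result = Result(msg.hashCode.toString, msg)

      def processIdempotently(batch: Seq[String], range: OffsetRange,
                              store: KeyValueStore): Unit = {
        val results = batch.map(transform)          // 1. manipulate data
        results.foreach(r => store.put(r.key, r))   // 2. save data under deterministic keys
        store.put(s"offsets:${range.topic}:${range.partition}",
                  range.untilOffset)                // 3. save ending offsets last
        // A crash between 2 and 3 leaves the offsets unadvanced: the same
        // range is reprocessed, the same keys are overwritten, all is good.
      }
    }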
For transactional stores:
1. manipulate data
2. begin transaction
3. save data to the store
4. save offsets
5. commit transaction
If you fail before 5, the transaction rolls back. To make this less
heavyweight, you can write the data outside the transaction and then
update a pointer to the current data inside the transaction.
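A sketch of the transactional pattern using plain JDBC (the table and
column names are invented; transform is the placeholder from the previous
sketch):

    // Sketch of the transactional pattern with plain JDBC; the table and
    // column names are invented for illustration.
    import java.sql.Connection

    object TransactionalSketch {
      import IdempotentSketch.transform

      def processTransactionally(batch: Seq[String], range: OffsetRange,
                                 conn: Connection): Unit = {
        val results = batch.map(transform)               // 1. manipulate data
        conn.setAutoCommit(false)                        // 2. begin transaction
        val insert = conn.prepareStatement(
          "INSERT INTO results (k, v) VALUES (?, ?)")
        results.foreach { r =>                           // 3. save data to the store
          insert.setString(1, r.key)
          insert.setString(2, r.value)
          insert.executeUpdate()
        }
        val offsets = conn.prepareStatement(
          "UPDATE offsets SET until = ? WHERE topic = ? AND part = ?")
        offsets.setLong(1, range.untilOffset)            // 4. save offsets
        offsets.setString(2, range.topic)
        offsets.setInt(3, range.partition)
        offsets.executeUpdate()
        conn.commit()                                    // 5. commit transaction
        // Any failure before commit() rolls the whole transaction back,
        // so the data and the offsets always move together.
      }
    }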
Again, spark has nothing much to do with guaranteeing exactly once. In
fact, the current streaming api actively impedes my ability to do the
above. I'm just suggesting providing an api that doesn't get in the way
of exactly-once.
On Fri, Dec 19, 2014 at 3:57 PM, Hari Shreedharan
hshreedha...@cloudera.com
wrote:
Can you explain your basic algorithm for the once-only delivery? It is
quite a bit of very Kafka-specific code that would take more time to read
than I can currently afford. If you can explain your algorithm a bit, it
might help.
Thanks,
Hari
On Fri, Dec 19, 2014 at 1:48 PM, Cody Koeninger c...@koeninger.org
wrote:
The problems you guys are discussing come from trying to store state in
spark, so don't do that. Spark isn't a distributed database.
Just map kafka partitions directly to rdds, let user code specify the
range of offsets explicitly, and let them be in charge of committing
offsets.
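A sketch of what that user-driven flow could look like, reusing the
hypothetical KafkaRDDSketch and KeyValueStore from the sketches above:

    // Sketch of the user-driven flow: build an RDD from explicit offset
    // ranges, process it, then commit the ending offsets yourself.
    import org.apache.spark.SparkContext
    import IdempotentSketch.{KeyValueStore, transform}

    object UserDrivenFlow {
      def runBatch(sc: SparkContext, ranges: Array[OffsetRange],
                   store: KeyValueStore): Unit = {
        val rdd = new KafkaRDDSketch(sc, ranges)
        rdd.foreachPartition { msgs =>
          // (in real code the store handle would be created inside each
          // partition rather than captured in the closure)
          msgs.map(transform).foreach(r => store.put(r.key, r))
        }
        // Only after the data is safely stored does user code commit the
        // offsets; spark itself never tracks or commits them.
        ranges.foreach { r =>
          store.put(s"offsets:${r.topic}:${r.partition}", r.untilOffset)
        }
      }
    }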
Using the simple consumer isn't that bad; I'm already using this in
production with the code I linked to, and tresata apparently has been as
well. Again, for everyone saying this is impossible, have you read either
of those implementations and looked at the approach?
On Fri, Dec 19, 2014 at 2:27 PM, Sean McNamara
sean.mcnam...@webtrends.com wrote:
Please feel free to correct me if I’m wrong, but I think the exactly once
spark streaming semantics can easily be solved using updateStateByKey.
Make the key going into updateStateByKey be a hash of the event, or pluck
off some uuid from the message. The updateFunc would only emit the message
if the key did not exist, and the user has complete control over the
window of time / state lifecycle for detecting duplicates. It also makes
it really easy to detect and take action (alert?) when you DO see a
duplicate, or make memory tradeoffs within an error bound using a sketch
algorithm.
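A minimal sketch of that idea (eventId is a placeholder for the hash/uuid
extraction, and checkpointing must be enabled since updateStateByKey keeps
state):

    // Sketch of the updateStateByKey dedup idea; eventId stands in for
    // "a hash of the event, or some uuid plucked off the message".
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.dstream.DStream

    object DedupSketch {
      def eventId(m: String): String = m.hashCode.toString // placeholder

      // Emits each message only the first time its key is seen; requires
      // ssc.checkpoint(...) because updateStateByKey is stateful.
      def dedup(messages: DStream[String]): DStream[String] = {
        val keyed = messages.map(m => (eventId(m), m))
        keyed.updateStateByKey[(String, Boolean)] {
          (newMsgs: Seq[String], state: Option[(String, Boolean)]) =>
            state match {
              case Some((msg, _)) => Some((msg, false)) // duplicate: drop
              case None           => newMsgs.headOption.map(m => (m, true))
            }
          // a real version would expire state (return None) after some
          // window to bound memory, as described above
        }
        .filter { case (_, (_, isNew)) => isNew } // keep first sightings only
        .map { case (_, (msg, _)) => msg }
      }
    }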
The kafka simple consumer is insanely complex; if possible, I think it
would be better (and vastly more flexible) to get reliability using the
primitives that spark so elegantly provides.
Cheers,
Sean
On Dec 19, 2014, at 12:06 PM, Hari Shreedharan
hshreedha...@cloudera.com wrote:
Hi Dibyendu,
Thanks for the details on the implementation. But I still do not believe
that it gives no duplicates - what they achieve is that the same batch is
processed exactly the same way every time (but note it may be processed
more than once) - so it depends on the operation being idempotent. I
believe Trident uses ZK to keep track of the transactions - a batch can be
processed multiple times in