[
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637621#comment-15637621
]
Cody Koeninger commented on SPARK-18258:
----------------------------------------
One obvious case: if the storage holding the checkpoint data fails or is
corrupted, my downstream database can still be fine and hold correct results,
yet I have no way of restarting the job from a known point, because the batch
id stored in the database is now meaningless.
Basically, I do not want to introduce another N points of failure in between
Kafka and my data store.
> Sinks need access to offset representation
> ------------------------------------------
>
> Key: SPARK-18258
> URL: https://issues.apache.org/jira/browse/SPARK-18258
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Reporter: Cody Koeninger
>
> Transactional "exactly-once" semantics for output require storing an offset
> identifier in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not
> the actual offset representation.
> I want to store the actual offsets, so that they are recoverable as long as
> the results are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for
> the starting and ending offsets (either the offsets themselves, or the
> SPARK-17829 string/JSON representation). That would be an API change, but if
> there's another way to map batch ids to offset representations without
> changing the Sink API, that would work as well.
> I'm assuming we don't need the same level of access to offsets throughout a
> job as e.g. the Kafka dstream gives, because Sinks are the main place that
> should need them.
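
To make the request concrete, here is a minimal sketch of what a sink could do
if it were handed the offsets. It is not Spark code: add_batch, init_store,
load_offsets, the table names, and the (topic, partition) -> offset dict are
all hypothetical, and SQLite stands in for whatever downstream store is in
use. The only point is that results and ending offsets commit in one
transaction, so the store alone is enough to find the restart point.

```python
import sqlite3


def init_store(conn):
    # Results and offsets share one database so a single transaction covers both.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "word TEXT PRIMARY KEY, total INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offsets ("
        "topic TEXT NOT NULL, part INTEGER NOT NULL, "
        "next_offset INTEGER NOT NULL, PRIMARY KEY (topic, part))"
    )
    conn.commit()


def load_offsets(conn):
    # On restart this is the known point to resume from; no checkpoint is read.
    cur = conn.execute("SELECT topic, part, next_offset FROM offsets")
    return {(topic, part): off for topic, part, off in cur}


def add_batch(conn, rows, start_offsets, end_offsets):
    """Hypothetical sink write. Returns False if the batch is a replay."""
    with conn:  # commits on success, rolls back if anything below raises
        stored = load_offsets(conn)
        for tp, start in start_offsets.items():
            # A partition with no stored offset accepts any starting point.
            if stored.get(tp, start) != start:
                return False  # already applied (or out of order): skip it
        for word, n in rows:
            cur = conn.execute(
                "UPDATE results SET total = total + ? WHERE word = ?", (n, word)
            )
            if cur.rowcount == 0:
                conn.execute(
                    "INSERT INTO results (word, total) VALUES (?, ?)", (word, n)
                )
        for (topic, part), end in end_offsets.items():
            conn.execute(
                "INSERT OR REPLACE INTO offsets (topic, part, next_offset) "
                "VALUES (?, ?, ?)",
                (topic, part, end),
            )
    return True


conn = sqlite3.connect(":memory:")
init_store(conn)
tp = ("events", 0)  # hypothetical topic name and partition
add_batch(conn, [("a", 2), ("b", 1)], {tp: 0}, {tp: 3})
add_batch(conn, [("a", 2), ("b", 1)], {tp: 0}, {tp: 3})  # replay, ignored
```

With only a batch id in the store, the replay check and the restart point both
depend on some other system mapping ids back to offsets. With the offsets
themselves stored next to the results, load_offsets is all a restarted job
needs, regardless of which streaming engine runs it.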