[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637621#comment-15637621 ]
Cody Koeninger commented on SPARK-18258: ---------------------------------------- So one obvious one is that if wherever checkpoint data is being stored fails or is corrupted, my downstream database can still be fine and have correct results, yet I have no way of restarting the job from a known point because the batch id stored in the database is now meaningless. Basically, I do not want to introduce another N points of failure in between Kafka and my data store. > Sinks need access to offset representation > ------------------------------------------ > > Key: SPARK-18258 > URL: https://issues.apache.org/jira/browse/SPARK-18258 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Reporter: Cody Koeninger > > Transactional "exactly-once" semantics for output require storing an offset > identifier in the same transaction as results. > The Sink.addBatch method currently only has access to batchId and data, not > the actual offset representation. > I want to store the actual offsets, so that they are recoverable as long as > the results are and I'm not locked in to a particular streaming engine. > I could see this being accomplished by adding parameters to Sink.addBatch for > the starting and ending offsets (either the offsets themselves, or the > SPARK-17829 string/json representation). That would be an API change, but if > there's another way to map batch ids to offset representations without > changing the Sink api that would work as well. > I'm assuming we don't need the same level of access to offsets throughout a > job as e.g. the Kafka dstream gives, because Sinks are the main place that > should need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org