[
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15638470#comment-15638470
]
Reynold Xin commented on SPARK-18258:
-------------------------------------
This makes sense. It's just extra information that you want so you can see
what's going on.
Can you sketch the API out and put a proposal in the ticket description?
It doesn't need to be very well thought out; a rough sketch will move the
discussion forward.
> Sinks need access to offset representation
> ------------------------------------------
>
> Key: SPARK-18258
> URL: https://issues.apache.org/jira/browse/SPARK-18258
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Reporter: Cody Koeninger
>
> Transactional "exactly-once" semantics for output require storing an offset
> identifier in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not
> the actual offset representation.
> I want to store the actual offsets, so that they are recoverable as long as
> the results are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for
> the starting and ending offsets (either the offsets themselves, or the
> SPARK-17829 string/json representation). That would be an API change, but if
> there's another way to map batch ids to offset representations without
> changing the Sink API, that would work as well.
> I'm assuming we don't need the same level of access to offsets throughout a
> job as, e.g., the Kafka DStream gives, because Sinks are the main place that
> should need them.
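
As a rough illustration of the idea in the description (not Spark's actual
Sink API), the sketch below models a sink whose add_batch call also receives
the starting and ending offsets and stores them in the same database
transaction as the results. Everything here is an assumption made for the
example: the TransactionalSink class, the add_batch signature, the table
names, the "events-0" partition key, and the use of SQLite and Python rather
than the Scala Sink trait.

```python
import json
import sqlite3


class TransactionalSink:
    """Toy sink (hypothetical): results and offsets share one transaction."""

    def __init__(self, conn):
        self.conn = conn
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS results "
                "(batch_id INTEGER, value TEXT)"
            )
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS offsets "
                "(batch_id INTEGER PRIMARY KEY, start_json TEXT, end_json TEXT)"
            )

    def add_batch(self, batch_id, rows, start_offsets, end_offsets):
        """Hypothetical addBatch variant that also receives offset ranges.

        Returns True if the batch was written, False if it had already
        been committed (for example, when a batch is replayed after a crash).
        """
        # The connection context manager commits if the block succeeds and
        # rolls back if it raises, so results and offsets land together.
        with self.conn:
            already = self.conn.execute(
                "SELECT 1 FROM offsets WHERE batch_id = ?", (batch_id,)
            ).fetchone()
            if already:
                return False
            self.conn.executemany(
                "INSERT INTO results (batch_id, value) VALUES (?, ?)",
                [(batch_id, value) for value in rows],
            )
            # Offsets are stored as JSON strings, so they can be read back
            # without depending on any particular streaming engine.
            self.conn.execute(
                "INSERT INTO offsets (batch_id, start_json, end_json) "
                "VALUES (?, ?, ?)",
                (
                    batch_id,
                    json.dumps(start_offsets, sort_keys=True),
                    json.dumps(end_offsets, sort_keys=True),
                ),
            )
            return True

    def last_committed_offsets(self):
        """Return the end offsets of the latest committed batch, or None."""
        row = self.conn.execute(
            "SELECT end_json FROM offsets ORDER BY batch_id DESC LIMIT 1"
        ).fetchone()
        return json.loads(row[0]) if row else None


if __name__ == "__main__":
    sink = TransactionalSink(sqlite3.connect(":memory:"))
    sink.add_batch(0, ["a", "b"], {"events-0": 0}, {"events-0": 2})
    sink.add_batch(0, ["a", "b"], {"events-0": 0}, {"events-0": 2})  # replay
    print(sink.last_committed_offsets())
```

Because the offsets row is written in the same transaction as the results, a
replayed batch is skipped rather than duplicated, and the stored end offsets
give a restart point that does not depend on batch ids alone.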