[
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645992#comment-14645992
]
Dmitry Goldenberg commented on SPARK-9434:
------------------------------------------
@Sean Owen: Per your point about moving this discussion to the mailing list,
I [had already posted my question
there|http://apache-spark-user-list.1001560.n3.nabble.com/What-happens-when-a-streaming-consumer-job-is-killed-then-restarted-tt23348.html],
and it has remained unanswered since June 16th.
> Need how-to for resuming direct Kafka streaming consumers where they had left
> off before getting terminated, OR actual support for that mode in the
> Streaming API
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-9434
> URL: https://issues.apache.org/jira/browse/SPARK-9434
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, Examples, Streaming
> Affects Versions: 1.4.1
> Reporter: Dmitry Goldenberg
>
> We've been getting some mixed information regarding how to cause our direct
> streaming consumers to resume processing from where they left off in terms of
> the Kafka offsets.
> On the one hand, we're hearing "If you are restarting the streaming app
> with Direct kafka from the checkpoint information (that is, restarting), then
> the last read offsets are automatically recovered, and the data will start
> processing from that offset. All the N records added in T will stay buffered
> in Kafka." (where T is the interval of time during which the consumer was
> down).
> On the other hand, there are tickets such as SPARK-6249 and SPARK-8833, both
> marked as "won't fix", which seem to ask for the functionality we need,
> with comments like "I don't want to add more config options with confusing
> semantics around what is being used for the system of record for offsets, I'd
> rather make it easy for people to explicitly do what they need."
> The use-case is actually very clear and doesn't ask for confusing semantics.
> An API option to resume reading where you left off, in addition to the
> smallest and largest settings of auto.offset.reset, would be *very* useful,
> probably for quite a few folks.
> We're asking for this as an enhancement request. SPARK-8833 states "I am
> waiting for getting enough usecase to float in before I take a final call."
> We're adding to that.
> In the meantime, can you clarify the confusion? Does direct streaming
> persist the progress information into "DStream checkpoints" or does it not?
> If it does, why is it that we're not seeing that happen? Our consumers start
> with auto.offset.reset=largest, which causes them to read from the first
> offset written to Kafka *after* the consumer has been restarted, meaning
> we miss the data that came in while the consumer was down.
> If the progress is stored in "DStream checkpoints", we want to know a) how to
> make that work for us and b) where that checkpoint data is physically
> stored.
> Conversely, if this is not accurate, then is our only choice to manually
> persist the offsets into Zookeeper? If that is the case then a) we'd like a
> clear, more complete code sample to be published, since the one in the Kafka
> streaming guide is incomplete (it lacks the actual lines of code persisting
> the offsets) and b) we'd like to request that SPARK-8833 be revisited as a
> feature worth implementing in the API.
> Thanks.
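
To make the requested semantics concrete, here is a toy sketch in plain Python.
It is not Spark or Kafka code: the names starting_offset, consume and the
in-memory "stored" dictionary are hypothetical stand-ins for whatever ends up
holding the offsets (checkpoints, Zookeeper, or something else). It assumes
only that a partition can be modelled as a list and that auto.offset.reset
takes the Kafka 0.8 values "smallest" and "largest".

```python
from typing import Dict, List, Tuple

# Toy model only: none of these names come from Spark or Kafka.
TopicPartition = Tuple[str, int]


def starting_offset(stored: Dict[TopicPartition, int], tp: TopicPartition,
                    log_length: int, reset: str = "largest") -> int:
    """Choose where a restarted consumer begins reading one partition."""
    if tp in stored:
        # Requested behaviour: resume exactly where the last run stopped.
        return stored[tp]
    # No saved progress: fall back to the reset policy.
    return log_length if reset == "largest" else 0


def consume(log: List[str], stored: Dict[TopicPartition, int],
            tp: TopicPartition, reset: str = "largest") -> List[str]:
    """Read everything available, then record the next offset to read."""
    start = starting_offset(stored, tp, len(log), reset)
    batch = log[start:]
    stored[tp] = start + len(batch)  # record progress after reading the batch
    return batch
```

In this model, a consumer that keeps its "stored" dictionary across a restart
picks up the records appended while it was down, whereas one that starts with
an empty dictionary and reset="largest" skips them, which is the data loss
described in the issue.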