Dmitry Goldenberg created SPARK-9434:
----------------------------------------
Summary: Need how-to for resuming direct Kafka streaming consumers
where they had left off before getting terminated, OR actual support for that
mode in the Streaming API
Key: SPARK-9434
URL: https://issues.apache.org/jira/browse/SPARK-9434
Project: Spark
Issue Type: Improvement
Components: Documentation, Examples, Streaming
Affects Versions: 1.4.1
Reporter: Dmitry Goldenberg
We've been getting some mixed information regarding how to cause our direct
streaming consumers to resume processing from where they left off in terms of
the Kafka offsets.
On the one hand side, we're hearing "If you are restarting the streaming app
with Direct kafka from the checkpoint information (that is, restarting), then
the last read offsets are automatically recovered, and the data will start
processing from that offset. All the N records added in T will stay buffered in
Kafka." (where T is the interval of time during which the consumer was down).
On the other hand, there are tickets such as SPARK-6249 and SPARK-8833 which
are marked as "won't fix" which seem to ask for the functionality we need, with
comments like "I don't want to add more config options with confusing semantics
around what is being used for the system of record for offsets, I'd rather make
it easy for people to explicitly do what they need."
The use-case is actually very clear and doesn't ask for confusing semantics. An
API option to resume reading where you left off, in addition to the smallest or
greatest auto.offset.reset should be *very* useful, probably for quite a few
folks.
We're asking for this as an enhancement request. SPARK-8833 states " I am
waiting for getting enough usecase to float in before I take a final call."
We're adding to that.
In the meantime, can you clarify the confusion? Does direct streaming persist
the progress information into "DStream checkpoints" or does it not? If it
does, why is it that we're not seeing that happen? Our consumers start with
auto.offset.reset=greatest and that causes them to read from the first offset
of data that is written to Kafka *after* the consumer has been restarted,
meaning we're missing data that had come in while the consumer was down.
If the progress is stored in "DStream checkpoints", we want to know a) how to
cause that to work for us and b) where the said checkpointing data is stored
physically.
Conversely, if this is not accurate, then is our only choice to manually
persist the offsets into Zookeeper? If that is the case then a) we'd like a
clear, more complete code sample to be published, since the one in the Kafka
streaming guide is incomplete (it lacks the actual lines of code persisting the
offsets) and b) we'd like to request that SPARK-8833 be revisited as a feature
worth implementing in the API.
Thanks.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]