There are already private methods in the code for interacting with Kafka's
offset management API.

There's a JIRA for making those methods public, but TD has been reluctant
to merge it:

https://issues.apache.org/jira/browse/SPARK-10963
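
For context, without those methods exposed, offset bookkeeping with the
direct stream is left entirely to the application. The documented
HasOffsetRanges pattern looks roughly like this (a sketch only; the broker
address, topic name, and the "store the offsets" step are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

// assumes an existing StreamingContext called ssc
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))  // placeholder topic

var offsetRanges = Array.empty[OffsetRange]

stream.transform { rdd =>
  // capture the offset ranges before any shuffle loses the Kafka partition mapping
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { rdd =>
  // ... process the batch ...
  // then the application has to persist offsetRanges itself (ZK, Kafka, a DB, etc.)
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}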

I think adding any ZK-specific behavior to Spark is a bad idea, since ZK
may no longer be the preferred storage location for Kafka offsets within
the next year.
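
And on restart, resuming from those stored offsets means rebuilding the
fromOffsets map yourself and passing it to the overload that takes one
(which is what the first point below would automate). Again a sketch only,
reusing ssc and kafkaParams from above, with placeholder values:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// offsets the application stored somewhere on the previous run (placeholder values)
val fromOffsets = Map(TopicAndPartition("events", 0) -> 12345L)

val restarted = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))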



On Mon, Nov 16, 2015 at 9:53 AM, Nick Evans <m...@nicolasevans.org> wrote:

> I really like the Streaming receiverless API for Kafka streaming jobs, but
> I'm finding the manual offset management adds a fair bit of complexity. I'm
> sure that others feel the same way, so I'm proposing that we add the
> ability to have consumer offsets managed via an easy-to-use API. This would
> be done similarly to how it is done in the receiver API.
>
> I haven't written any code yet, but I've looked at the current version of
> the codebase and have an idea of how it could be done.
>
> To keep the size of the pull requests small, I propose that the following
> distinct features are added in order:
>
>    1. If a group ID is set in the Kafka params, and also if fromOffsets
>    is not passed in to createDirectStream, then attempt to resume from the
>    remembered offsets for that group ID.
>    2. Add a method on KafkaRDDs that commits the offsets for that
>    KafkaRDD to Zookeeper.
>    3. Update the Python API with any necessary changes.
>
> My goal is to not break the existing API while adding the new
> functionality.
>
> One point that I'm not sure of is regarding the first point. I'm not sure
> whether it's a better idea to set the group ID as mentioned through Kafka
> params, or to define a new overload of createDirectStream that expects the
> group ID in place of the fromOffsets param. I think the latter is a cleaner
> interface, but I'm not sure whether adding a new param is a good idea.
>
> If anyone has any feedback on this general approach, I'd be very grateful.
> I'm going to open a JIRA in the next couple days and begin working on the
> first point, but I think comments from the community would be very helpful
> on building a good API here.
>
>
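
For what it's worth, the overload discussed in the last paragraph would
presumably look something like the sketch below. The signature is purely
hypothetical; nothing like it exists in KafkaUtils today, and the group ID
could just as easily stay in kafkaParams instead.

// Hypothetical sketch only, not an existing Spark API: a createDirectStream
// variant that takes a consumer group in place of an explicit fromOffsets map.
def createDirectStream[
    K, V,
    KD <: kafka.serializer.Decoder[K],
    VD <: kafka.serializer.Decoder[V]](
    ssc: org.apache.spark.streaming.StreamingContext,
    kafkaParams: Map[String, String],
    topics: Set[String],
    groupId: String
  ): org.apache.spark.streaming.dstream.InputDStream[(K, V)] = ???
// It would look up the last committed offsets for groupId and fall back to
// auto.offset.reset when none exist.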
