Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

Amit Ramesh Thu, 11 Jun 2015 22:50:56 -0700

Hi Jerry,

Take a look at this example:
https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2

The offsets are needed because as RDDs get generated within spark the
offsets move further along. With direct Kafka mode the current offsets are
no more persisted in Zookeeper but rather within Spark itself. If you want
to be able to use zookeeper based monitoring tools to keep track of
progress, then this is needed.

In my specific case we need to persist Kafka offsets externally so that we
can continue from where we left off after a code deployment. In other
words, we need exactly-once processing guarantees across code deployments.
Spark does not support any state persistence across deployments so this is
something we need to handle on our own.

Hope that helps. Let me know if not.

Thanks!
Amit

On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao <sai.sai.s...@gmail.com>
wrote:

> Hi,
>
> What is your meaning of getting the offsets from the RDD, from my
> understanding, the offsetRange is a parameter you offered to KafkaRDD, why
> do you still want to get the one previous you set into?
>
> Thanks
> Jerry
>
> 2015-06-12 12:36 GMT+08:00 Amit Ramesh <a...@yelp.com>:
>
>>
>> Congratulations on the release of 1.4!
>>
>> I have been trying out the direct Kafka support in python but haven't
>> been able to figure out how to get the offsets from the RDD. Looks like the
>> documentation is yet to be updated to include Python examples (
>> https://spark.apache.org/docs/latest/streaming-kafka-integration.html).
>> I am specifically looking for the equivalent of
>> https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2.
>> I tried digging through the python code but could not find anything
>> related. Any pointers would be greatly appreciated.
>>
>> Thanks!
>> Amit
>>
>>
>

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

Reply via email to