Are you asking for commits for every message?  Because that will kill
performance.
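
If you only need at-least-once, committing once per batch inside
foreachRDD (as shown in the integration guide) keeps the overhead low.
A rough sketch, assuming a stream created via
KafkaUtils.createDirectStream:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // capture the offset ranges for this batch before any shuffle/repartition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... your output operation here ...

  // a single async commit per batch, issued after the output has completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}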

On Thu, Apr 27, 2017 at 11:33 AM, Dominik Safaric
<dominiksafa...@gmail.com> wrote:
> Indeed I have. But even when storing the offsets in Spark and committing
> them upon completion of an output operation within the foreachRDD call
> (as pointed out in the example), the only offset that Spark’s Kafka
> implementation commits to Kafka is that of the last message. For example,
> given 100 million messages, Spark commits only the 100 millionth offset,
> not the offsets of the intermediate batches - hence the question.
>
>> On 26 Apr 2017, at 21:42, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> have you read
>>
>> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself
>>
>> On Wed, Apr 26, 2017 at 1:17 PM, Dominik Safaric
>> <dominiksafa...@gmail.com> wrote:
>>> The reason I want to obtain this information, i.e. <partition, offset,
>>> timestamp> tuples, is to relate the consumption rate to the production
>>> rate using Kafka’s internal __consumer_offsets topic. Interestingly,
>>> Spark’s KafkaConsumer implementation does not auto-commit the offsets
>>> when the commit interval expires, because, as seen in the logs, Spark
>>> overrides the enable.auto.commit property to false.
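>>>
>>> For context, the stream is configured roughly as follows (a sketch; the
>>> broker address and group id are placeholders):
>>>
>>> import org.apache.kafka.common.serialization.StringDeserializer
>>>
>>> val kafkaParams = Map[String, Object](
>>>   "bootstrap.servers" -> "localhost:9092",            // placeholder
>>>   "key.deserializer" -> classOf[StringDeserializer],
>>>   "value.deserializer" -> classOf[StringDeserializer],
>>>   "group.id" -> "offset-logging-group",               // placeholder
>>>   "auto.offset.reset" -> "earliest",
>>>   "enable.auto.commit" -> (true: java.lang.Boolean)   // overridden to false by Spark, per the logs
>>> )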
>>>
>>> Any idea on how to enable the KafkaConsumer’s automatic offset commits?
>>> Keep in mind that I do not care about exactly-once semantics, so having
>>> messages replayed is perfectly fine.
>>>
>>>> On 26 Apr 2017, at 19:26, Cody Koeninger <c...@koeninger.org> wrote:
>>>>
>>>> What is it you're actually trying to accomplish?
>>>>
>>>> You can get topic, partition, and offset bounds from an offset range like
>>>>
>>>> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets
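>>>>
>>>> For instance, roughly (a sketch based on that page; assumes a stream
>>>> from createDirectStream):
>>>>
>>>> import org.apache.spark.TaskContext
>>>> import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}
>>>>
>>>> stream.foreachRDD { rdd =>
>>>>   // offset ranges are only available on the RDD produced directly by the stream
>>>>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>>>>   rdd.foreachPartition { _ =>
>>>>     // each partition of the RDD corresponds to one offset range
>>>>     val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
>>>>     println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
>>>>   }
>>>> }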
>>>>
>>>> Timestamp isn't really a meaningful idea for a range of offsets.
>>>>
>>>>
>>>> On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric
>>>> <dominiksafa...@gmail.com> wrote:
>>>>> Hi all,
>>>>>
>>>>> Given that the Spark Streaming direct Kafka consumer tracks offsets for
>>>>> a given Kafka topic and partition internally, with enable.auto.commit
>>>>> set to false, how can I retrieve the offset of each of the consumer’s
>>>>> poll calls using the offset ranges of an RDD? More precisely, the
>>>>> information I seek after each poll call is the tuple <timestamp,
>>>>> offset, partition>.
>>>>>
>>>>> Thanks in advance,
>>>>> Dominik
>>>>>
>>>
>
