Are you asking for commits for every message? Because that will kill performance.
On Thu, Apr 27, 2017 at 11:33 AM, Dominik Safaric <dominiksafa...@gmail.com> wrote:
> Indeed I have. But even when storing the offsets in Spark and committing
> offsets upon completion of an output operation within the foreachRDD call (as
> shown in the example), the only offset that Spark's Kafka implementation
> commits to Kafka is the offset of the last message. For example, if I have
> 100 million messages, then Spark will commit only the 100 millionth offset,
> not the offsets of the intermediate batches - hence the questions.
>
>> On 26 Apr 2017, at 21:42, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Have you read
>>
>> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself
>>
>> On Wed, Apr 26, 2017 at 1:17 PM, Dominik Safaric
>> <dominiksafa...@gmail.com> wrote:
>>> The reason why I want to obtain this information, i.e. <partition, offset,
>>> timestamp> tuples, is to relate the consumption rate to the production rate
>>> using the __consumer_offsets internal Kafka topic. Interestingly, Spark's
>>> KafkaConsumer implementation does not auto-commit the offsets when the
>>> commit interval expires, because, as seen in the logs, Spark overrides the
>>> enable.auto.commit property to false.
>>>
>>> Any idea how to use the KafkaConsumer's auto offset commits? Keep in
>>> mind that I do not care about exactly-once semantics, so having messages
>>> replayed is perfectly fine.
>>>
>>>> On 26 Apr 2017, at 19:26, Cody Koeninger <c...@koeninger.org> wrote:
>>>>
>>>> What is it you're actually trying to accomplish?
>>>>
>>>> You can get topic, partition, and offset bounds from an offset range like
>>>>
>>>> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets
>>>>
>>>> Timestamp isn't really a meaningful idea for a range of offsets.
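[Editor's note: for reference, the commit pattern behind the #kafka-itself link discussed above looks roughly like the following Scala sketch. It assumes a direct stream named `stream` created via KafkaUtils.createDirectStream inside a running StreamingContext; it commits once per batch, which is why only each batch's ending offsets appear in __consumer_offsets.]

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Sketch only: `stream` is assumed to be a DStream obtained from
// KafkaUtils.createDirectStream with enable.auto.commit=false.
stream.foreachRDD { rdd =>
  // The RDD of a direct stream carries the offset ranges of the batch.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... perform the output operation on rdd here ...

  // Asynchronously commit the batch's ending offsets back to Kafka.
  // This happens once per batch, not once per message.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```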
>>>>
>>>> On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric
>>>> <dominiksafa...@gmail.com> wrote:
>>>>> Hi all,
>>>>>
>>>>> Because the Spark Streaming direct Kafka consumer tracks offsets for a
>>>>> given Kafka topic and partition internally, while having
>>>>> enable.auto.commit set to false, how can I retrieve the offset of each of
>>>>> the consumer's poll calls using the offset ranges of an RDD? More
>>>>> precisely, the information I seek after each poll call is the following:
>>>>> <timestamp, offset, partition>.
>>>>>
>>>>> Thanks in advance,
>>>>> Dominik

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
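[Editor's note: the #obtaining-offsets pattern Cody points to above looks roughly like this Scala sketch. It assumes the same `stream` created via KafkaUtils.createDirectStream; note that an OffsetRange carries only topic, partition, fromOffset, and untilOffset for the batch - there is no per-message timestamp, which is Cody's point.]

```scala
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.HasOffsetRanges

// Sketch only: `stream` is assumed to be a direct stream inside a
// running StreamingContext.
stream.foreachRDD { rdd =>
  // Capture the batch's offset ranges on the driver, before any
  // transformation changes the partitioning.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { _ =>
    // Each task's partition id indexes into the offset ranges.
    val o = offsetRanges(TaskContext.get.partitionId)
    println(s"${o.topic} ${o.partition} offsets ${o.fromOffset} -> ${o.untilOffset}")
  }
}
```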