[
https://issues.apache.org/jira/browse/SPARK-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069708#comment-14069708
]
Tobias Pfeiffer edited comment on SPARK-2492 at 7/22/14 2:23 AM:
-----------------------------------------------------------------
http://kafka.apache.org/08/configuration.html defines "auto.offset.reset" as
"What to do when there is no initial offset in Zookeeper or if an offset is out
of range: [...]". In the 0.7 version of Kafka, cf.
http://kafka.apache.org/07/configuration.html, the parameter "autooffset.reset"
(note the missing dot after "auto") lacks this introductory sentence, which
results in completely different behavior.
From my understanding, the code in Spark tries to achieve the documented 0.7
behavior of "autooffset.reset" and therefore does not comply with the
documented 0.8 behavior of "auto.offset.reset". To avoid confusion, the
parameter "auto.offset.reset" should do exactly what the Kafka documentation
says: only if there is no offset stored in Zookeeper, or the offset stored
there is invalid (e.g., too old), should the value of "auto.offset.reset" be
used. (I don't know which component is responsible for that, though.)
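To make the distinction concrete, here is a minimal sketch (illustrative only,
not the Kafka API; the function name and parameters are made up) of the
documented 0.8 semantics, where "auto.offset.reset" is consulted only as a
fallback:

```python
def resolve_start_offset(stored_offset, smallest, largest, auto_offset_reset):
    """Model of the documented Kafka 0.8 behavior of auto.offset.reset.

    stored_offset: offset found in Zookeeper, or None if absent.
    smallest/largest: the valid offset range currently held by the broker.
    auto_offset_reset: "smallest" or "largest".
    """
    if stored_offset is not None and smallest <= stored_offset <= largest:
        # A valid offset exists in Zookeeper: use it; the
        # auto.offset.reset setting must not override it.
        return stored_offset
    # No stored offset, or it fell out of range (e.g. log segments expired):
    # only now does auto.offset.reset apply.
    return smallest if auto_offset_reset == "smallest" else largest
```

Under this reading, a consumer with a valid stored offset always resumes from
it, regardless of what "auto.offset.reset" is set to.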
Having said that, it seems to be a common requirement *not* to use the offset
stored in Zookeeper (for example, so that the Spark Streaming receiver is not
overloaded with a huge number of items on startup). However, since there
currently seems to be no Kafka configuration parameter that achieves this
behavior, I would suggest an additional parameter when creating a
KafkaReceiver in Spark that resets/deletes the offsets in Zookeeper, similar
to what currently happens. (It should take the problem mentioned in SPARK-2383
into account, though.)
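The suggested receiver-side option could look roughly like the following
sketch (everything here is hypothetical: the flag name, the helper, and the
dict standing in for a Zookeeper client are illustrative, not Spark or Kafka
API). Deleting the stored offsets forces the consumer into the "no initial
offset" case, so "auto.offset.reset" then decides the starting position:

```python
def reset_group_offsets(zk, group, topics):
    """Delete the consumer group's stored offsets so that the consumer
    falls into the 'no initial offset' case handled by auto.offset.reset.

    zk is modeled as a dict mapping Zookeeper paths to offsets; a real
    implementation would delete the corresponding znodes instead.
    """
    for topic in topics:
        # Kafka 0.8 high-level consumers store offsets under
        # /consumers/<group>/offsets/<topic>/<partition>.
        path = "/consumers/%s/offsets/%s" % (group, topic)
        zk.pop(path, None)  # no-op if no offset was stored
```

A receiver created with such a reset flag would call this once before
connecting, after which startup behavior is governed entirely by
"auto.offset.reset".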
> KafkaReceiver minor changes to align with Kafka 0.8
> ----------------------------------------------------
>
> Key: SPARK-2492
> URL: https://issues.apache.org/jira/browse/SPARK-2492
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.0.0
> Reporter: Saisai Shao
> Assignee: Saisai Shao
> Priority: Minor
> Fix For: 1.1.0
>
>
> Update to delete Zookeeper metadata when Kafka's parameter
> "auto.offset.reset" is set to "smallest", which is aligned with Kafka 0.8's
> ConsoleConsumer.
> Also use Kafka offered API without directly using zkClient.
--
This message was sent by Atlassian JIRA
(v6.2#6252)