[ 
https://issues.apache.org/jira/browse/SPARK-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069708#comment-14069708
 ] 

Tobias Pfeiffer edited comment on SPARK-2492 at 7/22/14 2:23 AM:
-----------------------------------------------------------------

http://kafka.apache.org/08/configuration.html defines "auto.offset.reset" as 
"What to do when there is no initial offset in Zookeeper or if an offset is out 
of range: [...]". In the 0.7 version of Kafka (cf. 
http://kafka.apache.org/07/configuration.html), the parameter "autooffset.reset" 
(note the missing dot after "auto") lacks this introductory sentence, leading 
to a completely different effect.
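
For reference, the two parameter names differ only by a single dot, which makes them easy to mix up in consumer configuration (a plain java.util.Properties sketch; the values are hypothetical):

```scala
import java.util.Properties

val props = new Properties()
// Kafka 0.8 spells the key "auto.offset.reset"; Kafka 0.7 spelled it
// "autooffset.reset" (no dot after "auto") and documented different semantics.
props.setProperty("auto.offset.reset", "smallest")   // 0.8 name
// props.setProperty("autooffset.reset", "smallest") // 0.7 name
```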

From my understanding, the code in Spark tries to achieve the 0.7-documented 
behavior of "autooffset.reset" and therefore does not comply with the 
0.8-documented behavior of "auto.offset.reset". To avoid confusion, I think 
the parameter "auto.offset.reset" should do exactly what the Kafka 
documentation says: only if there is no offset stored in Zookeeper, or the 
offset stored there is invalid (e.g., too old), should the value of 
"auto.offset.reset" be used. (I don't know which component is responsible 
for that, though.)
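
To make the documented 0.8 semantics concrete, here is a minimal, self-contained Scala sketch (every name here is hypothetical, not an actual Kafka or Spark API): a stored Zookeeper offset wins whenever it exists and is in range, and "auto.offset.reset" is consulted only otherwise.

```scala
// Hypothetical sketch of the 0.8-documented "auto.offset.reset" semantics.
// `stored` models the offset found in Zookeeper (None if absent);
// `smallest`/`largest` model the valid offset range on the broker.
def startingOffset(stored: Option[Long],
                   smallest: Long,
                   largest: Long,
                   reset: String): Long =
  stored match {
    // A stored, in-range offset is always used; the reset policy is ignored.
    case Some(o) if o >= smallest && o <= largest => o
    // No offset, or an out-of-range one (e.g., too old): fall back to policy.
    case _ => if (reset == "smallest") smallest else largest
  }
```

For example, a stored offset of 42 inside the range [0, 100] is used as-is even with reset = "largest", while a stale offset of 5 below a smallest of 10 falls back to the configured policy.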

Having said that, it seems to be a common requirement *not* to use the offset 
stored in Zookeeper (for example, in order not to overload the Spark Streaming 
receiver with a huge number of items on startup). However, since there 
currently seems to be no Kafka configuration parameter that achieves this 
behavior, I would suggest an additional parameter when creating a 
KafkaReceiver in Spark that resets/deletes the offsets in Zookeeper, similar 
to what currently happens. (It should take into account the problem mentioned 
in SPARK-2383, though.)
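
The suggested extra parameter could work roughly as in this self-contained Scala sketch, where Zookeeper's offset store is modeled as a mutable map and every name is hypothetical rather than an existing Spark or Kafka API: deleting the stored offset before the consumer starts guarantees that "auto.offset.reset" takes effect even under its documented 0.8 semantics.

```scala
import scala.collection.mutable

// Hypothetical model of Zookeeper's consumer-offset store,
// keyed by "group/topic/partition".
val zkOffsets = mutable.Map("spark-group/events/0" -> 123456L)

// Sketch of the proposed option: if `resetOffsets` is set when the receiver
// is created, drop the stored offset so that the "auto.offset.reset" policy,
// rather than the stale Zookeeper offset, decides where consumption starts.
def prepareReceiver(group: String, topic: String, partition: Int,
                    resetOffsets: Boolean): Unit =
  if (resetOffsets) zkOffsets.remove(s"$group/$topic/$partition")
```

After calling prepareReceiver("spark-group", "events", 0, resetOffsets = true), the stored offset is gone, so a consumer starting afterwards would necessarily fall back to its "auto.offset.reset" policy.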



> KafkaReceiver minor changes to align with Kafka 0.8 
> ----------------------------------------------------
>
>                 Key: SPARK-2492
>                 URL: https://issues.apache.org/jira/browse/SPARK-2492
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.0.0
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>            Priority: Minor
>             Fix For: 1.1.0
>
>
> Update KafkaReceiver to delete Zookeeper metadata when Kafka's parameter 
> "auto.offset.reset" is set to "smallest", aligning with Kafka 0.8's 
> ConsoleConsumer.
> Also use Kafka offered API without directly using zkClient.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
