[ 
https://issues.apache.org/jira/browse/STORM-399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056799#comment-14056799
 ] 

ASF GitHub Bot commented on STORM-399:
--------------------------------------

GitHub user curtisallen opened a pull request:

    https://github.com/apache/incubator-storm/pull/183

    STORM-399 update KafkaConfig.maxOffsetBehind default to be Long.MAX_VALUE

    
    I've recently upgraded to storm and storm-kafka `0.9.2-incubating`, 
replacing the https://github.com/wurstmeister/storm-kafka-0.8-plus spout I was 
using previously.
      
    I have a large kafka log that I needed processed. I started my topology 
with 
    ```
    storm.kafka.SpoutConfig spoutConfig = new SpoutConfig....
    spoutConfig.forceFromStart = true;
    ```
    I then needed to make some tweaks in my application code and restarted the 
topology with spoutConfig.forceFromStart = false. Expecting to pick up where I 
left off in my kafka log. Instead the kafka spout started from the latest 
offset. Upon investigation I found this log message in my storm worker logs
    ```
    2014-07-09 18:02:15 s.k.PartitionManager [INFO] Read last commit offset 
from zookeeper: 15266940; old topology_id: ef3f1f89-f64c-4947-b6eb-0c7fb9adb9ea 
- new topology_id: 5747dba6-c947-4c4f-af4a-4f50a84817bf
    2014-07-09 18:02:15 s.k.PartitionManager [INFO] Last commit offset from 
zookeeper: 15266940
    2014-07-09 18:02:15 s.k.PartitionManager [INFO] Commit offset 22092614 is 
more than 100000 behind, resetting to startOffsetTime=-2
    2014-07-09 18:02:15 s.k.PartitionManager [INFO] Starting Kafka 
prd-use1c-pr-08-kafka-kamq-0004:4 from offset 22092614
    ```
    
    Digging in the storm-kafka spout I found this line
     
https://github.com/apache/incubator-storm/blob/v0.9.2-incubating/external/storm-kafka/src/jvm/storm/kafka/PartitionManager.java#L95
    
    To fix this problem I ended up setting my spout config like so
    ```
    spoutConf.maxOffsetBehind = Long.MAX_VALUE; 
    ```
    Now finally to my question.
    
    Why would the kafka spout skip to the latest offset if the current offset 
is more then 100000 behind by default?
    
    This seems like a bad default value, the spout literally skipped over 
months of data without any warning.   
    
    This pull request sets the default value to `Long.MAX_VALUE`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/curtisallen/incubator-storm 
STORM-399-kafka-spout-increase-default

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-storm/pull/183.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #183
    
----
commit 2b1d6cfe9b3d567234246b2e01bfb02b64131302
Author: Curtis Allen <[email protected]>
Date:   2014-07-09T21:43:14Z

    STORM-399 update KafkaConfig.maxOffsetBehind default to be Long.MAX_VALUE

----


> Kafka Spout defaulting to latest offset when current offset is older then 100k
> ------------------------------------------------------------------------------
>
>                 Key: STORM-399
>                 URL: https://issues.apache.org/jira/browse/STORM-399
>             Project: Apache Storm (Incubating)
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating
>            Reporter: Curtis Allen
>            Priority: Minor
>
> Using storm and storm-kafka 0.9.2-incubating
> In the storm kafka spout the default for maxOffsetBehind is 100000
> see 
> https://github.com/apache/incubator-storm/blob/v0.9.2-incubating/external/storm-kafka/src/jvm/storm/kafka/KafkaConfig.java#L38
> This default is too low and causes the kafka spout to start from the latest 
> offset instead of the last committed offset without warning.
> see 
> https://github.com/apache/incubator-storm/blob/v0.9.2-incubating/external/storm-kafka/src/jvm/storm/kafka/PartitionManager.java#L95
> Producing the following log output from the storm worker processes
> 2014-07-09 18:02:15 s.k.PartitionManager [INFO] Read last commit
> offset from zookeeper: 15266940; old topology_id:
> ef3f1f89-f64c-4947-b6eb-0c7fb9adb9ea - new topology_id:
> 5747dba6-c947-4c4f-af4a-4f50a84817bf
> 2014-07-09 18:02:15 s.k.PartitionManager [INFO] Last commit offset
> from zookeeper: 15266940
> 2014-07-09 18:02:15 s.k.PartitionManager [INFO] Commit offset 22092614
> is more than 100000 behind, resetting to startOffsetTime=-2
> 2014-07-09 18:02:15 s.k.PartitionManager [INFO] Starting Kafka
> prd-use1c-pr-08-kafka-kamq-0004:4 from offset 22092614
> To fix this problem I ended up setting spout config in my topology like so
> spoutConf.maxOffsetBehind = Long.MAX_VALUE;
> Why would the kafka spout skip to the latest offset if the current offset
> is more then 100000 behind by default?
> This seems like a bad default value, the spout literally skipped over
> months of data without any warning.
> Are the core contributors open to accepting a pull request that would set
> the default to Long.MAX_VALUE?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to