[ https://issues.apache.org/jira/browse/STORM-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333413#comment-14333413 ]

ASF GitHub Bot commented on STORM-643:
--------------------------------------

Github user pershyn commented on the pull request:

    https://github.com/apache/storm/pull/405#issuecomment-75567762
  
    Hi Xin,
    
    Thank you for the good find and the patch.
    
    I have applied the proposed patch on a pure storm-0.9.3 branch, deployed and tested it, and indeed it seems to fix the issue by skipping failed offsets.
    But in case there are lots of failed messages with outdated offsets, the spout will slowly ask kafka for each of them and then discard the failed messages one at a time.

    In the logs this looks like the following (some warning records were added for debugging purposes).
    Note that the "new offset" doesn't change.
    
    ```
    2015-02-23T14:19:30.080+0100 s.k.KafkaUtils [WARN] Got fetch request with offset out of range: [58094410182]
    2015-02-23T14:19:30.083+0100 s.k.PartitionManager [WARN] Using new offset: 58094849835
    2015-02-23T14:19:30.083+0100 s.k.PartitionManager [WARN] Removing the failed offset that is out of range: 58094410182
    2015-02-23T14:19:30.186+0100 s.k.KafkaUtils [WARN] Got fetch request with offset out of range: [58094410183]
    2015-02-23T14:19:30.189+0100 s.k.PartitionManager [WARN] Using new offset: 58094849835
    2015-02-23T14:19:30.189+0100 s.k.PartitionManager [WARN] Removing the failed offset that is out of range: 58094410183
    2015-02-23T14:19:30.291+0100 s.k.KafkaUtils [WARN] Got fetch request with offset out of range: [58094411596]
    2015-02-23T14:19:30.293+0100 s.k.PartitionManager [WARN] Using new offset: 58094849835
    2015-02-23T14:19:30.293+0100 s.k.PartitionManager [WARN] Removing the failed offset that is out of range: 58094411596
    2015-02-23T14:19:30.396+0100 s.k.KafkaUtils [WARN] Got fetch request with offset out of range: [58094411837]
    ```
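
    The one-offset-at-a-time behaviour comes from the per-offset handling in the proposed patch (roughly the [FIX] snippet quoted in the issue description below): the catch block removes only the single offset that triggered the OutOfRange error and then returns, so every outdated failed offset costs its own round trip to the broker. A rough sketch of that handling:

    ```
    try {
        msgs = KafkaUtils.fetchMessages(_spoutConfig, _consumer, _partition, offset);
    } catch (UpdateOffsetException e) {
        // ask the broker for a valid offset again
        _emittedToOffset = KafkaUtils.getOffset(_consumer, _spoutConfig.topic, _partition.partition, _spoutConfig);
        LOG.warn("Using new offset: {}", _emittedToOffset);
        // fetch failed, so don't update the metrics
        if (had_failed) {
            // only the one offset that caused this fetch to fail is dropped;
            // the next fill() call retries with the next outdated failed offset
            failed.remove(offset);
        }
        return;
    }
    ```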
    
    So I have added some logic to skip all the outdated failed offsets in one step; see the code snippet below:
    
    ```
    // fix bug [STORM-643]: remove these offsets from the failed list when they are OutOfRange
    if (had_failed) {
        // For the EarliestTime case it is better to discard all failed offsets
        // that are earlier than the actual EarliestTime offset, since they are
        // not available anymore. These calls to the broker API are then saved.

        // For LatestTime it is a question whether we still need to try to reach
        // the failed offsets (they may still be available). But by moving to
        // LatestTime we are discarding messages in the kafka queue anyway.
        // Since the spout is configured that way, assume it is ok for the user
        // to lose information and that the user cares about the newest messages
        // first, so it makes sense not to make an exception for the failed
        // offsets and to discard them as well.

        SortedSet<Long> omitted = failed.headSet(_emittedToOffset);

        // Use the tail, since a SortedSet maintains its elements in ascending order.
        // tailSet() sets a 'range' on the original implementation, so we could not
        // add objects that are out of range to it afterwards. For that reason we
        // copy the tail into a new set, where no range is set.
        failed = new TreeSet<Long>(failed.tailSet(_emittedToOffset));
        LOG.warn("Removing the failed offsets that are out of range: {}", omitted);
    }
    ```
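
    The copy into a new TreeSet matters because headSet()/tailSet() only return range-restricted views of the original set: per the SortedSet contract, adding an element outside a view's range throws an IllegalArgumentException. A minimal, self-contained sketch of that behaviour (the offset values here are made up):

    ```
    import java.util.SortedSet;
    import java.util.TreeSet;

    public class TailSetDemo {
        public static void main(String[] args) {
            SortedSet<Long> failed = new TreeSet<Long>();
            failed.add(100L);
            failed.add(200L);
            failed.add(300L);

            long emittedToOffset = 250L;

            // headSet(250) -> [100, 200]: the outdated offsets to be dropped in one step
            System.out.println("omitted: " + failed.headSet(emittedToOffset));

            // tailSet(250) -> [300], but it is only a view with a lower bound of 250
            SortedSet<Long> tailView = failed.tailSet(emittedToOffset);
            try {
                tailView.add(120L); // below the view's range
            } catch (IllegalArgumentException e) {
                System.out.println("the view rejects out-of-range adds");
            }

            // copying the view into a fresh TreeSet removes the range restriction,
            // so offsets that fail later (and may be smaller) can still be added
            SortedSet<Long> newFailed = new TreeSet<Long>(failed.tailSet(emittedToOffset));
            newFailed.add(120L); // fine on the copy
            System.out.println("new failed set: " + newFailed);
        }
    }
    ```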
    
    With this change the outdated offsets are skipped all at once, and we save several hundred calls to the kafka API:
    
    ```
    2015-02-23T15:07:21.913+0100 s.k.KafkaUtils [WARN] Got fetch request with offset out of range: [8786892024]
    2015-02-23T15:07:21.915+0100 s.k.PartitionManager [WARN] Using new offset: 8789372723
    2015-02-23T15:07:21.916+0100 s.k.PartitionManager [WARN] Removing the failed offsets that are out of range: [8786892024, 8786892114, 8786892125, 8786892127, 8786892170, 8786892207, 8786892217, 8786892317, 8786892405, 8786892444, 8786892453, 8786892469, 8786892478, 8786892549, 8786892614, 8786892667, 8786892918, /* ... some omitted ... */ 8786977572, 8786977944, 8786986501, 8786991794, 8786991797, 8786994497, 8787001536]
    2015-02-23T15:07:32.759+0100 s.k.ZkCoordinator [INFO] Task [7/8] Refreshing partition manager connections
    ```
    
    If you agree with this improvement, I would be happy if you could extend the pull request with this logic.
    
    Best regards and thanks,
    Michael


> KafkaUtils repeat fetch messages which offset is out of range
> -------------------------------------------------------------
>
>                 Key: STORM-643
>                 URL: https://issues.apache.org/jira/browse/STORM-643
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-kafka
>    Affects Versions: 0.9.2-incubating, 0.9.3
>            Reporter: Xin Wang
>            Assignee: Xin Wang
>            Priority: Critical
>
> KafkaUtils repeatedly fetches messages whose offset is out of range.
> This happens when the failed list (SortedSet<Long> failed) is not empty and some offset in it is OutOfRange.
> [worker-log]
> 2015-02-01 10:24:27.231+0800 s.k.KafkaUtils [WARN] Got fetch request with offset out of range: [20919071816]; retrying with default start offset time from configuration. configured start offset time: [-2]
> 2015-02-01 10:24:27.232+0800 s.k.PartitionManager [WARN] Using new offset: 20996130717
> 2015-02-01 10:24:27.333+0800 s.k.KafkaUtils [WARN] Got fetch request with offset out of range: [20919071816]; retrying with default start offset time from configuration. configured start offset time: [-2]
> 2015-02-01 10:24:27.334+0800 s.k.PartitionManager [WARN] Using new offset: 20996130717
> ...
> [FIX]
> storm.kafka.PartitionManager.fill():
> ...
> try {
>     msgs = KafkaUtils.fetchMessages(_spoutConfig, _consumer, _partition, offset);
> } catch (UpdateOffsetException e) {
>     _emittedToOffset = KafkaUtils.getOffset(_consumer, _spoutConfig.topic, _partition.partition, _spoutConfig);
>     LOG.warn("Using new offset: {}", _emittedToOffset);
>     // fetch failed, so don't update the metrics
>     // fix bug: remove this offset from the failed list when it is OutOfRange
>     if (had_failed) {
>         failed.remove(offset);
>     }
>     return;
> }
> ...
> Also: the log message "retrying with default start offset time from configuration. configured start offset time: [-2]" is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
