[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484221#comment-14484221 ]

Guozhang Wang commented on KAFKA-1461:
--------------------------------------

Thanks for the patch, committed to trunk.

> Replica fetcher thread does not implement any back-off behavior
> ---------------------------------------------------------------
>
>                 Key: KAFKA-1461
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1461
>             Project: Kafka
>          Issue Type: Improvement
>          Components: replication
>    Affects Versions: 0.8.1.1
>            Reporter: Sam Meder
>            Assignee: Sriharsha Chintalapani
>              Labels: newbie++
>             Fix For: 0.8.3
>
>         Attachments: KAFKA-1461.patch, KAFKA-1461.patch,
> KAFKA-1461_2015-03-11_10:41:26.patch, KAFKA-1461_2015-03-11_18:17:51.patch,
> KAFKA-1461_2015-03-12_13:54:51.patch, KAFKA-1461_2015-03-17_16:03:33.patch,
> KAFKA-1461_2015-03-27_15:31:11.patch, KAFKA-1461_2015-03-27_16:56:45.patch,
> KAFKA-1461_2015-03-27_17:02:32.patch, KAFKA-1461_2015-04-03_20:48:34.patch,
> KAFKA-1461_2015-04-07_08:41:18.patch
>
>
> The current replica fetcher thread will retry in a tight loop if any error
> occurs during the fetch call. For example, we've seen cases where the fetch
> continuously throws a connection refused exception leading to several replica
> fetcher threads that spin in a pretty tight loop.
> To a much lesser degree this is also an issue in the consumer fetcher thread,
> although the fact that erroring partitions are removed so a leader can be
> re-discovered helps some.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
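The tight retry loop described in the issue can be illustrated with a small Java sketch. This is not the committed patch: the class BackoffRetry, the method callWithBackoff, and the maxAttempts limit are hypothetical names introduced only for illustration. It shows the general idea of pausing for a fixed back-off after a failed call instead of retrying immediately.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch: pause between failed attempts instead of spinning. */
final class BackoffRetry {
    private BackoffRetry() {}

    /**
     * Calls fetch until it succeeds or maxAttempts is reached, sleeping
     * backoffMs after each failure rather than retrying right away.
     * All parameter names are assumptions, not Kafka's API.
     */
    static <T> T callWithBackoff(Callable<T> fetch, long backoffMs, int maxAttempts)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread has been asked to stop.
                throw e;
            } catch (Exception e) {
                lastError = e;
                if (attempt < maxAttempts) {
                    // The back-off: without this sleep the loop would spin.
                    Thread.sleep(backoffMs);
                }
            }
        }
        throw lastError;
    }
}
```

With a call that keeps failing with a connection refused exception, the loop above makes at most one attempt per backoffMs interval, which bounds the CPU and log volume a failing fetcher can generate.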
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483386#comment-14483386 ]

Sriharsha Chintalapani commented on KAFKA-1461:
-----------------------------------------------

Updated reviewboard https://reviews.apache.org/r/31366/diff/
 against branch origin/trunk
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395511#comment-14395511 ]

Sriharsha Chintalapani commented on KAFKA-1461:
-----------------------------------------------

Updated reviewboard https://reviews.apache.org/r/31366/diff/
 against branch origin/trunk
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384932#comment-14384932 ]

Sriharsha Chintalapani commented on KAFKA-1461:
-----------------------------------------------

Updated reviewboard https://reviews.apache.org/r/31366/diff/
 against branch origin/trunk
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384922#comment-14384922 ]

Sriharsha Chintalapani commented on KAFKA-1461:
-----------------------------------------------

Updated reviewboard https://reviews.apache.org/r/31366/diff/
 against branch origin/trunk
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384813#comment-14384813 ]

Sriharsha Chintalapani commented on KAFKA-1461:
-----------------------------------------------

[~guozhang] addressed your last review. Please take a look. Thanks.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384810#comment-14384810 ]

Sriharsha Chintalapani commented on KAFKA-1461:
-----------------------------------------------

Updated reviewboard https://reviews.apache.org/r/31366/diff/
 against branch origin/trunk
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379162#comment-14379162 ]

Sriharsha Chintalapani commented on KAFKA-1461:
-----------------------------------------------

[~guozhang] Thanks for the review. Can you please take a look at my reply to your comment?
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378397#comment-14378397 ]

Guozhang Wang commented on KAFKA-1461:
--------------------------------------

Sorry for the delay, I will take a look at 31366 today.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366286#comment-14366286 ]

Sriharsha Chintalapani commented on KAFKA-1461:
-----------------------------------------------

[~guozhang] I updated the PR per your review suggestions. Please take a look when you get a chance.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366262#comment-14366262 ]

Sriharsha Chintalapani commented on KAFKA-1461:
-----------------------------------------------

Updated reviewboard https://reviews.apache.org/r/31366/diff/
 against branch origin/trunk
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359469#comment-14359469 ]

Sriharsha Chintalapani commented on KAFKA-1461:
-----------------------------------------------

Thanks, [~junrao]. I'll incorporate your and [~guozhang]'s feedback for RB 31366. Yes, we can add that condition at partitionMapCond.await(200L) and use fetchBackoffMs there. I'll send an updated PR for it.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359427#comment-14359427 ]

Jun Rao commented on KAFKA-1461:
--------------------------------

Also, in AbstractFetcherThread, we probably should use the configured backoff time, instead of the constant below.

    partitionMapCond.await(200L, TimeUnit.MILLISECONDS)
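The suggestion above, replacing the hard-coded 200 ms wait with the configured back-off, might look roughly like the following Java sketch. It is not the code in AbstractFetcherThread: the class PartitionMapWaiter, its methods, and the partitionCount field are hypothetical, and only the names partitionMapCond and fetchBackoffMs are taken from this thread. Condition.await(long, TimeUnit) is the standard java.util.concurrent call.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch: wait on a condition for a configurable back-off. */
final class PartitionMapWaiter {
    private final ReentrantLock partitionMapLock = new ReentrantLock();
    private final Condition partitionMapCond = partitionMapLock.newCondition();
    private final long fetchBackoffMs; // configured value instead of a 200L constant
    private int partitionCount = 0;    // stand-in for the real partition map

    PartitionMapWaiter(long fetchBackoffMs) {
        this.fetchBackoffMs = fetchBackoffMs;
    }

    /**
     * If no partitions are assigned, waits up to fetchBackoffMs for one to be
     * added. Returns true when at least one partition is present afterwards.
     */
    boolean awaitPartitions() throws InterruptedException {
        partitionMapLock.lock();
        try {
            if (partitionCount == 0) {
                partitionMapCond.await(fetchBackoffMs, TimeUnit.MILLISECONDS);
            }
            return partitionCount > 0;
        } finally {
            partitionMapLock.unlock();
        }
    }

    /** Registers a partition and wakes any thread waiting in awaitPartitions. */
    void addPartition() {
        partitionMapLock.lock();
        try {
            partitionCount++;
            partitionMapCond.signalAll();
        } finally {
            partitionMapLock.unlock();
        }
    }
}
```

Waiting on the condition rather than sleeping means the thread wakes as soon as a partition is added, while the timeout still caps how long an idle thread waits before checking again.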
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359402#comment-14359402 ]

Jun Rao commented on KAFKA-1461:
--------------------------------

[~sriharsha], for the changes in RB 31366, we need to think through the case when all partitions are inactive. Presumably when that happens, we need to back off the fetching a bit.
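One way to picture the "all partitions inactive" case is the Java sketch below. It is only an illustration under assumptions: the thread does not show how RB 31366 tracks partition state, so the map of partition name to an active flag, the class InactivePartitionBackoff, and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: back off when no partition is currently fetchable. */
final class InactivePartitionBackoff {
    private InactivePartitionBackoff() {}

    /** Returns the names of partitions whose (assumed) active flag is true. */
    static List<String> activePartitions(Map<String, Boolean> partitionActive) {
        List<String> active = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : partitionActive.entrySet()) {
            if (Boolean.TRUE.equals(entry.getValue())) {
                active.add(entry.getKey());
            }
        }
        return active;
    }

    /**
     * Returns how long the fetcher should pause before its next attempt:
     * zero when at least one partition is active, otherwise fetchBackoffMs,
     * so that a fetcher with nothing to fetch does not spin.
     */
    static long pauseBeforeNextFetchMs(Map<String, Boolean> partitionActive,
                                       long fetchBackoffMs) {
        return activePartitions(partitionActive).isEmpty() ? fetchBackoffMs : 0L;
    }
}
```

A caller would build its fetch request from activePartitions and sleep or wait for the returned pause when that list is empty.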
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359394#comment-14359394 ]

Jun Rao commented on KAFKA-1461:
--------------------------------

[~sriharsha], thanks for the patch. +1 on RB 31927 after fixing the issue in my last comment. Committed to trunk. Now, unit test times went down to 8.5 mins from 12 mins.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359384#comment-14359384 ]

Sriharsha Chintalapani commented on KAFKA-1461:
-----------------------------------------------

Updated reviewboard https://reviews.apache.org/r/31927/diff/
 against branch origin/trunk
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359368#comment-14359368 ]

Jun Rao commented on KAFKA-1461:
--------------------------------

Hmm, I don't think we removed or changed any config. Which ones are you referring to?
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359365#comment-14359365 ]

Joe Stein commented on KAFKA-1461:
----------------------------------

Ignore last comment, deleted... wrong JIRA :)
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359359#comment-14359359 ]

Joe Stein commented on KAFKA-1461:
----------------------------------

Here is my reasoning. Say you are an operations person, and in the next release we tell folks to read the KIPs to learn and understand the changes that affect them (yada yada language for the release), and something like this isn't in there. We are changing the behavior of an existing config and removing another. That makes the communication of behavior changes for the release incongruent. So, while I agree we don't need a KIP for this, the reason I brought it up was the release perspective: what ops folks are going to be looking at when we get there.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359316#comment-14359316 ]

Jun Rao commented on KAFKA-1461:
--------------------------------

[~charmalloc], I don't think we need a KIP for this either. What other changes require this fix?
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358056#comment-14358056 ]

Joe Stein commented on KAFKA-1461:
----------------------------------

I personally think it is overkill, but I bring it up because it seems to be required for other changes, so I am just asking a question. If we are using KIPs to help folks understand the reasons behind changes, then we should either do that completely or not at all.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358040#comment-14358040 ] Sriharsha Chintalapani commented on KAFKA-1461: [~charmalloc] since there aren't any interface changes, I am not sure a KIP is necessary. Of course, we did add a new config, replica.fetch.backoff.ms. If this warrants a KIP, then I can write one up.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358018#comment-14358018 ] Joe Stein commented on KAFKA-1461: Shouldn't there be a KIP for this?
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357927#comment-14357927 ] Sriharsha Chintalapani commented on KAFKA-1461: Updated reviewboard https://reviews.apache.org/r/31927/diff/ against branch origin/trunk
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357668#comment-14357668 ] Jun Rao commented on KAFKA-1461: [~sriharsha] and [~guozhang], thinking about this a bit more. There are really two types of state that we need to manage in AbstractFetcherThread. The first one is the connection state, i.e., if a connection breaks, we want to back off the reconnection. The second one is the partition state, i.e., if a partition hits an exception, we want to back off that particular partition a bit. The first one is what [~sriharsha]'s current RB is addressing. How about we complete this first, since it affects the performance of the unit tests? Once that's committed, we can address the second one, which is in [~sriharsha]'s initial patch.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357263#comment-14357263 ] Sriharsha Chintalapani commented on KAFKA-1461: Updated reviewboard https://reviews.apache.org/r/31927/diff/ against branch origin/trunk
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356344#comment-14356344 ] Sriharsha Chintalapani commented on KAFKA-1461: [~junrao] [~guozhang] please take a look at the above patch. Let me know if that's what you have in mind. I also added "replica.fetch.backoff.ms" and "controller.socket.timeout.ms" to TestUtils.createBrokerConfig; this reduced the total test run time from 15 minutes to under 10 minutes on my machine.
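As a rough illustration of the test-config change mentioned in this comment, the two settings could be applied to a broker Properties object as sketched below. The object name `TestBrokerTimeouts` and both values are placeholders chosen for this example; the thread does not quote the values used in the patch, and this is not the actual body of TestUtils.createBrokerConfig.

```scala
import java.util.Properties

// Hypothetical helper, not the real TestUtils.createBrokerConfig.
object TestBrokerTimeouts {
  // Placeholder values for illustration only; the patch's real values are not shown in this thread.
  def applyTo(props: Properties): Properties = {
    // Short backoff so tests do not wait long after a failed replica fetch.
    props.setProperty("replica.fetch.backoff.ms", "100")
    // Short controller socket timeout so tests fail over quickly.
    props.setProperty("controller.socket.timeout.ms", "1500")
    props
  }
}
```

Keeping both timeouts small in tests is presumably what shortens the run: a test broker that goes away no longer makes its peers wait on production-sized defaults.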
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356341#comment-14356341 ] Sriharsha Chintalapani commented on KAFKA-1461: Created reviewboard https://reviews.apache.org/r/31927/diff/ against branch origin/trunk
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355301#comment-14355301 ] Jun Rao commented on KAFKA-1461: [~guozhang], my concern is with the implementation of DelayedItem. If you create a bunch of DelayedItems with the same timeout, they may time out at slightly different moments, since the calculation depends on the current time, which can change. In the second case, when the leaders are moved one at a time, what's going to happen is that the controller will tell the broker to move to the right leader right away. This typically happens within a few milliseconds. We could optimize this case, but I am not sure it's worth the extra complexity in the code. In the first case, the remaining shutdown process could take seconds after the socket server is shut down, so backing off will definitely help. Perhaps we can just do a simple experiment with controlled shutdown and see how serious the issue is without backing off.
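The timing concern in this comment can be illustrated with a small sketch. `SimpleDelayedItem` below is a hypothetical, simplified stand-in, not Kafka's DelayedItem; it only assumes that each item fixes its deadline from the clock reading taken at creation. The timestamps are made-up values.

```scala
// Hypothetical, simplified delayed item: its deadline is derived from the
// clock reading taken when the item was created.
final case class SimpleDelayedItem(createdMs: Long, delayMs: Long) {
  def dueMs: Long = createdMs + delayMs
  def isDue(nowMs: Long): Boolean = nowMs >= dueMs
}

object DelaySkew {
  val backoffMs = 1000L

  // Three partitions hit an error a few milliseconds apart (made-up times),
  // each with the same configured backoff.
  val perPartition: Seq[SimpleDelayedItem] =
    Seq(0L, 3L, 7L).map(created => SimpleDelayedItem(created, backoffMs))

  // Number of per-partition items whose backoff has expired at nowMs.
  def dueCount(nowMs: Long): Int = perPartition.count(_.isDue(nowMs))
}
```

With these numbers the three items come due at slightly different instants, so a fetcher driven by per-partition deadlines would wake up several times in quick succession instead of pausing once. A single deadline shared by the whole fetcher thread avoids that skew.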
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355161#comment-14355161 ] Guozhang Wang commented on KAFKA-1461: [~junrao] Could you elaborate a bit on "different partitions become active at slightly different times and the fetcher doesn't actually back off"? I am not sure I understand why the fetcher does not actually back off. I agree that when an IOException is thrown in SimpleConsumer.fetch, we should back off the thread as a whole for common case #1 you mentioned above; but at the same time we should still consider backing off for partition-specific error codes, as otherwise the broker logs will be polluted with error messages from continuous retries, as we have seen before. Do you agree?
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354272#comment-14354272 ] Jun Rao commented on KAFKA-1461: [~sriharsha], thanks. Your understanding is correct. We can probably just expose an awaitShutdown(timeout) method in ShutdownableThread and call it in AbstractFetcherThread.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354139#comment-14354139 ] Sriharsha Chintalapani commented on KAFKA-1461: [~junrao] I'll work on this tomorrow and finish up the patch. One question on your recommendation "In AbstractFetcherThread, simply backoff based on the configured time, if it hits an exception when doing a fetch.": so instead of handling partition errors individually, if there is an exception while fetching we will just back off the whole AbstractFetcherThread and wait until the configured time has elapsed or a condition is met?
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354081#comment-14354081 ] Jun Rao commented on KAFKA-1461: [~sriharsha], since this affects the unit test performance (KAFKA-2010), would you have time to work on this in the next day or two? If not, I can pick it up.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354068#comment-14354068 ] Jun Rao commented on KAFKA-1461: [~sriharsha], thanks for the patch. Managing the backoff per partition is a bit more complicated than I was expecting.

The most common case that we want to handle here is that the fetcher is trying to fetch from a broker that's already down. In this case, the simplest approach is to just back off the fetcher (for all partitions) a bit. Another common case is that we are doing a controlled shutdown by moving the leaders off a broker one at a time. The fetcher may get a NotLeader error code for some partitions. In this case, it's less critical to remove those partitions from the fetcher, since they will be removed quickly by the leaderAndIsrRequests from the controller.

My concern with managing the backoff at the partition level is that if the backoff is out of sync among the partitions, it may happen that different partitions become active at slightly different times and the fetcher doesn't actually back off. Also, the code becomes more complicated.

So, my recommendation is the following.
(1) Add the backoff config for the replica fetcher.
(2) In AbstractFetcherThread, simply backoff based on the configured time, if it hits an exception when doing a fetch.
(3) In order to shut down AbstractFetcherThread quickly, the backoff can be implemented by waiting on a new condition. We will signal that new condition during the shutdown.
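Recommendations (2) and (3) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the patch: `BackoffWorker` and its `fetch` callback are hypothetical stand-ins for AbstractFetcherThread and its fetch call, and only standard java.util.concurrent classes are used.

```scala
import java.io.IOException
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.locks.ReentrantLock

// Hypothetical stand-in for a fetcher thread; not Kafka's AbstractFetcherThread.
class BackoffWorker(backoffMs: Long, fetch: () => Unit) extends Thread("backoff-worker") {
  private val running = new AtomicBoolean(true)
  private val lock = new ReentrantLock()
  private val shutdownCond = lock.newCondition()

  override def run(): Unit = {
    while (running.get) {
      try fetch()
      catch {
        // Any fetch failure backs off the whole thread, i.e. all partitions.
        case _: IOException => backoff()
      }
    }
  }

  // Wait up to backoffMs, returning early if shutdown() signals the condition.
  private def backoff(): Unit = {
    lock.lock()
    try {
      // Re-check under the lock so a shutdown that already happened is not missed.
      if (running.get) shutdownCond.await(backoffMs, TimeUnit.MILLISECONDS)
    } finally lock.unlock()
  }

  def shutdown(): Unit = {
    running.set(false)
    lock.lock()
    try shutdownCond.signalAll()
    finally lock.unlock()
  }
}
```

Waiting on a condition rather than calling Thread.sleep means shutdown does not have to sit out the remainder of the backoff interval. A spurious wakeup only causes an earlier retry, which seems acceptable for this purpose.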
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335166#comment-14335166 ] Sriharsha Chintalapani commented on KAFKA-1461: [~guozhang] thanks for the pointers. Can you please take a look at the patch when you get a chance.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335164#comment-14335164 ] Sriharsha Chintalapani commented on KAFKA-1461: Created reviewboard https://reviews.apache.org/r/31366/diff/ against branch origin/trunk
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320746#comment-14320746 ] Idcmp commented on KAFKA-1461: This issue can be tickled on a multi-broker configuration by having brokers advertise host names that do not exist.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290398#comment-14290398 ] Guozhang Wang commented on KAFKA-1461: [~sriharsha] Sorry for the late reply. This fix looks good to me overall, except that we cannot add partitions back only in the handlePartitionsWithErrors() call, since it will only be triggered when the next error happens. We can probably move this piece of code to processPartitionData().

Another way to do this could be:
1. Make the partitionMap in AbstractFetcherThread a map from TopicAndPartition to OffsetAndState, where OffsetAndState contains the Offset (Long) and the State (active, inactive-with-delay). For simplicity we can just use an Int here: "active" would be 0, and inactive would be the delay time.
2. Add another function called "delayPartitions" in AbstractFetcherThread, which sets State to inactive with the delay time.
3. In AbstractFetcherThread doWork(), only include partitions with State 0 in the fetch request, and also update the state values for non-zero partitions.
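Steps 1 to 3 above might look roughly like the sketch below. Everything here is a self-contained illustration: the `TopicAndPartition` and `OffsetAndState` case classes are simplified stand-ins defined for this example, `PartitionStates` and `fetchablePartitions` are hypothetical names, and the way the remaining delay is decremented on each pass is an assumption, since the comment does not say how the state values should be updated.

```scala
import scala.collection.mutable

// Simplified stand-ins defined for this sketch; Kafka's real classes differ.
case class TopicAndPartition(topic: String, partition: Int)

// Step 1: delayMs == 0 means active, any other value is the remaining delay.
case class OffsetAndState(offset: Long, delayMs: Int) {
  def isActive: Boolean = delayMs == 0
}

class PartitionStates {
  private val partitionMap = mutable.HashMap.empty[TopicAndPartition, OffsetAndState]

  def addPartition(tp: TopicAndPartition, offset: Long): Unit =
    partitionMap.put(tp, OffsetAndState(offset, 0))

  // Step 2: mark the given partitions inactive for delayMs.
  def delayPartitions(partitions: Iterable[TopicAndPartition], delayMs: Int): Unit =
    for (tp <- partitions; state <- partitionMap.get(tp))
      partitionMap.put(tp, state.copy(delayMs = delayMs))

  // Step 3: called once per doWork() pass. Returns the offsets of active
  // partitions, then counts down the delay of inactive ones by elapsedMs
  // (assumed here to be the time since the previous pass).
  def fetchablePartitions(elapsedMs: Int): Map[TopicAndPartition, Long] = {
    val active = partitionMap.collect { case (tp, s) if s.isActive => tp -> s.offset }.toMap
    for ((tp, s) <- partitionMap.toList if !s.isActive)
      partitionMap.put(tp, s.copy(delayMs = math.max(0, s.delayMs - elapsedMs)))
    active
  }
}
```

Because the countdown happens inside the regular work loop, a delayed partition becomes fetchable again without waiting for another error to trigger handlePartitionsWithErrors().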
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278895#comment-14278895 ] Sriharsha Chintalapani commented on KAFKA-1461: [~guozhang] I had the following code in mind for backing off retries in case of any error. This code would go under ReplicaFetcherThread.handlePartitionsWithErrors. I am thinking of maintaining two maps in ReplicaFetcherThread, one for the offset and one for the timestamp. Partitions are removed from AbstractFetcherThread.partitionMap and added back once currentTime > partitionsWithErrorMap.timestamp + replicaFetcherRetryBackoffMs. I am not quite sure about maintaining these two maps. If it looks OK to you, I'll send a patch; if you have another approach, please let me know.

```scala
import scala.collection.mutable

// Fragment of ReplicaFetcherThread: state for partitions that are backing off.
private val partitionsWithErrorStandbyMap = new mutable.HashMap[TopicAndPartition, Long] // (topic, partition) -> offset
private val partitionsWithErrorMap = new mutable.HashMap[TopicAndPartition, Long] // (topic, partition) -> timestamp

def handlePartitionsWithErrors(partitions: Iterable[TopicAndPartition]): Unit = {
  // Add to partitionsWithErrorMap with the current time, remembering the offset.
  for (partition <- partitions) {
    if (!partitionsWithErrorMap.contains(partition)) {
      partitionsWithErrorMap.put(partition, System.currentTimeMillis())
      currentOffset(partition) match {
        case Some(offset) => partitionsWithErrorStandbyMap.put(partition, offset)
        case None => // no known offset, so nothing to restore later
      }
    }
  }
  removePartitions(partitions.toSet)

  // Process partitionsWithErrorMap and add partitions back if the backoff time has elapsed.
  val now = System.currentTimeMillis()
  val expired = partitionsWithErrorMap.collect {
    case (topicAndPartition, timeMs) if now > timeMs + brokerConfig.replicaFetcherRetryBackoffMs => topicAndPartition
  }.toList
  val partitionsToBeAdded = new mutable.HashMap[TopicAndPartition, Long]
  for (topicAndPartition <- expired) {
    // Drop both entries so that a later error starts a fresh backoff.
    partitionsWithErrorMap.remove(topicAndPartition)
    partitionsWithErrorStandbyMap.remove(topicAndPartition).foreach { offset =>
      partitionsToBeAdded.put(topicAndPartition, offset)
    }
  }
  addPartitions(partitionsToBeAdded)
}
```
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226840#comment-14226840 ] Sriharsha Chintalapani commented on KAFKA-1461: [~charmalloc] [~nmarasoiu] I can take this. I am looking at this code for another JIRA.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225799#comment-14225799 ] Nicolae Marasoiu commented on KAFKA-1461: I agree to hand this over to someone else; I have not made progress on it yet. Thank you. On Tuesday, 25 November 2014, Joe Stein (JIRA) wrote:
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225166#comment-14225166 ] Joe Stein commented on KAFKA-1461: -- [~n...@museglobal.ro] are you working on this patch? If not, can we set it to unassigned so that someone who wants to can jump in and fix it? It sure is annoying when it happens (for example, while waiting on "Recovering unflushed segment"): during that time, every replica fetching from it spews the error "ERROR kafka.server.ReplicaFetcherThread".
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137407#comment-14137407 ] Guozhang Wang commented on KAFKA-1461: -- I did not realize this ticket existed and created a duplicate (KAFKA-1629). That ticket has a somewhat more detailed explanation of the issue, though.
[jira] [Commented] (KAFKA-1461) Replica fetcher thread does not implement any back-off behavior
[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137181#comment-14137181 ] Nicolae Marasoiu commented on KAFKA-1461: -

Hi,

So I guess in this block:

    try {
      trace("Issuing to broker %d of fetch request %s".format(sourceBroker.id, fetchRequest))
      response = simpleConsumer.fetch(fetchRequest)
    } catch {
      case t: Throwable =>
        if (isRunning.get) {
          warn("Error in fetch %s. Possible cause: %s".format(fetchRequest, t.toString))
          partitionMapLock synchronized {
            partitionsWithError ++= partitionMap.keys
          }
        }
    }

I should add a case for the specific scenario of connection timeout/refused/reset and introduce a backoff on that path?
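
The question in the comment above can be made concrete with a small sketch. This is not the patch attached to the ticket, only an illustration of the idea under stated assumptions: the names FetchBackoffSketch, fetchBackOffMs, isConnectionError and fetchOnce are hypothetical, the 100 ms delay is an arbitrary placeholder, and the fetch call is passed in as a plain by-name argument instead of going through simpleConsumer. The sleep function is a parameter so that the behaviour can be checked without actually waiting.

```scala
import java.net.{SocketException, SocketTimeoutException}
import java.util.concurrent.atomic.AtomicBoolean

object FetchBackoffSketch {
  // Hypothetical back-off delay; the value is an assumption, not a Kafka default.
  val fetchBackOffMs: Long = 100L

  // Connection refused and connection reset surface as SocketException
  // (ConnectException is a subclass); timeouts as SocketTimeoutException.
  def isConnectionError(t: Throwable): Boolean = t match {
    case _: SocketException | _: SocketTimeoutException => true
    case _ => false
  }

  // Runs a single fetch attempt. On a connection error, and only while the
  // thread is still meant to be running, it sleeps before returning so that
  // the caller's retry loop no longer spins. Returns None after any error.
  def fetchOnce[T](
      isRunning: AtomicBoolean,
      sleep: Long => Unit = (ms: Long) => Thread.sleep(ms)
  )(fetch: => T): Option[T] =
    try Some(fetch)
    catch {
      case t: Throwable =>
        if (isRunning.get && isConnectionError(t)) sleep(fetchBackOffMs)
        None
    }
}
```

A caller would wrap the existing fetch in it, for example FetchBackoffSketch.fetchOnce(isRunning)(simpleConsumer.fetch(fetchRequest)), and keep the existing error bookkeeping for the None case. Whether to back off only on connection errors, as asked here, or on every fetch error, as the ticket description suggests, is a design choice this sketch leaves open.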