[jira] [Created] (KAFKA-7909) Coordinator changes cause Connect integration test to fail

2019-02-08 Thread Arjun Satish (JIRA)
Arjun Satish created KAFKA-7909:
---

 Summary: Coordinator changes cause Connect integration test to fail
 Key: KAFKA-7909
 URL: https://issues.apache.org/jira/browse/KAFKA-7909
 Project: Kafka
  Issue Type: Bug
Affects Versions: 2.2.0
Reporter: Arjun Satish
 Fix For: 2.2.0


We recently introduced integration tests in Connect. This test spins up one or 
more Connect workers along with a Kafka broker and Zk in a single process and 
attempts to move records using a Connector. In the Example Integration Test, we 
spin up three workers each hosting a Connector task that consumes records from 
a Kafka topic. When the connector starts up, it may go through multiple rounds 
of rebalancing. We have noticed the following two problems over the last few 
days:
 # After members join a group, there are no pendingMembers remaining, but the 
join group method does not complete and never signals these members that they 
are ready to start consuming from their respective partitions.
 # Because of quick rebalances, a consumer might have started a group, but 
Connect starts a rebalance, after which we create three new instances of the 
consumer (one from each worker/task). Yet the group coordinator seems to have 
4 members in the group, which causes the JoinGroup to stall indefinitely (see 
the diagnostic sketch below).
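
One way to observe the stray fourth member while the test is stuck is to 
describe the group with the admin client. A minimal diagnostic sketch, 
assuming a local test broker on localhost:9092 and a hypothetical group id 
"connect-example-group" (the real test derives its group id from the 
connector configuration):

{code:java}
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

import java.util.Collections;
import java.util.Properties;

public class GroupMemberCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: embedded test broker
        try (AdminClient admin = AdminClient.create(props)) {
            ConsumerGroupDescription group = admin
                .describeConsumerGroups(Collections.singletonList("connect-example-group"))
                .all().get()
                .get("connect-example-group");
            // A healthy three-worker run should report 3 members; the bug
            // manifests as a stale 4th member and a group stuck rebalancing.
            System.out.println("state=" + group.state()
                + " members=" + group.members().size());
        }
    }
}
{code}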



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7909) Coordinator changes cause Connect integration test to fail

2019-02-08 Thread Arjun Satish (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arjun Satish updated KAFKA-7909:

Component/s: core

> Coordinator changes cause Connect integration test to fail
> --
>
> Key: KAFKA-7909
> URL: https://issues.apache.org/jira/browse/KAFKA-7909
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.2.0
>Reporter: Arjun Satish
>Priority: Blocker
> Fix For: 2.2.0
>
>
> We recently introduced integration tests in Connect. This test spins up one 
> or more Connect workers along with a Kafka broker and Zk in a single process 
> and attempts to move records using a Connector. In the Example Integration 
> Test, we spin up three workers each hosting a Connector task that consumes 
> records from a Kafka topic. When the connector starts up, it may go through 
> multiple rounds of rebalancing. We have noticed the following two problems 
> over the last few days:
>  # After members join a group, there are no pendingMembers remaining, but the 
> join group method does not complete and never signals these members that 
> they are ready to start consuming from their respective partitions.
>  # Because of quick rebalances, a consumer might have started a group, but 
> Connect starts a rebalance, after which we create three new instances of the 
> consumer (one from each worker/task). Yet the group coordinator seems to 
> have 4 members in the group, which causes the JoinGroup to stall indefinitely. 
> Even though this ticket is described in the context of Connect, it may be 
> applicable to general consumers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7909) Coordinator changes cause Connect integration test to fail

2019-02-08 Thread Arjun Satish (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arjun Satish updated KAFKA-7909:

Component/s: consumer

> Coordinator changes cause Connect integration test to fail
> --
>
> Key: KAFKA-7909
> URL: https://issues.apache.org/jira/browse/KAFKA-7909
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, core
>Affects Versions: 2.2.0
>Reporter: Arjun Satish
>Priority: Blocker
> Fix For: 2.2.0
>
>
> We recently introduced integration tests in Connect. This test spins up one 
> or more Connect workers along with a Kafka broker and Zk in a single process 
> and attempts to move records using a Connector. In the Example Integration 
> Test, we spin up three workers each hosting a Connector task that consumes 
> records from a Kafka topic. When the connector starts up, it may go through 
> multiple rounds of rebalancing. We have noticed the following two problems 
> over the last few days:
>  # After members join a group, there are no pendingMembers remaining, but the 
> join group method does not complete and never signals these members that 
> they are ready to start consuming from their respective partitions.
>  # Because of quick rebalances, a consumer might have started a group, but 
> Connect starts a rebalance, after which we create three new instances of the 
> consumer (one from each worker/task). Yet the group coordinator seems to 
> have 4 members in the group, which causes the JoinGroup to stall indefinitely. 
> Even though this ticket is described in the context of Connect, it may be 
> applicable to general consumers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7909) Coordinator changes cause Connect integration test to fail

2019-02-08 Thread Arjun Satish (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arjun Satish updated KAFKA-7909:

Description: 
We recently introduced integration tests in Connect. This test spins up one or 
more Connect workers along with a Kafka broker and Zk in a single process and 
attempts to move records using a Connector. In the Example Integration Test, we 
spin up three workers each hosting a Connector task that consumes records from 
a Kafka topic. When the connector starts up, it may go through multiple rounds 
of rebalancing. We have noticed the following two problems over the last few 
days:
 # After members join a group, there are no pendingMembers remaining, but the 
join group method does not complete and never signals these members that they 
are ready to start consuming from their respective partitions.
 # Because of quick rebalances, a consumer might have started a group, but 
Connect starts a rebalance, after which we create three new instances of the 
consumer (one from each worker/task). Yet the group coordinator seems to have 
4 members in the group, which causes the JoinGroup to stall indefinitely.

Even though this ticket is described in the context of Connect, it may be 
applicable to general consumers.

  was:
We recently introduced integration tests in Connect. This test spins up one or 
more Connect workers along with a Kafka broker and Zk in a single process and 
attempts to move records using a Connector. In the Example Integration Test, we 
spin up three workers each hosting a Connector task that consumes records from 
a Kafka topic. When the connector starts up, it may go through multiple rounds 
of rebalancing. We have noticed the following two problems over the last few 
days:
 # After members join a group, there are no pendingMembers remaining, but the 
join group method does not complete and never signals these members that they 
are ready to start consuming from their respective partitions.
 # Because of quick rebalances, a consumer might have started a group, but 
Connect starts a rebalance, after which we create three new instances of the 
consumer (one from each worker/task). Yet the group coordinator seems to have 
4 members in the group, which causes the JoinGroup to stall indefinitely.


> Coordinator changes cause Connect integration test to fail
> --
>
> Key: KAFKA-7909
> URL: https://issues.apache.org/jira/browse/KAFKA-7909
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: Arjun Satish
>Priority: Blocker
> Fix For: 2.2.0
>
>
> We recently introduced integration tests in Connect. This test spins up one 
> or more Connect workers along with a Kafka broker and Zk in a single process 
> and attempts to move records using a Connector. In the Example Integration 
> Test, we spin up three workers each hosting a Connector task that consumes 
> records from a Kafka topic. When the connector starts up, it may go through 
> multiple rounds of rebalancing. We have noticed the following two problems 
> over the last few days:
>  # After members join a group, there are no pendingMembers remaining, but the 
> join group method does not complete and never signals these members that 
> they are ready to start consuming from their respective partitions.
>  # Because of quick rebalances, a consumer might have started a group, but 
> Connect starts a rebalance, after which we create three new instances of the 
> consumer (one from each worker/task). Yet the group coordinator seems to 
> have 4 members in the group, which causes the JoinGroup to stall indefinitely. 
> Even though this ticket is described in the context of Connect, it may be 
> applicable to general consumers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7909) Coordinator changes cause Connect integration test to fail

2019-02-08 Thread Arjun Satish (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arjun Satish updated KAFKA-7909:

Description: 
We recently introduced integration tests in Connect. This test spins up one or 
more Connect workers along with a Kafka broker and Zk in a single process and 
attempts to move records using a Connector. In the [Example Integration 
Test|https://github.com/apache/kafka/blob/3c73633/connect/runtime/src/test/java/org/apache/kafka/connect/integration/ExampleConnectIntegrationTest.java#L105],
 we spin up three workers each hosting a Connector task that consumes records 
from a Kafka topic. When the connector starts up, it may go through multiple 
rounds of rebalancing. We have noticed the following two problems over the 
last few days:
 # After members join a group, there are no pendingMembers remaining, but the 
join group method does not complete and never signals these members that they 
are ready to start consuming from their respective partitions.
 # Because of quick rebalances, a consumer might have started a group, but 
Connect starts a rebalance, after which we create three new instances of the 
consumer (one from each worker/task). Yet the group coordinator seems to have 
4 members in the group, which causes the JoinGroup to stall indefinitely.

Even though this ticket is described in the context of Connect, it may be 
applicable to general consumers.

  was:
We recently introduced integration tests in Connect. This test spins up one or 
more Connect workers along with a Kafka broker and Zk in a single process and 
attempts to move records using a Connector. In the Example Integration Test, we 
spin up three workers each hosting a Connector task that consumes records from 
a Kafka topic. When the connector starts up, it may go through multiple rounds 
of rebalancing. We have noticed the following two problems over the last few 
days:
 # After members join a group, there are no pendingMembers remaining, but the 
join group method does not complete and never signals these members that they 
are ready to start consuming from their respective partitions.
 # Because of quick rebalances, a consumer might have started a group, but 
Connect starts a rebalance, after which we create three new instances of the 
consumer (one from each worker/task). Yet the group coordinator seems to have 
4 members in the group, which causes the JoinGroup to stall indefinitely.

Even though this ticket is described in the context of Connect, it may be 
applicable to general consumers.


> Coordinator changes cause Connect integration test to fail
> --
>
> Key: KAFKA-7909
> URL: https://issues.apache.org/jira/browse/KAFKA-7909
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, core
>Affects Versions: 2.2.0
>Reporter: Arjun Satish
>Priority: Blocker
> Fix For: 2.2.0
>
>
> We recently introduced integration tests in Connect. This test spins up one 
> or more Connect workers along with a Kafka broker and Zk in a single process 
> and attempts to move records using a Connector. In the [Example Integration 
> Test|https://github.com/apache/kafka/blob/3c73633/connect/runtime/src/test/java/org/apache/kafka/connect/integration/ExampleConnectIntegrationTest.java#L105],
>  we spin up three workers each hosting a Connector task that consumes records 
> from a Kafka topic. When the connector starts up, it may go through multiple 
> rounds of rebalancing. We have noticed the following two problems over the 
> last few days:
>  # After members join a group, there are no pendingMembers remaining, but the 
> join group method does not complete and never signals these members that 
> they are ready to start consuming from their respective partitions.
>  # Because of quick rebalances, a consumer might have started a group, but 
> Connect starts a rebalance, after which we create three new instances of the 
> consumer (one from each worker/task). Yet the group coordinator seems to 
> have 4 members in the group, which causes the JoinGroup to stall indefinitely. 
> Even though this ticket is described in the context of Connect, it may be 
> applicable to general consumers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7910) Document retention.ms behavior with record timestamp

2019-02-08 Thread Stanislav Kozlovski (JIRA)
Stanislav Kozlovski created KAFKA-7910:
--

 Summary: Document retention.ms behavior with record timestamp
 Key: KAFKA-7910
 URL: https://issues.apache.org/jira/browse/KAFKA-7910
 Project: Kafka
  Issue Type: Improvement
Reporter: Stanislav Kozlovski


It is intuitive to believe that `log.retention.ms` starts applying once a log 
file is closed.
The documentation says:
> This configuration controls the maximum time we will retain a log before we 
> will discard old log segments to free up space if we are using the "delete" 
> retention policy.
Yet, the actual behavior is that we take into account the largest timestamp of 
that segment file 
([Log.scala#L1246|https://github.com/apache/kafka/blob/4cdbb3e5c19142d118f0f3999dd3e21deccb3643/core/src/main/scala/kafka/log/Log.scala#L1246]) 
and then consider `retention.ms` on top of that.

This means that if Kafka is configured with 
`log.message.timestamp.type=CreateTime` (as it is by default), any records 
with a future timestamp set by the producer will not get deleted when the 
initial intuition (and the documentation) of `log.retention.ms` would suggest.

We should document the behavior of `retention.ms` with the record timestamp.
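
For illustration, a minimal sketch of the retention predicate described above 
(illustrative names, not the actual code in Log.scala):

{code:java}
public class RetentionCheck {
    // A segment only becomes eligible for time-based deletion once the
    // *largest* record timestamp in it is older than retention.ms.
    static boolean eligibleForDeletion(long nowMs, long largestTimestampMs, long retentionMs) {
        return nowMs - largestTimestampMs > retentionMs;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long futureTs = now + 86_400_000L; // producer-set CreateTime one day ahead
        // Prints false: the segment survives until a full retention period
        // after the *future* timestamp, regardless of when it was rolled.
        System.out.println(eligibleForDeletion(now, futureTs, 604_800_000L));
    }
}
{code}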



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7909) Coordinator changes cause Connect integration test to fail

2019-02-08 Thread Ismael Juma (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763622#comment-16763622
 ] 

Ismael Juma commented on KAFKA-7909:


Thanks for the report. Do you know which change caused this regression?

> Coordinator changes cause Connect integration test to fail
> --
>
> Key: KAFKA-7909
> URL: https://issues.apache.org/jira/browse/KAFKA-7909
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, core
>Affects Versions: 2.2.0
>Reporter: Arjun Satish
>Priority: Blocker
> Fix For: 2.2.0
>
>
> We recently introduced integration tests in Connect. This test spins up one 
> or more Connect workers along with a Kafka broker and Zk in a single process 
> and attempts to move records using a Connector. In the [Example Integration 
> Test|https://github.com/apache/kafka/blob/3c73633/connect/runtime/src/test/java/org/apache/kafka/connect/integration/ExampleConnectIntegrationTest.java#L105],
>  we spin up three workers each hosting a Connector task that consumes records 
> from a Kafka topic. When the connector starts up, it may go through multiple 
> rounds of rebalancing. We have noticed the following two problems over the 
> last few days:
>  # After members join a group, there are no pendingMembers remaining, but the 
> join group method does not complete and never signals these members that 
> they are ready to start consuming from their respective partitions.
>  # Because of quick rebalances, a consumer might have started a group, but 
> Connect starts a rebalance, after which we create three new instances of the 
> consumer (one from each worker/task). Yet the group coordinator seems to 
> have 4 members in the group, which causes the JoinGroup to stall indefinitely. 
> Even though this ticket is described in the context of Connect, it may be 
> applicable to general consumers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7897) Invalid use of epoch cache following message format downgrade

2019-02-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763766#comment-16763766
 ] 

ASF GitHub Bot commented on KAFKA-7897:
---

hachikuji commented on pull request #6232: KAFKA-7897; Disable leader epoch 
cache when older message formats are used
URL: https://github.com/apache/kafka/pull/6232

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Invalid use of epoch cache following message format downgrade
> -
>
> Key: KAFKA-7897
> URL: https://issues.apache.org/jira/browse/KAFKA-7897
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
>
> Message format downgrades are not supported, but they generally work as long 
> as brokers and clients can still parse both message formats. After a 
> downgrade, the truncation logic should revert to using the high watermark, 
> but currently we use the existence of any cached epoch as the sole 
> prerequisite for leveraging OffsetsForLeaderEpoch. This can result in massive 
> truncation after startup, which causes re-replication.
> I think our options to fix this are to either 1) clear the cache when we 
> notice a downgrade, or 2) forbid downgrades and raise an error (a sketch of 
> option 1 follows below).
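> For illustration, a hedged sketch of option 1's gating check (illustrative 
> names, not the actual broker code): only consult the leader epoch cache when 
> the active message format supports leader epochs (magic v2, i.e. 0.11+).
> {code:java}
> public class EpochCacheGate {
>     // Fall back to the high watermark for truncation whenever the configured
>     // message format predates leader epochs (magic v2), even if stale epochs
>     // are still cached.
>     static boolean useEpochCacheForTruncation(byte recordFormatMagic, boolean epochCacheNonEmpty) {
>         return recordFormatMagic >= 2 && epochCacheNonEmpty;
>     }
> }
> {code}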



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7897) Invalid use of epoch cache with old message format versions

2019-02-08 Thread Jason Gustafson (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-7897:
---
Summary: Invalid use of epoch cache with old message format versions  (was: 
Invalid use of epoch cache following message format downgrade)

> Invalid use of epoch cache with old message format versions
> ---
>
> Key: KAFKA-7897
> URL: https://issues.apache.org/jira/browse/KAFKA-7897
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
>
> Message format downgrades are not supported, but they generally work as long 
> as brokers and clients can still parse both message formats. After a 
> downgrade, the truncation logic should revert to using the high watermark, 
> but currently we use the existence of any cached epoch as the sole 
> prerequisite for leveraging OffsetsForLeaderEpoch. This can result in massive 
> truncation after startup, which causes re-replication.
> I think our options to fix this are to either 1) clear the cache when we 
> notice a downgrade, or 2) forbid downgrades and raise an error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-7909) Coordinator changes cause Connect integration test to fail

2019-02-08 Thread Arjun Satish (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arjun Satish reassigned KAFKA-7909:
---

Assignee: Arjun Satish

> Coordinator changes cause Connect integration test to fail
> --
>
> Key: KAFKA-7909
> URL: https://issues.apache.org/jira/browse/KAFKA-7909
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, core
>Affects Versions: 2.2.0
>Reporter: Arjun Satish
>Assignee: Arjun Satish
>Priority: Blocker
> Fix For: 2.2.0
>
>
> We recently introduced integration tests in Connect. This test spins up one 
> or more Connect workers along with a Kafka broker and Zk in a single process 
> and attempts to move records using a Connector. In the [Example Integration 
> Test|https://github.com/apache/kafka/blob/3c73633/connect/runtime/src/test/java/org/apache/kafka/connect/integration/ExampleConnectIntegrationTest.java#L105],
>  we spin up three workers each hosting a Connector task that consumes records 
> from a Kafka topic. When the connector starts up, it may go through multiple 
> rounds of rebalancing. We have noticed the following two problems over the 
> last few days:
>  # After members join a group, there are no pendingMembers remaining, but the 
> join group method does not complete and never signals these members that 
> they are ready to start consuming from their respective partitions.
>  # Because of quick rebalances, a consumer might have started a group, but 
> Connect starts a rebalance, after which we create three new instances of the 
> consumer (one from each worker/task). Yet the group coordinator seems to 
> have 4 members in the group, which causes the JoinGroup to stall indefinitely. 
> Even though this ticket is described in the context of Connect, it may be 
> applicable to general consumers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7864) AdminZkClient.validateTopicCreate() should validate that partitions are 0-based

2019-02-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763858#comment-16763858
 ] 

ASF GitHub Bot commented on KAFKA-7864:
---

ctchenn commented on pull request #6246: KAFKA-7864; validate partitions are 
0-based
URL: https://github.com/apache/kafka/pull/6246
 
 
In AdminZkClient.validateTopicCreate(), check that the partition ids are 
consecutive and 0-based.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> AdminZkClient.validateTopicCreate() should validate that partitions are 
> 0-based
> ---
>
> Key: KAFKA-7864
> URL: https://issues.apache.org/jira/browse/KAFKA-7864
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jun Rao
>Assignee: Ryan
>Priority: Major
>  Labels: newbie
>
> AdminZkClient.validateTopicCreate() currently doesn't validate that partition 
> ids in a topic are consecutive, starting from 0. The client code depends on 
> this invariant, so it would be useful to tighten up the check (see the sketch 
> below).
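> A minimal sketch of such a check (illustrative Java, not the actual Scala 
> code in AdminZkClient):
> {code:java}
> import java.util.Set;
>
> public class PartitionIdValidation {
>     // Partition ids must be exactly 0..n-1: consecutive and 0-based.
>     static void validatePartitionIds(Set<Integer> partitionIds) {
>         for (int i = 0; i < partitionIds.size(); i++) {
>             if (!partitionIds.contains(i))
>                 throw new IllegalArgumentException(
>                     "Partition ids must be consecutive and 0-based; missing id " + i);
>         }
>     }
> }
> {code}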



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7884) Docs for message.format.version and log.message.format.version show invalid (corrupt?) "valid values"

2019-02-08 Thread Colin P. McCabe (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin P. McCabe updated KAFKA-7884:
---
Fix Version/s: (was: 2.1.1)

> Docs for message.format.version and log.message.format.version show invalid 
> (corrupt?) "valid values"
> -
>
> Key: KAFKA-7884
> URL: https://issues.apache.org/jira/browse/KAFKA-7884
> Project: Kafka
>  Issue Type: Bug
>  Components: documentation
>Reporter: James Cheng
>Assignee: Lee Dongjin
>Priority: Major
> Fix For: 2.2.0
>
>
> In the docs for message.format.version and log.message.format.version, the 
> list of valid values is
>  
> {code:java}
> kafka.api.ApiVersionValidator$@56aac163 
> {code}
>  
> It appears it's simply doing a .toString() on the validator class/instance.
> At a minimum, we should remove this java-y-ness.
> Even better, it should show all the valid values.
>  
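> A hedged sketch of the nicer behavior (illustrative names; Kafka's actual 
> validator lives in kafka.api.ApiVersionValidator): the generated docs render 
> the validator's toString(), so overriding it to list the versions fixes the 
> output.
> {code:java}
> import org.apache.kafka.common.config.ConfigDef;
> import org.apache.kafka.common.config.ConfigException;
> import java.util.List;
>
> public class VersionValidator implements ConfigDef.Validator {
>     private final List<String> validVersions; // assumption: built from the known ApiVersions
>
>     public VersionValidator(List<String> validVersions) {
>         this.validVersions = validVersions;
>     }
>
>     @Override
>     public void ensureValid(String name, Object value) {
>         if (!validVersions.contains(String.valueOf(value)))
>             throw new ConfigException(name, value, "Must be one of: " + validVersions);
>     }
>
>     @Override
>     public String toString() { // rendered as "Valid Values" in the docs table
>         return String.join(", ", validVersions);
>     }
> }
> {code}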



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7880) KafkaConnect should standardize worker thread name

2019-02-08 Thread Colin P. McCabe (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin P. McCabe updated KAFKA-7880:
---
Fix Version/s: (was: 2.1.1)

> KafkaConnect should standardize worker thread name
> --
>
> Key: KAFKA-7880
> URL: https://issues.apache.org/jira/browse/KAFKA-7880
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Affects Versions: 2.1.0
>Reporter: YeLiang
>Assignee: YeLiang
>Priority: Minor
>
> KafkaConnect creates a WorkerTask for each task assigned to it and then 
> submits the tasks to a thread pool.
> However, the 
> [Worker|https://github.com/apache/kafka/blob/2.1.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java]
>  class initializes its thread pool using a default ThreadFactory, so the 
> thread names follow the pattern pool-[0-9]\-thread\-[0-9].
> When we are running KafkaConnect and find that one of the task threads is 
> under high CPU usage, it is difficult to find out which task is under high 
> load: even if we know the exact pid of the hot thread, a stack dump of 
> KafkaConnect only shows a list of threads named pool-[0-9]\-thread\-[0-9].
> It would be very helpful if worker threads were named like 
> connectorName-taskId (see the sketch below).
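> A hedged sketch of one way to get such names (illustrative, not Connect's 
> actual code): rename the pooled thread for the duration of each submitted 
> task, restoring the name afterwards so pool reuse stays tidy.
> {code:java}
> import java.util.concurrent.ExecutorService;
>
> public class NamedTaskSubmitter {
>     static void submitNamed(ExecutorService executor, String connectorName,
>                             int taskId, Runnable task) {
>         executor.submit(() -> {
>             Thread current = Thread.currentThread();
>             String original = current.getName();
>             current.setName("task-thread-" + connectorName + "-" + taskId);
>             try {
>                 task.run(); // stack dumps now show e.g. "task-thread-my-sink-0"
>             } finally {
>                 current.setName(original); // restore before returning to the pool
>             }
>         });
>     }
> }
> {code}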



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7876) Broker suddenly got disconnected

2019-02-08 Thread Colin P. McCabe (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin P. McCabe updated KAFKA-7876:
---
Fix Version/s: (was: 2.1.1)

> Broker suddenly got disconnected 
> -
>
> Key: KAFKA-7876
> URL: https://issues.apache.org/jira/browse/KAFKA-7876
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, network
>Affects Versions: 2.1.0
>Reporter: Anthony Lazam
>Priority: Critical
> Fix For: 2.2.0
>
> Attachments: kafka-issue.png
>
>
>  
> We have a 3-node cluster setup. There are scenarios where one of the brokers 
> suddenly gets disconnected from the cluster but no underlying system issue is 
> found. The node that got disconnected wasn't able to release the partitions 
> it holds as the leader, hence clients (spring-boot) were unable to 
> send/receive data from the affected broker.
> We noticed that it always happens to the broker that is the active 
> controller.
> Environment details:
> Provider: AWS
>  Kernel: 3.10.0-693.21.1.el7.x86_64
>  OS: CentOS Linux release 7.5.1804 (Core)
>  Scala version: 2.11
>  Kafka version: 2.1.0
>  Kafka config:
> {code:java}
> # Socket Server Settings #
> num.network.threads=3
> num.io.threads=8
> socket.send.buffer.bytes=102400
> socket.receive.buffer.bytes=102400
> socket.request.max.bytes=104857600
> # Log Basics #
> num.partitions=1
> num.recovery.threads.per.data.dir=1
> # Internal Topic Settings #
> offsets.topic.replication.factor=3
> transaction.state.log.replication.factor=3
> transaction.state.log.min.isr=2
> # Log Retention Policy #
> log.retention.hours=168
> log.segment.bytes=1073741824
> log.retention.check.interval.ms=30
> # Group Coordinator Settings #
> group.initial.rebalance.delay.ms=0
> # Zookeeper #
> zookeeper.connection.timeout.ms=6000
> broker.id=1
> zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
> log.dirs=/data/kafka-node
> advertised.listeners=PLAINTEXT://node1:9092
> {code}
> Broker disconnected controller log:
> {code:java}
> [2019-01-26 05:03:52,512] TRACE [Controller id=2] Checking need to trigger 
> auto leader balancing (kafka.controller.KafkaController)
> [2019-01-26 05:03:52,513] DEBUG [Controller id=2] Preferred replicas by 
> broker Map(TOPICS->MAP) (kafka.controller.KafkaController)
> [2019-01-26 05:03:52,513] DEBUG [Controller id=2] Topics not in preferred 
> replica for broker 2 Map() (kafka.controller.KafkaController)
> [2019-01-26 05:03:52,513] TRACE [Controller id=2] Leader imbalance ratio for 
> broker 2 is 0.0 (kafka.controller.KafkaController)
> [2019-01-26 05:03:52,513] DEBUG [Controller id=2] Topics not in preferred 
> replica for broker 1 Map() (kafka.controller.KafkaController)
> [2019-01-26 05:03:52,513] TRACE [Controller id=2] Leader imbalance ratio for 
> broker 1 is 0.0 (kafka.controller.KafkaController)
> [2019-01-26 05:03:52,513] DEBUG [Controller id=2] Topics not in preferred 
> replica for broker 3 Map() (kafka.controller.KafkaController)
> [2019-01-26 05:03:52,513] TRACE [Controller id=2] Leader imbalance ratio for 
> broker 3 is 0.0 (kafka.controller.KafkaController)
> [2019-01-26 05:08:52,513] TRACE [Controller id=2] Checking need to trigger 
> auto leader balancing (kafka.controller.KafkaController)
> {code}
> Broker working server.log:
> {code:java}
> [2019-01-26 05:02:05,564] INFO [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error sending fetch request (sessionId=1637095899, 
> epoch=21379644) to node 2: java.io.IOException: Connection to 2 was 
> disconnected before the response was read. 
> (org.apache.kafka.clients.FetchSessionHandler)
> [2019-01-26 05:02:05,573] WARN [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=3, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={PlayerGameRounds-8=(offset=2171960, logStartOffset=1483356, 
> maxBytes=1048576, currentLeaderEpoch=Optional[2])}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessio
> nId=1637095899, epoch=21379644)) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 2 was disconnected before the response was 
> read
> at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
> at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
> at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
> at 
> 

[jira] [Commented] (KAFKA-7909) Coordinator changes cause Connect integration test to fail

2019-02-08 Thread Arjun Satish (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763812#comment-16763812
 ] 

Arjun Satish commented on KAFKA-7909:
-

[~ijuma] I believe [this 
commit|https://github.com/apache/kafka/commit/9a9310d074ead70ebf3e93d29d880e094b9080f6]
 caused the regression. 

> Coordinator changes cause Connect integration test to fail
> --
>
> Key: KAFKA-7909
> URL: https://issues.apache.org/jira/browse/KAFKA-7909
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, core
>Affects Versions: 2.2.0
>Reporter: Arjun Satish
>Priority: Blocker
> Fix For: 2.2.0
>
>
> We recently introduced integration tests in Connect. This test spins up one 
> or more Connect workers along with a Kafka broker and Zk in a single process 
> and attempts to move records using a Connector. In the [Example Integration 
> Test|https://github.com/apache/kafka/blob/3c73633/connect/runtime/src/test/java/org/apache/kafka/connect/integration/ExampleConnectIntegrationTest.java#L105],
>  we spin up three workers each hosting a Connector task that consumes records 
> from a Kafka topic. When the connector starts up, it may go through multiple 
> rounds of rebalancing. We have noticed the following two problems over the 
> last few days:
>  # After members join a group, there are no pendingMembers remaining, but the 
> join group method does not complete and never signals these members that 
> they are ready to start consuming from their respective partitions.
>  # Because of quick rebalances, a consumer might have started a group, but 
> Connect starts a rebalance, after which we create three new instances of the 
> consumer (one from each worker/task). Yet the group coordinator seems to 
> have 4 members in the group, which causes the JoinGroup to stall indefinitely. 
> Even though this ticket is described in the context of Connect, it may be 
> applicable to general consumers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7911) Add Timezone Support for Windowed Aggregations

2019-02-08 Thread Matthias J. Sax (JIRA)
Matthias J. Sax created KAFKA-7911:
--

 Summary: Add Timezone Support for Windowed Aggregations
 Key: KAFKA-7911
 URL: https://issues.apache.org/jira/browse/KAFKA-7911
 Project: Kafka
  Issue Type: New Feature
  Components: streams
Reporter: Matthias J. Sax


Currently, Kafka Streams only supports UTC timestamps. The impact is that 
`TimeWindows` are based on UTC time only. This is problematic for 24h windows, 
because windows are built aligned to UTC days, not to your local time zone.

While it's possible to "shift" timestamps as a workaround (see the sketch 
below), it would be better to allow native timezone support.
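
A hedged sketch of the shifting workaround (illustrative; the zone id is an 
assumption, and downstream consumers must shift window boundaries back): a 
custom TimestampExtractor that offsets each record timestamp by the local 
zone's UTC offset, so UTC-aligned 24h windows line up with local calendar days.

{code:java}
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

import java.util.TimeZone;

public class ZoneShiftingExtractor implements TimestampExtractor {
    // Assumption: a fixed zone; note DST transitions still skew two days a year.
    private final TimeZone zone = TimeZone.getTimeZone("America/Los_Angeles");

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        long ts = record.timestamp();
        return ts + zone.getOffset(ts); // shift into local wall-clock time
    }
}
{code}

The extractor would be configured via 
`StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG`.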



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7912) In-memory key-value store does not support concurrent access

2019-02-08 Thread Sophie Blee-Goldman (JIRA)
Sophie Blee-Goldman created KAFKA-7912:
--

 Summary: In-memory key-value store does not support concurrent 
access 
 Key: KAFKA-7912
 URL: https://issues.apache.org/jira/browse/KAFKA-7912
 Project: Kafka
  Issue Type: Bug
Reporter: Sophie Blee-Goldman
Assignee: Sophie Blee-Goldman


Currently, the in-memory key-value store uses a Map to store key-value pairs 
and fetches them by calling subMap and returning an iterator over this submap. 
This is unsafe, as the submap is just a view of the original map, and there is 
a risk of concurrent modification while the iterator is in use (a minimal 
reproduction follows below).
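
A minimal reproduction of the hazard with a plain TreeMap (illustrative; the 
store's actual map type may differ):

{code:java}
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

public class SubMapHazard {
    public static void main(String[] args) {
        TreeMap<String, String> store = new TreeMap<>();
        store.put("a", "1");
        store.put("b", "2");
        store.put("c", "3");

        // subMap returns a live view of the backing map; its iterator is fail-fast.
        Iterator<Map.Entry<String, String>> range =
            store.subMap("a", "c").entrySet().iterator();

        store.put("ab", "x"); // a concurrent writer would do this from another thread

        range.next(); // throws ConcurrentModificationException
    }
}
{code}

Copying the submap's contents before returning the iterator, or backing the 
store with a concurrent navigable map, would avoid the view problem.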



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7466) Implement KIP-339: Create a new IncrementalAlterConfigs API

2019-02-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764001#comment-16764001
 ] 

ASF GitHub Bot commented on KAFKA-7466:
---

omkreddy commented on pull request #6247: KAFKA-7466: Add 
IncrementalAlterConfigs API (KIP-339)
URL: https://github.com/apache/kafka/pull/6247

- https://cwiki.apache.org/confluence/display/KAFKA/KIP-339%3A+Create+a+new+IncrementalAlterConfigs+API
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement KIP-339: Create a new IncrementalAlterConfigs API
> ---
>
> Key: KAFKA-7466
> URL: https://issues.apache.org/jira/browse/KAFKA-7466
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Colin P. McCabe
>Assignee: Manikumar
>Priority: Major
>
> Implement KIP-339: Create a new IncrementalAlterConfigs API



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)