[jira] [Comment Edited] (KAFKA-4166) TestMirrorMakerService.test_bounce transient system test failure

2016-12-13 Thread Roger Hoover (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747454#comment-15747454
 ] 

Roger Hoover edited comment on KAFKA-4166 at 12/14/16 6:37 AM:
---

It happened twice more:

{code}
Module: kafkatest.tests.core.mirror_maker_test
Class:  TestMirrorMakerService
Method: test_bounce
Arguments:
{
  "clean_shutdown": true,
  "new_consumer": false,
  "offsets_storage": "kafka"
}
{code}

http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-12-13--001.1481621566--apache--trunk--21d7e6f/TestMirrorMakerService/test_bounce/clean_shutdown%3DTrue.offsets_storage%3Dkafka.new_consumer%3DFalse/51.tgz

and

{code}
Module: kafkatest.tests.core.mirror_maker_test
Class: TestMirrorMakerService
Method: test_simple_end_to_end
Arguments:
{
  "security_protocol": "PLAINTEXT",
  "new_consumer": false
}
{code}

http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-12-13--001.1481621566--apache--trunk--21d7e6f/TestMirrorMakerService/test_simple_end_to_end/security_protocol%3DPLAINTEXT.new_consumer%3DFalse/55.tgz




was (Author: theduderog):
It happened twice more on these tests:

{code}
"Module: kafkatest.tests.core.mirror_maker_test
Class:  TestMirrorMakerService
Method: test_bounce
Arguments:
{
  "clean_shutdown": true,
  "new_consumer": false,
  "offsets_storage": "kafka"
}
{code}

http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-12-13--001.1481621566--apache--trunk--21d7e6f/TestMirrorMakerService/test_bounce/clean_shutdown%3DTrue.offsets_storage%3Dkafka.new_consumer%3DFalse/51.tgz

and

{code}
"Module: kafkatest.tests.core.mirror_maker_test
Class: TestMirrorMakerService
Method: test_simple_end_to_end
Arguments: {
   ""security_protocol': ""PLAINTEXT"",
""new_consumer"": false
}"
{code}

http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-12-13--001.1481621566--apache--trunk--21d7e6f/TestMirrorMakerService/test_simple_end_to_end/security_protocol%3DPLAINTEXT.new_consumer%3DFalse/55.tgz



> TestMirrorMakerService.test_bounce transient system test failure
> 
>
> Key: KAFKA-4166
> URL: https://issues.apache.org/jira/browse/KAFKA-4166
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ismael Juma
>  Labels: transient-system-test-failure
>
> We've only seen one failure so far and it's a timeout error so it could be an 
> environment issue. Filing it here so that we can track it in case there are 
> additional failures:
> {code}
> Module: kafkatest.tests.core.mirror_maker_test
> Class:  TestMirrorMakerService
> Method: test_bounce
> Arguments:
> {
>   "clean_shutdown": true,
>   "new_consumer": true,
>   "security_protocol": "SASL_SSL"
> }
> {code}
>  
> {code}
> test_id:
> 2016-09-12--001.kafkatest.tests.core.mirror_maker_test.TestMirrorMakerService.test_bounce.clean_shutdown=True.security_protocol=SASL_SSL.new_consumer=True
> status: FAIL
> run time:   3 minutes 30.354 seconds
> 
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape/tests/runner.py",
>  line 106, in run_all_tests
> data = self.run_single_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape/tests/runner.py",
>  line 162, in run_single_test
> return self.current_test_context.function(self.current_test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape/mark/_mark.py",
>  line 331, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/core/mirror_maker_test.py",
>  line 178, in test_bounce
> self.run_produce_consume_validate(core_test_action=lambda: 
> self.bounce(clean_shutdown=clean_shutdown))
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 79, in run_produce_consume_validate
> raise e
> TimeoutError
> {code}
>  
> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-09-12--001.1473700895--apache--trunk--a7ab9cb/TestMirrorMakerService/test_bounce/clean_shutdown=True.security_protocol=SASL_SSL.new_consumer=True.tgz



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1696) Kafka should be able to generate Hadoop delegation tokens

2016-12-13 Thread Manikumar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747403#comment-15747403
 ] 

Manikumar Reddy commented on KAFKA-1696:


[~singhashish] Yes, please go ahead and create sub-JIRAs and PRs, so that we 
avoid any duplicated effort.

> Kafka should be able to generate Hadoop delegation tokens
> -
>
> Key: KAFKA-1696
> URL: https://issues.apache.org/jira/browse/KAFKA-1696
> Project: Kafka
>  Issue Type: Sub-task
>  Components: security
>Reporter: Jay Kreps
>Assignee: Parth Brahmbhatt
>
> For access from MapReduce/etc jobs run on behalf of a user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-98: Exactly Once Delivery and Transactional Messaging

2016-12-13 Thread Ben Kirwin
Hi Apurva,

Thanks for the detailed answers... and sorry for the late reply!

It does sound like, if the input-partitions-to-app-id mapping never
changes, the existing fencing mechanisms should prevent duplicates. Great!
I'm a bit concerned the proposed API will be delicate to program against
successfully -- even in the simple case, we need to create a new producer
instance per input partition, and anything fancier is going to need its own
implementation of the Streams/Samza-style 'task' idea -- but that may be
fine for this sort of advanced feature.

For the second question, I notice that Jason also elaborated on this
downthread:

> We also looked at removing the producer ID.
> This was discussed somewhere above, but basically the idea is to store the
> AppID in the message set header directly and avoid the mapping to producer
> ID altogether. As long as batching isn't too bad, the impact on total size
> may not be too bad, but we were ultimately more comfortable with a fixed
> size ID.

...which suggests that the distinction is useful for performance, but not
necessary for correctness, which makes good sense to me. (Would a 128-bit
ID be a reasonable compromise? That's enough room for a UUID, or a
reasonable hash of an arbitrary string, and has only a marginal increase on
the message size.)
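
(To put the size question in perspective -- a back-of-the-envelope sketch only;
the variable names are illustrative and this is not the KIP's wire format:)

{code}
import struct
import uuid

producer_id = struct.pack(">q", 42)  # a fixed-size int64 producer ID: 8 bytes
app_id_128 = uuid.uuid4().bytes      # a 128-bit ID, room for a full UUID: 16 bytes

print(len(producer_id), len(app_id_128))  # 8 16 -> 8 extra bytes per message-set header
{code}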

On Tue, Dec 6, 2016 at 11:44 PM, Apurva Mehta  wrote:

> Hi Ben,
>
> Now, on to your first question of how to deal with consumer rebalances. The
> short answer is that the application needs to ensure that the
> assignment of input partitions to appId is consistent across rebalances.
>
> For Kafka streams, they already ensure that the mapping of input partitions
> to task Id is invariant across rebalances by implementing a custom sticky
> assignor. Other non-streams apps can trivially have one producer per input
> partition and have the appId be the same as the partition number to achieve
> the same effect.
>
> With this precondition in place, we can maintain transactions across
> rebalances.
>
> Hope this answers your question.
>
> Thanks,
> Apurva
>
> On Tue, Dec 6, 2016 at 3:22 PM, Ben Kirwin  wrote:
>
> > Thanks for this! I'm looking forward to going through the full proposal
> in
> > detail soon; a few early questions:
> >
> > First: what happens when a consumer rebalances in the middle of a
> > transaction? The full documentation suggests that such a transaction
> ought
> > to be rejected:
> >
> > > [...] if a rebalance has happened and this consumer
> > > instance becomes a zombie, even if this offset message is appended in
> the
> > > offset topic, the transaction will be rejected later on when it tries
> to
> > > commit the transaction via the EndTxnRequest.
> >
> > ...but it's unclear to me how we ensure that a transaction can't complete
> > if a rebalance has happened. (It's quite possible I'm missing something
> > obvious!)
> >
> > As a concrete example: suppose a process with PID 1 adds offsets for some
> > partition to a transaction; a consumer rebalance happens that assigns the
> > partition to a process with PID 2, which adds some offsets to its current
> > transaction; both processes try and commit. Allowing both commits would
> > cause the messages to be processed twice -- how is that avoided?
> >
> > Second: App IDs normally map to a single PID. It seems like one could do
> > away with the PID concept entirely, and just use App IDs in most places
> > that require a PID. This feels like it would be significantly simpler,
> > though it does increase the message size. Are there other reasons why the
> > App ID / PID split is necessary?
> >
> > On Wed, Nov 30, 2016 at 2:19 PM, Guozhang Wang 
> wrote:
> >
> > > Hi all,
> > >
> > > I have just created KIP-98 to enhance Kafka with exactly once delivery
> > > semantics:
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
> > >
> > > This KIP adds a transactional messaging mechanism along with an
> > idempotent
> > > producer implementation to make sure that 1) duplicated messages sent
> > from
> > > the same identified producer can be detected on the broker side, and
> 2) a
> > > group of messages sent within a transaction will atomically be either
> > > reflected and fetchable to consumers or not as a whole.
> > >
> > > The above wiki page provides a high-level view of the proposed changes
> as
> > > well as summarized guarantees. Initial draft of the detailed
> > implementation
> > > design is described in this Google doc:
> > >
> > > https://docs.google.com/document/d/11Jqy_GjUGtdXJK94XGsEIK7CP1SnQGdp2eF0wSw9ra8
> > >
> > >
> > > We would love to hear your comments and suggestions.
> > >
> > > Thanks,
> > >
> > > -- Guozhang
> > >
> >
>


[jira] [Commented] (KAFKA-4526) Transient failure in ThrottlingTest.test_throttled_reassignment

2016-12-13 Thread Roger Hoover (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747308#comment-15747308
 ] 

Roger Hoover commented on KAFKA-4526:
-

This happened again on the Dec 13 nightly run.

> Transient failure in ThrottlingTest.test_throttled_reassignment
> ---
>
> Key: KAFKA-4526
> URL: https://issues.apache.org/jira/browse/KAFKA-4526
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ewen Cheslack-Postava
>Assignee: Jason Gustafson
>  Labels: system-test-failure, system-tests
> Fix For: 0.10.2.0
>
>
> This test is seeing transient failures sometimes
> {quote}
> Module: kafkatest.tests.core.throttling_test
> Class:  ThrottlingTest
> Method: test_throttled_reassignment
> Arguments:
> {
>   "bounce_brokers": false
> }
> {quote}
> This happens with both bounce_brokers = true and false. Fails with
> {quote}
> AssertionError: 1646 acked message did not make it to the Consumer. They are: 
> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19...plus 
> 1626 more. Total Acked: 174799, Total Consumed: 173153. We validated that the 
> first 1000 of these missing messages correctly made it into Kafka's data 
> files. This suggests they were lost on their way to the consumer.
> {quote}
> See 
> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-12-12--001.1481535295--apache--trunk--62e043a/report.html
>  for an example.
> Note that there are a number of similar bug reports for different tests: 
> https://issues.apache.org/jira/issues/?jql=text%20~%20%22acked%20message%20did%20not%20make%20it%20to%20the%20Consumer%22%20and%20project%20%3D%20Kafka
>  I am wondering if we have a wrong ack setting somewhere that we should be 
> specifying as acks=all but is only defaulting to 0?
> It also seems interesting that the missing messages in these recent failures 
> seem to always start at 0...
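
For context on the acks question raised above, a minimal kafka-python
illustration of the two settings being contrasted (the system tests themselves
drive the Java VerifiableProducer, so this only shows the semantics):

{code}
from kafka import KafkaProducer  # kafka-python; illustration only

# acks=0: the producer does not wait for any acknowledgement, so records can be
# lost silently between producer and broker.
fire_and_forget = KafkaProducer(bootstrap_servers="localhost:9092", acks=0)

# acks="all": the leader acknowledges only after the full in-sync replica set
# has the record, which is what "no acked message may go missing" relies on.
durable = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
{code}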



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747271#comment-15747271
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 12/14/16 4:59 AM:


Hi [~apurva],

While I await the issue occurring again so I can provide further logs for you,
I have a query on the comment above.

While there is, by the sounds of it, a possible deadlock preventing the ISR
from re-expanding (though some stacks we have captured don't show this), the
more fundamental question is why the ISRs are shrinking in the first place.

Re the 0.10.1.1 RC: unfortunately, in the environments where we see this we
will only be able to deploy it once 0.10.1.1 is GA/tagged, as they are UAT and
PROD environments.

Maybe it's worth pushing to get 0.10.1.1 tagged and released now, without
waiting for additional fixes; as I understand it this version contains only
fixes anyway, and if issues are still detected we can follow up with a 0.10.1.2
carrying further hot fixes.

On a side note, according to others 0.10.0.0 doesn't seem to contain this issue
(we can only confirm 0.9.0.1 doesn't; we didn't run 0.10.0.0 for long before
upgrading some brokers to 0.10.1.0). Is there any way to downgrade from
0.10.1.0 to 0.10.0.0, and is there a doc for this? Obviously all docs cover
upgrade paths, not downgrades.

Cheers
Mike


was (Author: michael.andre.pearce):
Hi [~apurva],

Whilst i await the issue to occur again to provide some further logs for you.

Just reading the above comment, and a query on this. 

Whilst obviously theres by the sounds of it a possible deadlock causing the ISR 
not to re-expand (though some stacks we have captured don't show this). The 
question in the first place is why even are the ISR's shrinking in the first 
place? 

Re 0.10.1.1 RC unfortunately in the environments we see it in, we will only be 
able to deploy it once 0.10.1.1 is GA/Tagged as they're UAT and PROD 
environments. 

Maybe its worth we push for getting 0.10.1.1 tagged and released now, without 
waiting for additional fixes, as from what i understand this version is just 
fixes anyhow, then if still issues detected we get a 0.10.1.2 with further hot 
fixes.

On a note it seems 0.10.0.0 doesn't seem according to others to contain this 
issue (we can only confirm 0.9.0.1 doesnt), is there any possible way to 
downgrade from 0.10.1.0 to 0.10.0.0 , is there a doc for this? Obviously all 
docs are for upgrade paths not downgrade.

Cheers
Mike

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> 


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747271#comment-15747271
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

Hi Apurva,

While I await the issue occurring again so I can provide further logs for you,
I have a query on the comment above.

While there is, by the sounds of it, a possible deadlock preventing the ISR
from re-expanding (though some stacks we have captured don't show this), the
more fundamental question is why the ISRs are shrinking in the first place.

Re the 0.10.1.1 RC: unfortunately, in the environments where we see this we
will only be able to deploy it once 0.10.1.1 is GA/tagged, as they are UAT and
PROD environments.

On a side note, according to others 0.10.0.0 doesn't seem to contain this issue
(we can only confirm 0.9.0.1 doesn't). Is there any way to downgrade from
0.10.1.0 to 0.10.0.0, and is there a doc for this? Obviously all docs cover
upgrade paths, not downgrades.

Cheers
Mike

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"
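
The named groups in the regex above were mangled in transit (the angle-bracketed
group names are missing). A hedged Python reconstruction follows; the group names
are partly guessed (only "new_partitions" is mentioned in the surrounding text,
"date" and "old_partitions" are assumptions):

{code}
import re

# Reconstruction of the (mangled) work-around regex; group names are partly guessed.
SHRINK_RE = re.compile(
    r"\[(?P<date>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for "
    r"partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) "
    r"\(kafka\.cluster\.Partition\)"
)

def should_restart(log_line, broker_id):
    m = SHRINK_RE.search(log_line)
    # Restart only when the ISR shrank to nothing but this broker itself.
    return bool(m) and m.group("new_partitions").strip() == str(broker_id)

line = ("[2016-12-01 07:01:28,112] INFO Partition "
        "[com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: "
        "Shrinking ISR for partition "
        "[com_ig_trade_v1_position_event--demo--compacted,10] from 1,2,7 to 7 "
        "(kafka.cluster.Partition)")
print(should_restart(line, 7))  # True
{code}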



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Build failed in Jenkins: kafka-trunk-jdk8 #1102

2016-12-13 Thread Apache Jenkins Server
See 

Changes:

[wangguoz] Separate Streams documentation and setup docs with easy to set 
variables

--
[...truncated 17356 lines...]

org.apache.kafka.streams.KafkaStreamsTest > 
shouldReturnFalseOnCloseWhenThreadsHaventTerminated PASSED

org.apache.kafka.streams.KafkaStreamsTest > 
shouldNotGetAllTasksWithStoreWhenNotRunning STARTED

org.apache.kafka.streams.KafkaStreamsTest > 
shouldNotGetAllTasksWithStoreWhenNotRunning PASSED

org.apache.kafka.streams.KafkaStreamsTest > testCannotStartOnceClosed STARTED

org.apache.kafka.streams.KafkaStreamsTest > testCannotStartOnceClosed PASSED

org.apache.kafka.streams.KafkaStreamsTest > testCleanup STARTED

org.apache.kafka.streams.KafkaStreamsTest > testCleanup PASSED

org.apache.kafka.streams.KafkaStreamsTest > testStartAndClose STARTED

org.apache.kafka.streams.KafkaStreamsTest > testStartAndClose PASSED

org.apache.kafka.streams.KafkaStreamsTest > testCloseIsIdempotent STARTED

org.apache.kafka.streams.KafkaStreamsTest > testCloseIsIdempotent PASSED

org.apache.kafka.streams.KafkaStreamsTest > testCannotCleanupWhileRunning 
STARTED

org.apache.kafka.streams.KafkaStreamsTest > testCannotCleanupWhileRunning PASSED

org.apache.kafka.streams.KafkaStreamsTest > testCannotStartTwice STARTED

org.apache.kafka.streams.KafkaStreamsTest > testCannotStartTwice PASSED

org.apache.kafka.streams.integration.KStreamKTableJoinIntegrationTest > 
shouldCountClicksPerRegion[0] STARTED

org.apache.kafka.streams.integration.KStreamKTableJoinIntegrationTest > 
shouldCountClicksPerRegion[0] PASSED

org.apache.kafka.streams.integration.KStreamKTableJoinIntegrationTest > 
shouldCountClicksPerRegion[1] STARTED

org.apache.kafka.streams.integration.KStreamKTableJoinIntegrationTest > 
shouldCountClicksPerRegion[1] PASSED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldBeAbleToQueryState[0] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldBeAbleToQueryState[0] PASSED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldNotMakeStoreAvailableUntilAllStoresAvailable[0] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldNotMakeStoreAvailableUntilAllStoresAvailable[0] PASSED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
queryOnRebalance[0] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
queryOnRebalance[0] PASSED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
concurrentAccesses[0] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
concurrentAccesses[0] PASSED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldBeAbleToQueryState[1] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldBeAbleToQueryState[1] PASSED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldNotMakeStoreAvailableUntilAllStoresAvailable[1] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldNotMakeStoreAvailableUntilAllStoresAvailable[1] FAILED
java.lang.AssertionError: Condition not met within timeout 3. waiting 
for store count-by-key
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:283)
at 
org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shouldNotMakeStoreAvailableUntilAllStoresAvailable(QueryableStateIntegrationTest.java:491)

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
queryOnRebalance[1] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
queryOnRebalance[1] PASSED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
concurrentAccesses[1] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
concurrentAccesses[1] PASSED

org.apache.kafka.streams.integration.RegexSourceIntegrationTest > 
testShouldReadFromRegexAndNamedTopics STARTED

org.apache.kafka.streams.integration.RegexSourceIntegrationTest > 
testShouldReadFromRegexAndNamedTopics PASSED

org.apache.kafka.streams.integration.RegexSourceIntegrationTest > 
testRegexMatchesTopicsAWhenCreated STARTED

org.apache.kafka.streams.integration.RegexSourceIntegrationTest > 
testRegexMatchesTopicsAWhenCreated PASSED

org.apache.kafka.streams.integration.RegexSourceIntegrationTest > 
testMultipleConsumersCanReadFromPartitionedTopic STARTED

org.apache.kafka.streams.integration.RegexSourceIntegrationTest > 
testMultipleConsumersCanReadFromPartitionedTopic PASSED

org.apache.kafka.streams.integration.RegexSourceIntegrationTest > 
testRegexMatchesTopicsAWhenDeleted STARTED

org.apache.kafka.streams.integration.RegexSourceIntegrationTest > 
testRegexMatchesTopicsAWhenDeleted PASSED

org.apache.kafka.streams.integration.RegexSourceIntegrationTest > 

Re: [DISCUSS] KIP-48 Support for delegation tokens as an authentication mechanism

2016-12-13 Thread Gwen Shapira
Thinking out loud here:

It looks like authentication with a delegation token is going to be
super-cheap, right? We just compare the token to a value in the broker
cache?
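
(Purely as an illustration of the "compare against a cached value" point -- the
actual mechanism in the KIP goes through SASL/SCRAM, and none of these names
come from it -- the broker-side check can be as cheap as a map lookup plus a
constant-time comparison:)

{code}
import hmac

# Hypothetical broker-side cache, kept in sync via ZooKeeper: token ID -> expected HMAC.
token_cache = {"token-123": b"expected-hmac-bytes"}

def authenticate(token_id, presented_hmac):
    expected = token_cache.get(token_id)
    # Miss or mismatch fails; compare_digest keeps the comparison constant-time.
    return expected is not None and hmac.compare_digest(expected, presented_hmac)

print(authenticate("token-123", b"expected-hmac-bytes"))  # True
{code}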

If I understood the KIP correctly, right now it suggests that
authentication happens when establishing the client-broker connection (as
is normal for Kafka). But perhaps we want to consider authenticating every
request with a delegation token (if one exists)?

So a centralized app can create few producers, do the metadata request and
broker discovery with its own user auth, but then use delegation tokens to
allow performing produce/fetch requests as different users? Instead of
having to re-connect for each impersonated user?

This may over-complicate things quite a bit (basically adding extra
information in every request), but maybe it will be useful for
impersonation use-cases (which seem to drive much of the interest in this
KIP)?
Kafka Connect, NiFi and friends can probably use this to share clients
between multiple jobs, tasks, etc.

What do you think?

Gwen

On Tue, Dec 13, 2016 at 12:43 AM, Manikumar 
wrote:

> Ashish,
>
> Thank you for reviewing the KIP.  Please see the replies inline.
>
>
> > 1. How to disable delegation token authentication?
> >
> > This can be achieved in various ways, however I think reusing delegation
> > token secret config for this makes sense here. Avoids creating yet
> another
> > config and forces delegation token users to consciously set the secret.
> If
> > the secret is not set or set to empty string, brokers should turn off
> > delegation token support. This will however require a new error code to
> > indicate delegation token support is turned off on broker.
> >
>
>   Thanks for the suggestion. An option to turn off delegation token
> authentication will be useful.
>   I'll update the KIP.
>
>
> >
> > 2. ACLs on delegation token?
> >
> > Do we need to have ACLs defined for tokens? I do not think it buys us
> > anything, as delegation token can be treated as impersonation of the
> owner.
> > Any thing the owner has permission to do, delegation tokens should be
> > allowed to do as well. If so, we probably won't need to return
> > authorization exception error code while creating delegation token. It
> > however would make sense to check renew and expire requests are coming
> from
> > owner or renewers of the token, but that does not require explicit acls.
> >
>
>
> Yes, We agreed to not have new acl on who can request delegation token.
>  I'll update the KIP.
>
>
> >
> > 3. How to restrict max life time of a token?
> >
> > Admins might want to restrict max life time of tokens created on a
> cluster,
> > and this can very from cluster to cluster based on use-cases. This might
> > warrant a separate broker config.
> >
> >
> Currently we have a "delegation.token.max.lifetime.sec" server config
> property. Maybe we can take min(user-supplied max time, server max time) as
> the max lifetime.
> I am open to adding a new config property.
>
> Few more comments based on recent KIP update.
> >
> > 1. Do we need a separate {{InvalidateTokenRequest}}? Can't we use
> > {{ExpireTokenRequest}} with with expiryDate set to anything before
> current
> > date?
> >
>
> makes sense. we don't need special request to cancel the token. We can use
> ExpireTokenRequest.
> I'll update the KIP.
>
>
> > 2. Can we change time field names to indicate their unit is milliseconds,
> > like, IssueDateMs, ExpiryDateMs, etc.?
> >
> >
>   Done.
>
>
> > 3. Can we allow users to renew a token for a specified amount of time? In
> > current version of KIP, renew request does not take time as a param, not
> > sure what is expiry time set to after renewal.
> >
> >
>  Yes, we need to specify renew period.  I'll update the KIP.
>
>
> Thanks,
> Manikumar
>
>
>
> >
> > On Mon, Dec 12, 2016 at 9:08 AM Manikumar 
> > wrote:
> >
> > > Hi,
> > >
> > >
> > >
> > > I would like to reinitiate the discussion on Delegation token support
> for
> > >
> > > Kafka.
> > >
> > >
> > >
> > > Brief summary of the past discussion:
> > >
> > >
> > >
> > > 1) Broker stores delegation tokens in zookeeper.  All brokers will
> have a
> > >
> > > cache backed by
> > >
> > >zookeeper so they will all get notified whenever a new token is
> > >
> > > generated and they will
> > >
> > >update their local cache whenever token state changes.
> > >
> > > 2) The current proposal does not support rotation of secret
> > >
> > > 3) Only allow the renewal by users that authenticated using *non*
> > >
> > > delegation token mechanism
> > >
> > > 4) KIP-84 proposes to support  SASL SCRAM mechanisms. Kafka clients can
> > >
> > > authenticate using
> > >
> > >SCRAM-SHA-256, providing the delegation token HMAC as password.
> > >
> > >
> > >
> > > Updated the KIP with the following:
> > >
> > > 1. Protocol and Config changes
> > >
> > > 2. format of the data stored in ZK.
> > >
> > > 3. Changes to Java Clients/Usage of SASL SCRAM mechanism
> > 

Re: [DISCUSS] KIP-90 Remove zkClient dependency from Streams

2016-12-13 Thread Neha Narkhede
Makes sense :)

On Tue, Dec 13, 2016 at 5:50 PM Jay Kreps  wrote:

> Ha, least controversial KIP ever. :-)
>
> -Jay
>
> On Tue, Dec 13, 2016 at 10:39 AM, Hojjat Jafarpour 
> wrote:
>
> > Hi all,
> >
> > The following is a KIP for removing zkClient dependency from Streams.
> > Please check out the KIP page:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 90+-+Remove+zkClient+dependency+from+Streams
> >
> > Thanks,
> > --Hojjat
> >
>
-- 
Thanks,
Neha


Re: Website Update, Part 2

2016-12-13 Thread Gwen Shapira
Hi,

Since we are breaking down the docs, we can no longer use ctrl-f to find
specific things we are looking for... maybe it is time to add a site search
bar? I think Google has something we can embed.

On Tue, Dec 13, 2016 at 6:12 PM, Guozhang Wang  wrote:

> Folks,
>
> We are continuing to improve our website, and one part of that is to break
> the single gigantic "documentation" page:
>
> https://kafka.apache.org/documentation/
>
> into sub-spaces and sub-pages for better visibility. As the first step of
> this effort, we will gradually extract each section of this page into a
> separate page and then grow each one of them in its own sub-space.
>
> As of now, we have extracted the Streams section out of the documentation as
>
> https://kafka.apache.org/documentation/streams
>
> while all the existing hashtags are preserved and re-directed via JS (many
> thanks to Derrick!) so that we do not lose any SEO. At the same time I
> have updated the "website doc contributions" wiki a bit with guidance on
> locally displaying and debugging doc changes with this refactoring:
>
> https://cwiki.apache.org/confluence/display/KAFKA/Contributing+Website+
> Documentation+Changes
>
>
> We are trying to do the same for Connect, Ops, Configs, APIs etc in the
> near future. Any comments, improvements, and contributions are welcome and
> encouraged.
>
>
> --
> -- Guozhang
>



-- 
*Gwen Shapira*
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter  | blog



Website Update, Part 2

2016-12-13 Thread Guozhang Wang
Folks,

We are continuing to improve our website, and one part of that is to break
the single gigantic "documentation" page:

https://kafka.apache.org/documentation/

into sub-spaces and sub-pages for better visibility. As the first step of
this effort, we will gradually extract each section of this page into a
separate page and then grow each one of them in its own sub-space.

As of now, we have extracted the Streams section out of the documentation as

https://kafka.apache.org/documentation/streams

while all the existing hashtags are preserved and re-directed via JS (many
thanks to Derrick!) so that we do not lose any SEO. At the same time I
have updated the "website doc contributions" wiki a bit with guidance on
locally displaying and debugging doc changes with this refactoring:

https://cwiki.apache.org/confluence/display/KAFKA/Contributing+Website+Documentation+Changes


We are trying to do the same for Connect, Ops, Configs, APIs etc in the
near future. Any comments, improvements, and contributions are welcome and
encouraged.


-- 
-- Guozhang


Re: [DISCUSS] KIP-95: Incremental Batch Processing for Kafka Streams

2016-12-13 Thread Matthias J. Sax
Avi,

thanks for your feedback. We want to enlarge the scope of Streams
applications and have started to collect use cases in the wiki:

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Data+%28Re%29Processing+Scenarios

Feel free to add to it by editing the page or writing a comment.


-Matthias

On 12/13/16 3:30 PM, Avi Flax wrote:
> On 2016-11-28 13:47 (-0500), "Matthias J. Sax"  wrote: 
>>
>> I want to start a discussion about KIP-95:
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-95%3A+Incremental+Batch+Processing+for+Kafka+Streams
>>
>> Looking forward to your feedback.
> 
> Hi Matthias,
> 
> I’d just like to share some feedback on this proposal, from the perspective
> of a Kafka user rather than developer:
> 
> * Overall, makes a ton of sense and I think it’ll be very useful and will
>   open up a bunch of additional use cases to Kafka Streams
> 
> * Two use cases that are of particular interest to me:
> 
> * Just last week I created a simple Kafka Streams app (I called it a
>   “script”) to copy certain records from one topic over to another,
>   with filtering. I ran the app/script until it reached the end of
>   the topic, then I manually shut it down with ctrl-c. This worked 
>   fine, but it would have been an even better UX to have specified
>   the config value `autostop.at` as `eol` and have the process stop
>   itself at the desired point. That would have required less manual
>   monitoring on my part.
> 
> * With this new mode I might be able to run Streams apps on AWS Lambda.
>   I’ve been super-excited about Lambda and similar FaaS services since
>   their early days, and I’ve been itching to run Kafka Streams apps on
>   Lambda since I started using Streams in April or May. Unfortunately,
>   Lambda functions are limited to 5 minutes per invocation — after 5
>   minutes they’re killed. I’m not sure, but I wonder if perhaps this new
>   autostop.at feature could make it more practical to run a Streams
>   app on Lambda - the feature seems like it could potentially be adapted
>   to enable a Streams app to be more generally resilient to being
>   frequently stopped and started.
> 
> I look forward to seeing progress on this enhancement!
> 
> Thanks,
> Avi
> 
> 
> Software Architect @ Park Assist
> We’re hiring! http://tech.parkassist.com/jobs/
> 
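
(For Avi's first use case -- a filtered topic copy that stops at end-of-log,
which is roughly what autostop.at=eol would automate -- here is a minimal,
hedged sketch using kafka-python rather than Kafka Streams; the topic names and
the filter are hypothetical:)

{code}
from kafka import KafkaConsumer, KafkaProducer, TopicPartition

SOURCE, TARGET = "source-topic", "target-topic"  # hypothetical topics

def keep(record):
    # Stand-in for whatever filtering the copy "script" applies.
    return record.value is not None and record.value.startswith(b"interesting")

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

partitions = [TopicPartition(SOURCE, p) for p in consumer.partitions_for_topic(SOURCE)]
consumer.assign(partitions)
consumer.seek_to_beginning(*partitions)
end_offsets = consumer.end_offsets(partitions)  # snapshot "end of log" at startup

remaining = set(partitions)
while remaining:
    for tp, records in consumer.poll(timeout_ms=1000).items():
        for record in records:
            if keep(record):
                producer.send(TARGET, record.value, key=record.key)
    # Stop once every partition has been read up to the snapshotted end offset.
    remaining = {tp for tp in remaining if consumer.position(tp) < end_offsets[tp]}

producer.flush()
{code}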





[GitHub] kafka pull request #2245: Separate Streams documentation and setup docs with...

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2245


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [DISCUSS] KIP-82 - Add Record Headers

2016-12-13 Thread Matthias J. Sax
Hi,

I want to add a completely new angle to this discussion. For this, I
want to propose an extension for the headers feature that enables new
uses cases -- and those new use cases might convince people to support
headers (of course including the larger scoped proposal).

Extended Proposal:

Allow messages with a certain header key to be special "control
messages" (w/ o w/o payload) that are not exposed to an application via
.poll().

Thus, a consumer client would automatically skip over those messages. If
an application knows about embedded control messages, it can "sign up"
for those messages with the consumer client and either get a callback, or
the consumer's auto-drop for these messages gets disabled (allowing the
application to consume those messages via poll()).

(The details need further considerations/discussion. I just want to
sketch the main idea.)
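
(To make the idea concrete, a minimal application-level approximation in Python
with kafka-python -- in the actual proposal the skipping would live inside the
consumer client, record headers are assumed to exist as KIP-82 proposes, and
the reserved key name is hypothetical:)

{code}
from kafka import KafkaConsumer  # kafka-python; illustration only

CONTROL_KEY = "__control"  # hypothetical reserved metadata key for control messages

def handle(record):
    print(record.value)  # stand-in for normal application processing

consumer = KafkaConsumer("shared-topic", bootstrap_servers="localhost:9092")
for record in consumer:
    headers = dict(record.headers or [])
    if CONTROL_KEY in headers:
        # An application that "signed up" for control messages would get a callback
        # here; every other application silently drops them before poll() returns.
        continue
    handle(record)  # regular business records only
{code}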

Usage:

There is a shared topic (ie, used by multiple applications) and a
producer application wants to embed a special message in the topic for a
dedicated consumer application. Because only one application will
understand this message, it cannot be a regular message as this would
break all applications that do not understand this message. The producer
application would set a special metadata key and no consumer application
would see this control message by default because they did not enable
their consumer client to return this message in poll() (and the client
would just drop any message with the special metadata key). Only the single
application that should receive this message will subscribe to it on its
consumer client and process it.


Concrete Use Case: Kafka Streams

In Kafka Streams, we would like to propagate "control messages" from
subtopology to subtopology. There are multiple scenarios for which this
would be useful. For example, currently we do not guarantee a
"consistent shutdown" of an application. By this, I mean that input
records might not be completely processed by the whole topology because
the application shutdown happens "in between" and an intermediate result
gets "stuck" in an intermediate topic. Thus, a user would see a
committed offset for the source topic of the application, but no
corresponding result record in the output topic.

Having "shutdown markers" would allow us, to first stop the upstream
subtopology and write this marker into the intermediate topic and the
downstream subtopology would only shut down itself after is sees the
"shutdown marker". Thus, we can guarantee on shutdown, that no
"in-flight" messages got stuck in intermediate topics.


A similar usage would be for KIP-95 (Incremental Batch Processing).
There was a discussion about the proposed metadata topic, and we could
avoid this metadata topic if we would have "control messages".


Right now, we cannot insert an "application control message" because
Kafka Streams does not own all the topics it reads/writes and thus might
break other consumer applications (as described above) if we inject
random messages that are not understood by other apps.


Of course, one can work around "embedded control messages" by using an
additional topic to propagate control messages between applications (as
suggested in KIP-95 via a metadata topic for Kafka Streams). But there
are major concerns about adding this metadata topic in the KIP, and this
shows that other applications that need a similar pattern might profit
from topic-embedded "control messages", too.


One last important consideration: those "control messages" are used for
client-to-client communication and are not understood by the broker.
Thus, those messages should not be part of the message format
(c.f. tombstone flag -- KIP-87). However, "client land" record headers
would be a nice way to implement them. Because KIP-82 did consider key
namespaces for metadata keys, this extension should not be its own KIP
but should be included in KIP-82 to reserve a namespace for "control
messages" in the first place.


Sorry for the long email... Looking forward to your feedback.


-Matthias









On 12/8/16 12:12 AM, Michael Pearce wrote:
> Hi Jun
> 
> 100) Each time a transaction exits a JVM for a remote system (HTTP/JMS/ 
> hopefully one day Kafka) the APM tools stitch in a unique id (though I believe 
> it contains the end2end uuid embedded in this id). On receiving the message 
> at the receiving JVM the APM code takes this out and continues its tracing 
> on that new thread. Both JVMs (and other languages the APM tool 
> supports) send this data async back to the central controllers where the 
> stitching together occurs. For this they need some header space in which to 
> put this id.
> 
> 101) Yes, indeed we have a business transaction id in the payload. This, though, 
> is system-level tracing that we need to marry up with it. Also, as per the note on 
> end2end encryption, we’d be unable to prove the flow if the payload is 
> encrypted, as we’d not have access to this at certain points of the flow 
> through the 

[jira] [Commented] (KAFKA-4505) Cannot get topic lag since kafka upgrade from 0.8.1.0 to 0.10.1.0

2016-12-13 Thread huxi (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746918#comment-15746918
 ] 

huxi commented on KAFKA-4505:
-

I don't think it's a good solution. The path you created in Zookeeper should 
have been an ephemeral znode, which is used to dynamically mark the ownership 
of a consumer instance; it would not be able to do this if it were a 
persistent znode. 

Could you help confirm this: 
1. First remove the znodes you manually created, via rmr 
/consumers/log_level_export2_group_deduplicating/ids and rmr 
/consumers/log_level_export2_group_deduplicating/owners
2. Run the old consumer with the same group id 
'log_level_export2_group_deduplicating' and keep it running
3. Check whether ids and owners have been created under 
/consumers/log_level_export2_group_deduplicating
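
Roughly, those checks could be run with the zookeeper-shell tool that ships
with Kafka (host and port below are placeholders):

bin/zookeeper-shell.sh zk-host:2181 rmr /consumers/log_level_export2_group_deduplicating/ids
bin/zookeeper-shell.sh zk-host:2181 rmr /consumers/log_level_export2_group_deduplicating/owners
# ...start the old consumer with that group id, then verify the znodes reappear:
bin/zookeeper-shell.sh zk-host:2181 ls /consumers/log_level_export2_group_deduplicating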

> Cannot get topic lag since kafka upgrade from 0.8.1.0 to 0.10.1.0
> -
>
> Key: KAFKA-4505
> URL: https://issues.apache.org/jira/browse/KAFKA-4505
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, metrics, offset manager
>Affects Versions: 0.10.1.0
>Reporter: Romaric Parmentier
>Priority: Critical
>
> We were using kafka 0.8.1.1 and we just migrated to version 0.10.1.0.
> Since we migrated we have been using the new script kafka-consumer-groups.sh to 
> retrieve topic lags, but it doesn't seem to work anymore. 
> Because the application is using the 0.8 driver we have added the following 
> conf to each kafka server:
> log.message.format.version=0.8.2
> inter.broker.protocol.version=0.10.0.0
> When I'm using the option --list with kafka-consumer-groups.sh I can see 
> every consumer group I'm using, but the --describe is not working:
> /usr/share/kafka$ bin/kafka-consumer-groups.sh --zookeeper ip:2181 --describe 
> --group group_name
> No topic available for consumer group provided
> GROUP  TOPIC  PARTITION  
> CURRENT-OFFSET  LOG-END-OFFSET  LAG OWNER
> When I'm looking into zookeeper I can see the offset increasing for this 
> consumer group.
> Any idea ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka-site issue #17: Update quickstart guides with a note to refer to the c...

2016-12-13 Thread gwenshap
Github user gwenshap commented on the issue:

https://github.com/apache/kafka-site/pull/17
  
This PR is slightly older and I think we added the warning while re-working 
the site based on this PR (Thanks @cptcanuck!).
I think it can be closed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [DISCUSS] KIP-90 Remove zkClient dependency from Streams

2016-12-13 Thread Jay Kreps
Ha, least controversial KIP ever. :-)

-Jay

On Tue, Dec 13, 2016 at 10:39 AM, Hojjat Jafarpour 
wrote:

> Hi all,
>
> The following is a KIP for removing zkClient dependency from Streams.
> Please check out the KIP page:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 90+-+Remove+zkClient+dependency+from+Streams
>
> Thanks,
> --Hojjat
>


[jira] [Commented] (KAFKA-1696) Kafka should be able to generate Hadoop delegation tokens

2016-12-13 Thread Ashish K Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746886#comment-15746886
 ] 

Ashish K Singh commented on KAFKA-1696:
---

[~omkreddy] as I indicated on the discuss list, I have made some progress along 
these lines and put together a POC. I would love it if my work can help us speed up 
here. Do you think it is OK if I propose some sub-JIRAs and add PRs over there? 
There is a lot still left to do, but I can quickly throw out some initial pieces 
that will help us get started. Makes sense?

> Kafka should be able to generate Hadoop delegation tokens
> -
>
> Key: KAFKA-1696
> URL: https://issues.apache.org/jira/browse/KAFKA-1696
> Project: Kafka
>  Issue Type: Sub-task
>  Components: security
>Reporter: Jay Kreps
>Assignee: Parth Brahmbhatt
>
> For access from MapReduce/etc jobs run on behalf of a user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka-site issue #36: Streams standalone docs

2016-12-13 Thread guozhangwang
Github user guozhangwang commented on the issue:

https://github.com/apache/kafka-site/pull/36
  
Merged, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] kafka-site pull request #36: Streams standalone docs

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka-site/pull/36


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-4030) Update older quickstart documents to clarify which version they relate to

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746872#comment-15746872
 ] 

ASF GitHub Bot commented on KAFKA-4030:
---

Github user guozhangwang commented on the issue:

https://github.com/apache/kafka-site/pull/17
  
@cptcanuck Could you rebase this PR then I will merge as is? Thanks. Also 
could you update the PR title as

```
KAFKA-4030: Update
```

since it will be used as the commit message.


> Update older quickstart documents to clarify which version they relate to
> -
>
> Key: KAFKA-4030
> URL: https://issues.apache.org/jira/browse/KAFKA-4030
> Project: Kafka
>  Issue Type: Improvement
>  Components: website
>Reporter: Todd Snyder
>  Labels: documentation, website
>
> If you search for 'kafka quickstart' it takes you to 
> kafka.apache.org/07/quickstart.html which is, unclearly, for release 0.7 and 
> not the current release.
> [~gwenshap] suggested a ticket and a note added to the 0.7 (and likely 0.8 
> and 0.9) quickstart guides directing people to use ~current for the latest 
> release documentation.
> I'll submit a PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka-site issue #17: Update quickstart guides with a note to refer to the c...

2016-12-13 Thread guozhangwang
Github user guozhangwang commented on the issue:

https://github.com/apache/kafka-site/pull/17
  
@cptcanuck Could you rebase this PR then I will merge as is? Thanks. Also 
could you update the PR title as

```
KAFKA-4030: Update
```

since it will be used as the commit message.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] kafka-site issue #37: Add new meetup link on events page

2016-12-13 Thread guozhangwang
Github user guozhangwang commented on the issue:

https://github.com/apache/kafka-site/pull/37
  
Already reflected on the web site: https://kafka.apache.org/events cheers.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] kafka-site pull request #37: Add new meetup link on events page

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka-site/pull/37


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] kafka-site issue #37: Add new meetup link on events page

2016-12-13 Thread guozhangwang
Github user guozhangwang commented on the issue:

https://github.com/apache/kafka-site/pull/37
  
Great to hear about meetup in Japan! Merged this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] kafka-site issue #34: Fix typo on introduction page

2016-12-13 Thread guozhangwang
Github user guozhangwang commented on the issue:

https://github.com/apache/kafka-site/pull/34
  
Thanks @ashishg-qburst , could you file the PR against kafka repo itself on 
this file: https://github.com/apache/kafka/blob/trunk/docs/introduction.html

Since it will be copied to kafka-site upon release.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] kafka-site issue #33: Update contact.html

2016-12-13 Thread guozhangwang
Github user guozhangwang commented on the issue:

https://github.com/apache/kafka-site/pull/33
  
Thanks @dossett , merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] kafka-site pull request #33: Update contact.html

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka-site/pull/33


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] kafka-site issue #23: Grammar updated

2016-12-13 Thread guozhangwang
Github user guozhangwang commented on the issue:

https://github.com/apache/kafka-site/pull/23
  
Thanks @muehlburger , merged. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] kafka-site pull request #23: Grammar updated

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka-site/pull/23


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] kafka-site issue #18: Implementation: Clean-up invalid HTML

2016-12-13 Thread guozhangwang
Github user guozhangwang commented on the issue:

https://github.com/apache/kafka-site/pull/18
  
Thanks @epeay , could you file the PR against 
https://github.com/apache/kafka/blob/trunk/docs/implementation.html instead 
since we will periodically move the docs when doing the release.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-3869) Fix streams example

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746808#comment-15746808
 ] 

ASF GitHub Bot commented on KAFKA-3869:
---

Github user sakamotomsh closed the pull request at:

https://github.com/apache/kafka/pull/1524


> Fix streams example
> ---
>
> Key: KAFKA-3869
> URL: https://issues.apache.org/jira/browse/KAFKA-3869
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 0.10.0.0
>Reporter: Masahiko Sakamoto
>Priority: Minor
>
> kafka streams wordcount example is outdated because KStream has no method 
> 'groupByKey', to be fixed by using 'countByKey'.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request #1524: KAFKA-3869 Fix streams example

2016-12-13 Thread sakamotomsh
Github user sakamotomsh closed the pull request at:

https://github.com/apache/kafka/pull/1524


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Closed] (KAFKA-3869) Fix streams example

2016-12-13 Thread Masahiko Sakamoto (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-3869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masahiko Sakamoto closed KAFKA-3869.


> Fix streams example
> ---
>
> Key: KAFKA-3869
> URL: https://issues.apache.org/jira/browse/KAFKA-3869
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 0.10.0.0
>Reporter: Masahiko Sakamoto
>Priority: Minor
>
> kafka streams wordcount example is outdated because KStream has no method 
> 'groupByKey', to be fixed by using 'countByKey'.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746799#comment-15746799
 ] 

Apurva Mehta commented on KAFKA-4477:
-

[~tdevoe], thanks for sharing all your extended broker logs, as well as the 
controller and state change logs. 

I have a few questions: 

# The original description in the ticket states that the problem node reduces 
the ISR to itself, and then doesn't recover. In the logs you shared, the 
problem node 1002 does shrink its ISRs to itself, but then the ISR begins to 
expand back to the original set only 2 seconds later. The broker log for node 
1002 also shows connections from the other replicas coming in. We can tell 
since the SASL handshake is being logged. The strange bit is that nodes 1001 
and 1003, however, can't seem to connect until 2130, which brings me to 
my next point.
# Did you bounce the hosts at 2130? If not, when were the hosts bounced?
# We have fixed some deadlock bugs where the ISR shrinks to a single node but 
expands back again. Given the observation in point 1, it may be worth trying the 
0.10.1.1 RC to see if you can reproduce this problem when using that code. If 
it reproduces, then we know for certain that the existing deadlocks are not the 
issue.
# Another suspicion we have is the changes to the `NetworkClientBlockingOps` 
code. However, this code does not have any logging. If you try the RC and 
still hit the issue, would you be willing to deploy a version of 0.10.1 with 
some instrumentation around the network client code? This would enable us to 
validate or disprove our hypothesis.

Thanks,
Apurva

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Kafka ACL's with SSL Protocol is not working

2016-12-13 Thread Raghu B
Hi All,

I am trying to enable ACLs in my Kafka cluster along with the SSL
protocol.

I tried each and every parameter but no luck, so I need help enabling
ACLs with SSL (without Kerberos). I am attaching all the configuration
details below.

Kindly Help me.


I tested SSL without ACL, it worked fine (listeners=SSL://10.247.195.122:9093).


This is my Kafka server properties file:

# ACL SETTINGS #
auto.create.topics.enable=true
authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer
security.inter.broker.protocol=SSL
#allow.everyone.if.no.acl.found=true
#principal.builder.class=CustomizedPrincipalBuilderClass
#super.users=User:"CN=writeuser,OU=Unknown,O=Unknown,L=Unknown,ST=Unknown,C=Unknown"
#super.users=User:Raghu;User:Admin
#offsets.storage=kafka
#dual.commit.enabled=true
listeners=SSL://10.247.195.122:9093
#listeners=PLAINTEXT://10.247.195.122:9092
#listeners=PLAINTEXT://10.247.195.122:9092,SSL://10.247.195.122:9093
#advertised.listeners=PLAINTEXT://10.247.195.122:9092
ssl.keystore.location=/home/raghu/kafka/security/server.keystore.jks
ssl.keystore.password=123456
ssl.key.password=123456
ssl.truststore.location=/home/raghu/kafka/security/server.truststore.jks
ssl.truststore.password=123456



Set the ACL from Authorizer CLI:

> bin/kafka-acls.sh --authorizer-properties zookeeper.connect=10.247.195.122:2181 --list --topic ssltopic

Current ACLs for resource `Topic:ssltopic`:

  User:CN=writeuser, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=Unknown has Allow permission for operations: Write from hosts: *


XXXWMXXX-7:kafka_2.11-0.10.1.0 rbaddam$ bin/kafka-console-producer.sh --broker-list 10.247.195.122:9093 --topic ssltopic --producer.config client-ssl.properties

[2016-12-13 14:53:45,839] WARN Error while fetching metadata with correlation id 0 : {ssltopic=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)

[2016-12-13 14:53:45,984] WARN Error while fetching metadata with correlation id 1 : {ssltopic=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)


XXXWMXXX-7:kafka_2.11-0.10.1.0 rbaddam$ cat client-ssl.properties

#group.id=sslgroup
security.protocol=SSL
ssl.truststore.location=/Users/rbaddam/Desktop/Dev/kafka_2.11-0.10.1.0/ssl/client.truststore.jks
ssl.truststore.password=123456
#Configure below if you use client auth
ssl.keystore.location=/Users/rbaddam/Desktop/Dev/kafka_2.11-0.10.1.0/ssl/client.keystore.jks
ssl.keystore.password=123456
ssl.key.password=123456


XXXWMXXX-7:kafka_2.11-0.10.1.0 rbaddam$ bin/kafka-console-consumer.sh --bootstrap-server 10.247.195.122:9093 --new-consumer --consumer.config client-ssl.properties --topic ssltopic --from-beginning

[2016-12-13 14:53:28,817] WARN Error while fetching metadata with correlation id 1 : {ssltopic=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)

[2016-12-13 14:53:28,819] ERROR Unknown error when running consumer: (kafka.tools.ConsoleConsumer$)

org.apache.kafka.common.errors.GroupAuthorizationException: Not authorized to access group: console-consumer-52826


Thanks in advance,

Raghu - raghu98...@gmail.com
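
The GroupAuthorizationException above usually means the client certificate's
principal has no Read ACL on the consumer group (the listing above only shows a
Write ACL on the topic). A rough, untested sketch of the missing grants,
reusing the principal and topic from this message (the console consumer's group
name changes per run, hence the wildcard):

bin/kafka-acls.sh --authorizer-properties zookeeper.connect=10.247.195.122:2181 --add \
  --allow-principal "User:CN=writeuser,OU=Unknown,O=Unknown,L=Unknown,ST=Unknown,C=Unknown" \
  --operation Read --topic ssltopic

bin/kafka-acls.sh --authorizer-properties zookeeper.connect=10.247.195.122:2181 --add \
  --allow-principal "User:CN=writeuser,OU=Unknown,O=Unknown,L=Unknown,ST=Unknown,C=Unknown" \
  --operation Read --group '*'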


Re: [DISCUSS] KIP-95: Incremental Batch Processing for Kafka Streams

2016-12-13 Thread Avi Flax
On 2016-11-28 13:47 (-0500), "Matthias J. Sax"  wrote: 
>
> I want to start a discussion about KIP-95:
> 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-95%3A+Incremental+Batch+Processing+for+Kafka+Streams
> 
> Looking forward to your feedback.

Hi Matthias,

I’d just like to share some feedback on this proposal, from the perspective
of a Kafka user rather than developer:

* Overall, makes a ton of sense and I think it’ll be very useful and will
  open up a bunch of additional use cases to Kafka Streams

* Two use cases that are of particular interest to me:

* Just last week I created a simple Kafka Streams app (I called it a
  “script”) to copy certain records from one topic over to another,
  with filtering. I ran the app/script until it reached the end of
  the topic, then I manually shut it down with ctrl-c. This worked 
  fine, but it would have been an even better UX to have specified
  the config value `autostop.at` as `eol` and have the process stop
  itself at the desired point. That would have required less manual
  monitoring on my part (see the sketch just after this list).

* With this new mode I might be able to run Streams apps on AWS Lambda.
  I’ve been super-excited about Lambda and similar FaaS services since
  their early days, and I’ve been itching to run Kafka Streams apps on
  Lambda ever since I started using Streams in April or May. Unfortunately,
  Lambda functions are limited to 5 minutes per invocation — after 5
  minutes they’re killed. I’m not sure, but I wonder if perhaps this new
  autostop.at feature could make it more practical to run a Streams
  app on Lambda - the feature seems like it could potentially be adapted
  to enable a Streams app to be more generally resilient to being
  frequently stopped and started.
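
To make the first use case above concrete, here is a minimal sketch of what
such a setup might look like. KIP-95 is still a proposal, so the `autostop.at`
property and its `eol` value are assumptions taken from the KIP discussion,
not a released Streams API:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class CopyScriptConfig {
    public static Properties props() {
        Properties p = new Properties();
        p.put(StreamsConfig.APPLICATION_ID_CONFIG, "copy-filter-script");
        p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Hypothetical KIP-95 setting: stop once all input topics are fully
        // consumed (end of log), instead of requiring a manual ctrl-c.
        p.put("autostop.at", "eol");
        return p;
    }
}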

I look forward to seeing progress on this enhancement!

Thanks,
Avi


Software Architect @ Park Assist
We’re hiring! http://tech.parkassist.com/jobs/

Jenkins build is back to normal : kafka-trunk-jdk8 #1101

2016-12-13 Thread Apache Jenkins Server
See 



[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Lorand Peter Kasler (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746479#comment-15746479
 ] 

Lorand Peter Kasler commented on KAFKA-4477:


Yes, we recently upgraded and started getting these weird issues, approximately 
once or twice a week. 
The previous version was: 0.10.0.0

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746466#comment-15746466
 ] 

Ismael Juma commented on KAFKA-4477:


One question for the others that have reported this issue: have you upgraded to 
0.10.1.0 and starting seeing these issues (like Michael)? And if so, which 
version of Kafka were you running before the upgrade?

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Jenkins build is back to normal : kafka-trunk-jdk7 #1753

2016-12-13 Thread Apache Jenkins Server
See 



[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746402#comment-15746402
 ] 

Tom DeVoe commented on KAFKA-4477:
--

I was able to telnet from the controller node to the kafka port on node 1002 
while it was saying it was unable to connect, so the network seemed to be fine at 
the time.

From the logs it seems node 1002 was moving along as though it was behaving 
normally (except with all of its ISRs shrunk).

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746403#comment-15746403
 ] 

Tom DeVoe commented on KAFKA-4477:
--

I was able to telnet from the controller node to the kafka port on node 1002 
while it was saying it was unable to connect, so the network seemed to be fine at 
the time.

From the logs it seems node 1002 was moving along as though it was behaving 
normally (except with all of its ISRs shrunk).

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom DeVoe updated KAFKA-4477:
-
Comment: was deleted

(was: I was able to telnet from the controller node to the kafka port on node 
1002 while it was saying it was unable to connect, so the network seemed to be 
fine at the time.

From the logs it seems node 1002 was moving along as though it was behaving 
normally (except with all of its ISRs shrunk))

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746393#comment-15746393
 ] 

Jun Rao commented on KAFKA-4477:


[~tdevoe], from the controller log, starting from 19:59:13, the controller 
keeps failing to connect to broker 1002. Was 1002 up at that point?

[2016-11-28 19:59:13,140] WARN [Controller-1003-to-broker-1002-send-thread], 
Controller 1003 epoch 23 fails to send request 
{controller_id=1003,controller_epoch=23,partition_states=[{topic=__consumer_offsets,partition=18,controller_epoch=22,leader=1002,leader_epoch=25,isr=[1002],zk_version=73,replicas=[1002,1001,1003]},{topic=__consumer_offsets,partition=45,controller_epoch=22,leader=1002,leader_epoch=25,isr=[1002],zk_version=68,replicas=[1002,1003,1001]},{topic=topic_23,partition=0,controller_epoch=22,leader=1002,leader_epoch=10,isr=[1002],zk_version=30,replicas=[1002,1003,1001]},{topic=topic_22,partition=2,controller_epoch=22,leader=1002,leader_epoch=10,isr=[1002],zk_version=26,replicas=[1002,1003,1001]},{topic=topic_3,partition=0,controller_epoch=22,leader=1002,leader_epoch=12,isr=[1002],zk_version=33,replicas=[1002,1001,1003]},{topic=__consumer_offsets,partition=36,controller_epoch=22,leader=1002,leader_epoch=25,isr=[1002],zk_version=72,replicas=[1002,1001,1003]},{topic=connect-offsets,partition=13,controller_epoch=22,leader=1002,leader_epoch=12,isr=[1002],zk_version=37,replicas=[1002,1001,1003]}],live_brokers=[{id=1003,end_points=[{port=9092,host=node_1003,security_protocol_type=2}],rack=null},{id=1002,end_points=[{port=9092,host=node_1002,security_protocol_type=2}],rack=null},{id=1001,end_points=[{port=9092,host=node_1001,security_protocol_type=2}],rack=null}]}
 to broker node_1002:9092 (id: 1002 rack: null). Reconnecting to broker. 
(kafka.controller.RequestSendThread)
java.io.IOException: Connection to 1002 was disconnected before the response 
was read
at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
at 
kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
at 
kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
at 
kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at 
kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:190)
at 
kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:181)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)


> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any 

[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Lorand Peter Kasler (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746368#comment-15746368
 ] 

Lorand Peter Kasler commented on KAFKA-4477:


We had encountered the same situation (even including the constantly growing file 
handle usage mentioned earlier), and the follower nodes were continuously trying 
to connect (returning, every time, the same error that has been posted). 
Also, the (possibly deadlocked) leader node refused to export any metrics, but 
before the incident its request latency wasn't at its peak and it wasn't the 
highest in the cluster. 

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-90 Remove zkClient dependency from Streams

2016-12-13 Thread Ismael Juma
Thanks for the KIP, Hojjat. It will be great for Streams apps not to
require ZK access.

Ismael

On Tue, Dec 13, 2016 at 10:39 AM, Hojjat Jafarpour 
wrote:

> Hi all,
>
> The following is a KIP for removing zkClient dependency from Streams.
> Please check out the KIP page:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 90+-+Remove+zkClient+dependency+from+Streams
>
> Thanks,
> --Hojjat
>


Re: [DISCUSS] KIP-90 Remove zkClient dependency from Streams

2016-12-13 Thread Gwen Shapira
Great idea, go for it :)

On Tue, Dec 13, 2016 at 10:39 AM, Hojjat Jafarpour 
wrote:

> Hi all,
>
> The following is a KIP for removing zkClient dependency from Streams.
> Please check out the KIP page:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 90+-+Remove+zkClient+dependency+from+Streams
>
> Thanks,
> --Hojjat
>



-- 
*Gwen Shapira*
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter  | blog



[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746249#comment-15746249
 ] 

ASF GitHub Bot commented on KAFKA-4497:
---

Github user becketqin closed the pull request at:

https://github.com/apache/kafka/pull/2242


> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
> Fix For: 0.10.1.1
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because it finds messages which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean it does not find any log segments which 
> can be cleaned up, or that the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
> at kafka.log.LogSegment.append(LogSegment.scala:106)
> at kafka.log.Cleaner.cleanInto(LogCleaner.scala:518)
> at 
> kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:404)
> at 
> kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:400)
> at 
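For readers unfamiliar with the time index, a hedged sketch of the invariant that the exception above enforces. This is illustrative only, not the actual kafka.log.TimeIndex implementation: an entry may only be appended if its offset is strictly larger than the last appended offset, so an offset of -1 arriving after 11832 is rejected.

{code}
// Illustrative sketch of the append invariant, not Kafka's TimeIndex code.
final class SimpleTimeIndex {
    private long lastOffset = -1L;
    private long lastTimestamp = -1L;

    void maybeAppend(long timestamp, long offset) {
        // The real index throws InvalidOffsetException here; this sketch mirrors the same check.
        if (offset <= lastOffset) {
            throw new IllegalStateException("Attempt to append an offset (" + offset
                + ") no larger than the last offset appended (" + lastOffset + ")");
        }
        lastOffset = offset;
        lastTimestamp = Math.max(lastTimestamp, timestamp);
    }
}
{code}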

[GitHub] kafka pull request #2242: KAFKA-4497: Fix the ByteBufferMessageSet.filterInt...

2016-12-13 Thread becketqin
Github user becketqin closed the pull request at:

https://github.com/apache/kafka/pull/2242


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Build failed in Jenkins: kafka-trunk-jdk7 #1752

2016-12-13 Thread Apache Jenkins Server
See 

Changes:

[wangguoz] MINOR: Fix Streams examples in documentation

--
[...truncated 6277 lines...]

kafka.api.PlaintextProducerSendTest > testSendOffset STARTED

kafka.api.PlaintextProducerSendTest > testSendOffset PASSED

kafka.api.PlaintextProducerSendTest > testSendCompressedMessageWithCreateTime 
STARTED

kafka.api.PlaintextProducerSendTest > testSendCompressedMessageWithCreateTime 
PASSED

kafka.api.PlaintextProducerSendTest > testCloseWithZeroTimeoutFromCallerThread 
STARTED

kafka.api.PlaintextProducerSendTest > testCloseWithZeroTimeoutFromCallerThread 
PASSED

kafka.api.PlaintextProducerSendTest > testCloseWithZeroTimeoutFromSenderThread 
STARTED

kafka.api.PlaintextProducerSendTest > testCloseWithZeroTimeoutFromSenderThread 
PASSED

kafka.api.PlaintextProducerSendTest > testSendBeforeAndAfterPartitionExpansion 
STARTED

kafka.api.PlaintextProducerSendTest > testSendBeforeAndAfterPartitionExpansion 
PASSED

kafka.api.PlaintextConsumerTest > testEarliestOrLatestOffsets STARTED

kafka.api.PlaintextConsumerTest > testEarliestOrLatestOffsets PASSED

kafka.api.PlaintextConsumerTest > testPartitionsForAutoCreate STARTED

kafka.api.PlaintextConsumerTest > testPartitionsForAutoCreate PASSED

kafka.api.PlaintextConsumerTest > testShrinkingTopicSubscriptions STARTED

kafka.api.PlaintextConsumerTest > testShrinkingTopicSubscriptions PASSED

kafka.api.PlaintextConsumerTest > testMaxPollIntervalMs STARTED

kafka.api.PlaintextConsumerTest > testMaxPollIntervalMs PASSED

kafka.api.PlaintextConsumerTest > testOffsetsForTimes STARTED

kafka.api.PlaintextConsumerTest > testOffsetsForTimes PASSED

kafka.api.PlaintextConsumerTest > testSubsequentPatternSubscription STARTED

kafka.api.PlaintextConsumerTest > testSubsequentPatternSubscription PASSED

kafka.api.PlaintextConsumerTest > testAsyncCommit STARTED

kafka.api.PlaintextConsumerTest > testAsyncCommit PASSED

kafka.api.PlaintextConsumerTest > testLowMaxFetchSizeForRequestAndPartition 
STARTED

kafka.api.PlaintextConsumerTest > testLowMaxFetchSizeForRequestAndPartition 
PASSED

kafka.api.PlaintextConsumerTest > testMultiConsumerSessionTimeoutOnStopPolling 
STARTED

kafka.api.PlaintextConsumerTest > testMultiConsumerSessionTimeoutOnStopPolling 
PASSED

kafka.api.PlaintextConsumerTest > testMaxPollIntervalMsDelayInRevocation STARTED

kafka.api.PlaintextConsumerTest > testMaxPollIntervalMsDelayInRevocation PASSED

kafka.api.PlaintextConsumerTest > testPartitionsForInvalidTopic STARTED

kafka.api.PlaintextConsumerTest > testPartitionsForInvalidTopic PASSED

kafka.api.PlaintextConsumerTest > testPauseStateNotPreservedByRebalance STARTED

kafka.api.PlaintextConsumerTest > testPauseStateNotPreservedByRebalance PASSED

kafka.api.PlaintextConsumerTest > 
testFetchHonoursFetchSizeIfLargeRecordNotFirst STARTED

kafka.api.PlaintextConsumerTest > 
testFetchHonoursFetchSizeIfLargeRecordNotFirst PASSED

kafka.api.PlaintextConsumerTest > testSeek STARTED

kafka.api.PlaintextConsumerTest > testSeek PASSED

kafka.api.PlaintextConsumerTest > testPositionAndCommit STARTED

kafka.api.PlaintextConsumerTest > testPositionAndCommit PASSED

kafka.api.PlaintextConsumerTest > 
testFetchRecordLargerThanMaxPartitionFetchBytes STARTED

kafka.api.PlaintextConsumerTest > 
testFetchRecordLargerThanMaxPartitionFetchBytes PASSED

kafka.api.PlaintextConsumerTest > testUnsubscribeTopic STARTED

kafka.api.PlaintextConsumerTest > testUnsubscribeTopic PASSED

kafka.api.PlaintextConsumerTest > testMultiConsumerSessionTimeoutOnClose STARTED

kafka.api.PlaintextConsumerTest > testMultiConsumerSessionTimeoutOnClose PASSED

kafka.api.PlaintextConsumerTest > testFetchRecordLargerThanFetchMaxBytes STARTED

kafka.api.PlaintextConsumerTest > testFetchRecordLargerThanFetchMaxBytes PASSED

kafka.api.PlaintextConsumerTest > testMultiConsumerDefaultAssignment STARTED

kafka.api.PlaintextConsumerTest > testMultiConsumerDefaultAssignment PASSED

kafka.api.PlaintextConsumerTest > testAutoCommitOnClose STARTED

kafka.api.PlaintextConsumerTest > testAutoCommitOnClose PASSED

kafka.api.PlaintextConsumerTest > testListTopics STARTED

kafka.api.PlaintextConsumerTest > testListTopics PASSED

kafka.api.PlaintextConsumerTest > testExpandingTopicSubscriptions STARTED

kafka.api.PlaintextConsumerTest > testExpandingTopicSubscriptions PASSED

kafka.api.PlaintextConsumerTest > testInterceptors STARTED

kafka.api.PlaintextConsumerTest > testInterceptors PASSED

kafka.api.PlaintextConsumerTest > testPatternUnsubscription STARTED

kafka.api.PlaintextConsumerTest > testPatternUnsubscription PASSED

kafka.api.PlaintextConsumerTest > testGroupConsumption STARTED

kafka.api.PlaintextConsumerTest > testGroupConsumption PASSED

kafka.api.PlaintextConsumerTest > testPartitionsFor STARTED

kafka.api.PlaintextConsumerTest > testPartitionsFor PASSED

kafka.api.PlaintextConsumerTest > testAutoCommitOnRebalance STARTED


[GitHub] kafka pull request #2252: HOTFIX: fix state transition stuck on rebalance

2016-12-13 Thread enothereska
GitHub user enothereska opened a pull request:

https://github.com/apache/kafka/pull/2252

HOTFIX: fix state transition stuck on rebalance

This fixes a problem where the Kafka instance state transition gets stuck 
on rebalance. Also adjusts the test in QueryableStateIntegrationTest.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/enothereska/kafka hotfix_state_never_running

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2252.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2252


commit 0476125a19b9e824d8cdd181149a1a038d52b7c5
Author: Eno Thereska 
Date:   2016-12-13T20:50:13Z

Fixed state stuck on rebalance

commit 67afcf855af702b4aedafc8f403b7ac5cb82e080
Author: Eno Thereska 
Date:   2016-12-13T20:51:11Z

Merge remote-tracking branch 'origin/trunk' into hotfix_state_never_running




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Build failed in Jenkins: kafka-trunk-jdk8 #1100

2016-12-13 Thread Apache Jenkins Server
See 

Changes:

[jason] KAFKA-4390; Replace MessageSet usage with client-side alternatives

--
[...truncated 7920 lines...]

kafka.utils.timer.TimerTaskListTest > testAll STARTED

kafka.utils.timer.TimerTaskListTest > testAll PASSED

kafka.utils.CommandLineUtilsTest > testParseEmptyArg STARTED

kafka.utils.CommandLineUtilsTest > testParseEmptyArg PASSED

kafka.utils.CommandLineUtilsTest > testParseSingleArg STARTED

kafka.utils.CommandLineUtilsTest > testParseSingleArg PASSED

kafka.utils.CommandLineUtilsTest > testParseArgs STARTED

kafka.utils.CommandLineUtilsTest > testParseArgs PASSED

kafka.utils.CommandLineUtilsTest > testParseEmptyArgAsValid STARTED

kafka.utils.CommandLineUtilsTest > testParseEmptyArgAsValid PASSED

kafka.utils.ReplicationUtilsTest > testUpdateLeaderAndIsr STARTED

kafka.utils.ReplicationUtilsTest > testUpdateLeaderAndIsr PASSED

kafka.utils.ReplicationUtilsTest > testGetLeaderIsrAndEpochForPartition STARTED

kafka.utils.ReplicationUtilsTest > testGetLeaderIsrAndEpochForPartition PASSED

kafka.utils.JsonTest > testJsonEncoding STARTED

kafka.utils.JsonTest > testJsonEncoding PASSED

kafka.utils.SchedulerTest > testMockSchedulerNonPeriodicTask STARTED

kafka.utils.SchedulerTest > testMockSchedulerNonPeriodicTask PASSED

kafka.utils.SchedulerTest > testMockSchedulerPeriodicTask STARTED

kafka.utils.SchedulerTest > testMockSchedulerPeriodicTask PASSED

kafka.utils.SchedulerTest > testNonPeriodicTask STARTED

kafka.utils.SchedulerTest > testNonPeriodicTask PASSED

kafka.utils.SchedulerTest > testRestart STARTED

kafka.utils.SchedulerTest > testRestart PASSED

kafka.utils.SchedulerTest > testReentrantTaskInMockScheduler STARTED

kafka.utils.SchedulerTest > testReentrantTaskInMockScheduler PASSED

kafka.utils.SchedulerTest > testPeriodicTask STARTED

kafka.utils.SchedulerTest > testPeriodicTask PASSED

kafka.utils.ZkUtilsTest > testAbortedConditionalDeletePath STARTED

kafka.utils.ZkUtilsTest > testAbortedConditionalDeletePath PASSED

kafka.utils.ZkUtilsTest > testSuccessfulConditionalDeletePath STARTED

kafka.utils.ZkUtilsTest > testSuccessfulConditionalDeletePath PASSED

kafka.utils.ZkUtilsTest > testClusterIdentifierJsonParsing STARTED

kafka.utils.ZkUtilsTest > testClusterIdentifierJsonParsing PASSED

kafka.utils.IteratorTemplateTest > testIterator STARTED

kafka.utils.IteratorTemplateTest > testIterator PASSED

kafka.utils.UtilsTest > testGenerateUuidAsBase64 STARTED

kafka.utils.UtilsTest > testGenerateUuidAsBase64 PASSED

kafka.utils.UtilsTest > testAbs STARTED

kafka.utils.UtilsTest > testAbs PASSED

kafka.utils.UtilsTest > testReplaceSuffix STARTED

kafka.utils.UtilsTest > testReplaceSuffix PASSED

kafka.utils.UtilsTest > testCircularIterator STARTED

kafka.utils.UtilsTest > testCircularIterator PASSED

kafka.utils.UtilsTest > testReadBytes STARTED

kafka.utils.UtilsTest > testReadBytes PASSED

kafka.utils.UtilsTest > testCsvList STARTED

kafka.utils.UtilsTest > testCsvList PASSED

kafka.utils.UtilsTest > testReadInt STARTED

kafka.utils.UtilsTest > testReadInt PASSED

kafka.utils.UtilsTest > testUrlSafeBase64EncodeUUID STARTED

kafka.utils.UtilsTest > testUrlSafeBase64EncodeUUID PASSED

kafka.utils.UtilsTest > testCsvMap STARTED

kafka.utils.UtilsTest > testCsvMap PASSED

kafka.utils.UtilsTest > testInLock STARTED

kafka.utils.UtilsTest > testInLock PASSED

kafka.utils.UtilsTest > testSwallow STARTED

kafka.utils.UtilsTest > testSwallow PASSED

kafka.producer.AsyncProducerTest > testFailedSendRetryLogic STARTED

kafka.producer.AsyncProducerTest > testFailedSendRetryLogic PASSED

kafka.producer.AsyncProducerTest > testQueueTimeExpired STARTED

kafka.producer.AsyncProducerTest > testQueueTimeExpired PASSED

kafka.producer.AsyncProducerTest > testPartitionAndCollateEvents STARTED

kafka.producer.AsyncProducerTest > testPartitionAndCollateEvents PASSED

kafka.producer.AsyncProducerTest > testBatchSize STARTED

kafka.producer.AsyncProducerTest > testBatchSize PASSED

kafka.producer.AsyncProducerTest > testSerializeEvents STARTED

kafka.producer.AsyncProducerTest > testSerializeEvents PASSED

kafka.producer.AsyncProducerTest > testProducerQueueSize STARTED

kafka.producer.AsyncProducerTest > testProducerQueueSize PASSED

kafka.producer.AsyncProducerTest > testRandomPartitioner STARTED

kafka.producer.AsyncProducerTest > testRandomPartitioner PASSED

kafka.producer.AsyncProducerTest > testInvalidConfiguration STARTED

kafka.producer.AsyncProducerTest > testInvalidConfiguration PASSED

kafka.producer.AsyncProducerTest > testInvalidPartition STARTED

kafka.producer.AsyncProducerTest > testInvalidPartition PASSED

kafka.producer.AsyncProducerTest > testNoBroker STARTED

kafka.producer.AsyncProducerTest > testNoBroker PASSED

kafka.producer.AsyncProducerTest > testProduceAfterClosed STARTED

kafka.producer.AsyncProducerTest > testProduceAfterClosed PASSED

kafka.producer.AsyncProducerTest 

[jira] [Commented] (KAFKA-4532) StateStores can be connected to the wrong source topic resulting in incorrect metadata returned from IQ

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746201#comment-15746201
 ] 

ASF GitHub Bot commented on KAFKA-4532:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2250


> StateStores can be connected to the wrong source topic resulting in incorrect 
> metadata returned from IQ
> ---
>
> Key: KAFKA-4532
> URL: https://issues.apache.org/jira/browse/KAFKA-4532
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
>Reporter: Damian Guy
>Assignee: Damian Guy
> Fix For: 0.10.2.0
>
>
> When building a topology with tables and StateStores, the StateStores are 
> mapped to the source topic names. This map is retrieved via 
> {{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
> Queries to find the source topics and partitions when resolving the 
> partitions that particular keys will be in.
> There is an issue whereby this mapping for a table that is originally 
> created with {{builder.table("topic", "table");}}, and then is subsequently 
> used in a join, is changed to the join topic. This is because the mapping is 
> updated during the call to {{topology.connectProcessorAndStateStores(..)}}. 
> In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
> for the state store name it should not update the Map.
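A minimal sketch of the guard described above, assuming a simplified map and method names; this is illustrative only, not the actual TopologyBuilder code:

{code}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative only: keeps the first source-topic mapping recorded for a state store,
// so a later connect call (e.g. from a join) does not overwrite the original topic.
final class StoreTopicMapping {
    private final Map<String, Set<String>> stateStoreNameToSourceTopics = new HashMap<>();

    void connect(String storeName, Set<String> sourceTopics) {
        // Only record the mapping if none exists yet, as proposed in the description above.
        stateStoreNameToSourceTopics.putIfAbsent(storeName, sourceTopics);
    }

    Map<String, Set<String>> mapping() {
        return stateStoreNameToSourceTopics;
    }
}
{code}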



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request #2250: KAFKA-4532: StateStores can be connected to the wr...

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2250


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Resolved] (KAFKA-4532) StateStores can be connected to the wrong source topic resulting in incorrect metadata returned from IQ

2016-12-13 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang resolved KAFKA-4532.
--
Resolution: Fixed

Issue resolved by pull request 2250
[https://github.com/apache/kafka/pull/2250]

> StateStores can be connected to the wrong source topic resulting in incorrect 
> metadata returned from IQ
> ---
>
> Key: KAFKA-4532
> URL: https://issues.apache.org/jira/browse/KAFKA-4532
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
>Reporter: Damian Guy
>Assignee: Damian Guy
> Fix For: 0.10.2.0
>
>
> When building a topology with tables and StateStores, the StateStores are 
> mapped to the source topic names. This map is retrieved via 
> {{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
> Queries to find the source topics and partitions when resolving the 
> partitions that particular keys will be in.
> There is an issue whereby this mapping for a table that is originally 
> created with {{builder.table("topic", "table");}}, and then is subsequently 
> used in a join, is changed to the join topic. This is because the mapping is 
> updated during the call to {{topology.connectProcessorAndStateStores(..)}}. 
> In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
> for the state store name it should not update the Map.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4509) Task reusage on rebalance fails for threads on same host

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746138#comment-15746138
 ] 

ASF GitHub Bot commented on KAFKA-4509:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2233


> Task reusage on rebalance fails for threads on same host
> 
>
> Key: KAFKA-4509
> URL: https://issues.apache.org/jira/browse/KAFKA-4509
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
> Fix For: 0.10.2.0
>
>
> In https://issues.apache.org/jira/browse/KAFKA-3559 task reusage on rebalance 
> was introduced as a performance optimization. Instead of closing a task on 
> rebalance (i.e., {{onPartitionsRevoked()}}), it only gets suspended for a 
> potential reuse in {{onPartitionsAssigned()}}. Only if a task cannot be 
> reused, it will eventually get closed in {{onPartitionsAssigned()}}.
> This mechanism can fail, if multiple {{StreamThreads}} run in the same host 
> (same or different JVM). The scenario is as follows:
>  - assume 2 running threads A and B
>  - assume 3 tasks t1, t2, t3
>  - assignment: A-(t1,t2) and B-(t3)
>  - on the same host, a new single threaded Stream application (same app-id) 
> gets started (thread C)
>  - on rebalance, t2 (could also be t1 -- does not matter) will be moved from 
> A to C
>  - as assignment is only sticky based on a heuristic, t1 can sometimes be 
> assigned to B, too -- and t3 gets assigned to A (there is a race condition as to 
> whether this "task flipping" happens or not)
>  - on revoke, A will suspend task t1 and t2 (not releasing any locks)
>  - on assign
> - A tries to create t3 but as B did not release it yet, A dies with an 
> "cannot get lock" exception
> - B tries to create t1 but as A did not release it yet, B dies with an 
> "cannot get lock" exception
> - as A and B try to create the task first, this will always fail if task 
> flipping happened
>- C tries to create t2 but A did not release t2 lock yet (race condition) 
> and C dies with an exception (this could even happen without "task flipping" 
> between A and B)
> We want to fix this, by:
>   # first release unassigned suspended tasks in {{onPartitionsAssignment()}}, 
> and afterward create new tasks (this fixes the "task flipping" issue)
>   # use a "backoff and retry mechanism" if a task cannot be created (to 
> handle release-create race condition between different threads)
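A rough sketch of the second point, the backoff-and-retry loop around task creation. The names here (TaskFactory, LockUnavailableException, the backoff parameters) are illustrative, not the actual Streams internals:

{code}
// Illustrative sketch of "backoff and retry" task creation; not the actual StreamThread code.
final class TaskCreationRetry {

    static class LockUnavailableException extends RuntimeException { }

    interface TaskFactory<T> {
        T create() throws LockUnavailableException;
    }

    static <T> T createWithRetry(TaskFactory<T> factory, int maxAttempts, long backoffMs)
            throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                // May fail while another thread on the same host still holds the state directory lock.
                return factory.create();
            } catch (LockUnavailableException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(backoffMs); // give the other thread time to release the lock
            }
        }
    }
}
{code}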



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request #2233: KAFKA-4509: Task reusage on rebalance fails for th...

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2233


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Updated] (KAFKA-4509) Task reusage on rebalance fails for threads on same host

2016-12-13 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-4509:
-
   Resolution: Fixed
Fix Version/s: 0.10.2.0
   Status: Resolved  (was: Patch Available)

Issue resolved by pull request 2233
[https://github.com/apache/kafka/pull/2233]

> Task reusage on rebalance fails for threads on same host
> 
>
> Key: KAFKA-4509
> URL: https://issues.apache.org/jira/browse/KAFKA-4509
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
> Fix For: 0.10.2.0
>
>
> In https://issues.apache.org/jira/browse/KAFKA-3559 task reusage on rebalance 
> was introduced as a performance optimization. Instead of closing a task on 
> rebalance (i.e., {{onPartitionsRevoked()}}), it only gets suspended for a 
> potential reuse in {{onPartitionsAssigned()}}. Only if a task cannot be 
> reused, it will eventually get closed in {{onPartitionsAssigned()}}.
> This mechanism can fail, if multiple {{StreamThreads}} run in the same host 
> (same or different JVM). The scenario is as follows:
>  - assume 2 running threads A and B
>  - assume 3 tasks t1, t2, t3
>  - assignment: A-(t1,t2) and B-(t3)
>  - on the same host, a new single threaded Stream application (same app-id) 
> gets started (thread C)
>  - on rebalance, t2 (could also be t1 -- does not matter) will be moved from 
> A to C
>  - as assignment is only sticky based on a heuristic, t1 can sometimes be 
> assigned to B, too -- and t3 gets assigned to A (there is a race condition as to 
> whether this "task flipping" happens or not)
>  - on revoke, A will suspend task t1 and t2 (not releasing any locks)
>  - on assign
> - A tries to create t3 but as B did not release it yet, A dies with an 
> "cannot get lock" exception
> - B tries to create t1 but as A did not release it yet, B dies with an 
> "cannot get lock" exception
> - as A and B try to create the task first, this will always fail if task 
> flipping happened
>- C tries to create t2 but A did not release t2 lock yet (race condition) 
> and C dies with an exception (this could even happen without "task flipping" 
> between A and B)
> We want to fix this, by:
>   # first release unassigned suspended tasks in {{onPartitionsAssignment()}}, 
> and afterward create new tasks (this fixes the "task flipping" issue)
>   # use a "backoff and retry mechanism" if a task cannot be created (to 
> handle release-create race condition between different threads)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4534) StreamPartitionAssignor only ever updates the partitionsByHostState and metadataWithInternalTopics once.

2016-12-13 Thread Damian Guy (JIRA)
Damian Guy created KAFKA-4534:
-

 Summary: StreamPartitionAssignor only ever updates the 
partitionsByHostState and metadataWithInternalTopics once.
 Key: KAFKA-4534
 URL: https://issues.apache.org/jira/browse/KAFKA-4534
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 0.10.2.0
Reporter: Damian Guy
Assignee: Damian Guy
 Fix For: 0.10.2.0


StreamPartitionAssignor only ever updates the partitionsByHostState and 
metadataWithInternalTopics once. This results in incorrect metadata on 
rebalances.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4437) Incremental Batch Processing for Kafka Streams

2016-12-13 Thread Avi Flax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745978#comment-15745978
 ] 

Avi Flax commented on KAFKA-4437:
-

Ah, great, thanks!

> Incremental Batch Processing for Kafka Streams
> --
>
> Key: KAFKA-4437
> URL: https://issues.apache.org/jira/browse/KAFKA-4437
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>
> We want to add an “auto stop” feature that terminates a stream application 
> when it has processed all the data that was newly available at the time the 
> application started (i.e., up to the then-current end-of-log, the current high 
> watermark). This allows chopping the (infinite) log into finite chunks, where each 
> run of the application processes one chunk. This feature allows for incremental 
> batch-like processing; think "start-process-stop-restart-process-stop-..."
> For details see KIP-95: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-95%3A+Incremental+Batch+Processing+for+Kafka+Streams
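The idea can be illustrated with a plain consumer sketch (not the KIP's actual mechanism, which is described on the wiki page above): capture the end offsets at startup and stop once every assigned partition has been consumed up to them. The topic name and configuration values are assumptions for the example.

{code}
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Illustrative "consume what existed at startup, then stop" loop.
public class IncrementalBatchConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "incremental-batch-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("input"));
            while (consumer.assignment().isEmpty()) {
                consumer.poll(100); // wait for the group assignment
            }
            // Snapshot of the high watermarks at startup; this run stops once it reaches them.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(consumer.assignment());

            boolean caughtUp = false;
            while (!caughtUp) {
                ConsumerRecords<String, String> records = consumer.poll(500);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.key() + " -> " + record.value());
                }
                caughtUp = endOffsets.entrySet().stream()
                    .allMatch(e -> consumer.position(e.getKey()) >= e.getValue());
            }
        }
    }
}
{code}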



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Build failed in Jenkins: kafka-trunk-jdk8 #1099

2016-12-13 Thread Apache Jenkins Server
See 

Changes:

[wangguoz] MINOR: Fix Streams examples in documentation

--
[...truncated 3907 lines...]

kafka.utils.CommandLineUtilsTest > testParseEmptyArg STARTED

kafka.utils.CommandLineUtilsTest > testParseEmptyArg PASSED

kafka.utils.CommandLineUtilsTest > testParseSingleArg STARTED

kafka.utils.CommandLineUtilsTest > testParseSingleArg PASSED

kafka.utils.CommandLineUtilsTest > testParseArgs STARTED

kafka.utils.CommandLineUtilsTest > testParseArgs PASSED

kafka.utils.CommandLineUtilsTest > testParseEmptyArgAsValid STARTED

kafka.utils.CommandLineUtilsTest > testParseEmptyArgAsValid PASSED

kafka.utils.ReplicationUtilsTest > testUpdateLeaderAndIsr STARTED

kafka.utils.ReplicationUtilsTest > testUpdateLeaderAndIsr PASSED

kafka.utils.ReplicationUtilsTest > testGetLeaderIsrAndEpochForPartition STARTED

kafka.utils.ReplicationUtilsTest > testGetLeaderIsrAndEpochForPartition PASSED

kafka.utils.JsonTest > testJsonEncoding STARTED

kafka.utils.JsonTest > testJsonEncoding PASSED

kafka.utils.SchedulerTest > testMockSchedulerNonPeriodicTask STARTED

kafka.utils.SchedulerTest > testMockSchedulerNonPeriodicTask PASSED

kafka.utils.SchedulerTest > testMockSchedulerPeriodicTask STARTED

kafka.utils.SchedulerTest > testMockSchedulerPeriodicTask PASSED

kafka.utils.SchedulerTest > testNonPeriodicTask STARTED

kafka.utils.SchedulerTest > testNonPeriodicTask PASSED

kafka.utils.SchedulerTest > testRestart STARTED

kafka.utils.SchedulerTest > testRestart PASSED

kafka.utils.SchedulerTest > testReentrantTaskInMockScheduler STARTED

kafka.utils.SchedulerTest > testReentrantTaskInMockScheduler PASSED

kafka.utils.SchedulerTest > testPeriodicTask STARTED

kafka.utils.SchedulerTest > testPeriodicTask PASSED

kafka.utils.ZkUtilsTest > testAbortedConditionalDeletePath STARTED

kafka.utils.ZkUtilsTest > testAbortedConditionalDeletePath PASSED

kafka.utils.ZkUtilsTest > testSuccessfulConditionalDeletePath STARTED

kafka.utils.ZkUtilsTest > testSuccessfulConditionalDeletePath PASSED

kafka.utils.ZkUtilsTest > testClusterIdentifierJsonParsing STARTED

kafka.utils.ZkUtilsTest > testClusterIdentifierJsonParsing PASSED

kafka.utils.IteratorTemplateTest > testIterator STARTED

kafka.utils.IteratorTemplateTest > testIterator PASSED

kafka.utils.UtilsTest > testGenerateUuidAsBase64 STARTED

kafka.utils.UtilsTest > testGenerateUuidAsBase64 PASSED

kafka.utils.UtilsTest > testAbs STARTED

kafka.utils.UtilsTest > testAbs PASSED

kafka.utils.UtilsTest > testReplaceSuffix STARTED

kafka.utils.UtilsTest > testReplaceSuffix PASSED

kafka.utils.UtilsTest > testCircularIterator STARTED

kafka.utils.UtilsTest > testCircularIterator PASSED

kafka.utils.UtilsTest > testReadBytes STARTED

kafka.utils.UtilsTest > testReadBytes PASSED

kafka.utils.UtilsTest > testCsvList STARTED

kafka.utils.UtilsTest > testCsvList PASSED

kafka.utils.UtilsTest > testReadInt STARTED

kafka.utils.UtilsTest > testReadInt PASSED

kafka.utils.UtilsTest > testUrlSafeBase64EncodeUUID STARTED

kafka.utils.UtilsTest > testUrlSafeBase64EncodeUUID PASSED

kafka.utils.UtilsTest > testCsvMap STARTED

kafka.utils.UtilsTest > testCsvMap PASSED

kafka.utils.UtilsTest > testInLock STARTED

kafka.utils.UtilsTest > testInLock PASSED

kafka.utils.UtilsTest > testSwallow STARTED

kafka.utils.UtilsTest > testSwallow PASSED

kafka.producer.AsyncProducerTest > testFailedSendRetryLogic STARTED

kafka.producer.AsyncProducerTest > testFailedSendRetryLogic PASSED

kafka.producer.AsyncProducerTest > testQueueTimeExpired STARTED

kafka.producer.AsyncProducerTest > testQueueTimeExpired PASSED

kafka.producer.AsyncProducerTest > testPartitionAndCollateEvents STARTED

kafka.producer.AsyncProducerTest > testPartitionAndCollateEvents PASSED

kafka.producer.AsyncProducerTest > testBatchSize STARTED

kafka.producer.AsyncProducerTest > testBatchSize PASSED

kafka.producer.AsyncProducerTest > testSerializeEvents STARTED

kafka.producer.AsyncProducerTest > testSerializeEvents PASSED

kafka.producer.AsyncProducerTest > testProducerQueueSize STARTED

kafka.producer.AsyncProducerTest > testProducerQueueSize PASSED

kafka.producer.AsyncProducerTest > testRandomPartitioner STARTED

kafka.producer.AsyncProducerTest > testRandomPartitioner PASSED

kafka.producer.AsyncProducerTest > testInvalidConfiguration STARTED

kafka.producer.AsyncProducerTest > testInvalidConfiguration PASSED

kafka.producer.AsyncProducerTest > testInvalidPartition STARTED

kafka.producer.AsyncProducerTest > testInvalidPartition PASSED

kafka.producer.AsyncProducerTest > testNoBroker STARTED

kafka.producer.AsyncProducerTest > testNoBroker PASSED

kafka.producer.AsyncProducerTest > testProduceAfterClosed STARTED

kafka.producer.AsyncProducerTest > testProduceAfterClosed PASSED

kafka.producer.AsyncProducerTest > testJavaProducer STARTED

kafka.producer.AsyncProducerTest > testJavaProducer PASSED

kafka.producer.AsyncProducerTest > 

[jira] [Commented] (KAFKA-4529) tombstone may be removed earlier than it should

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745956#comment-15745956
 ] 

ASF GitHub Bot commented on KAFKA-4529:
---

GitHub user becketqin opened a pull request:

https://github.com/apache/kafka/pull/2251

KAFKA-4529; Fix the issue that tombstone can be deleted too early.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/becketqin/kafka KAFKA-4529

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2251.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2251


commit 42f284bf0c1897899b4d620d18de9f66f65617f3
Author: Jiangjie Qin 
Date:   2016-12-13T19:02:08Z

KAFKA-4529; Fix the issue that tombstone can be deleted too early.




> tombstone may be removed earlier than it should
> ---
>
> Key: KAFKA-4529
> URL: https://issues.apache.org/jira/browse/KAFKA-4529
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Jun Rao
>Assignee: Jiangjie Qin
> Fix For: 0.10.1.1
>
>
> As part of KIP-33, we introduced a regression in how tombstones are removed in 
> a compacted topic. We want to delay the removal of a tombstone to avoid the 
> case that a reader first reads a non-tombstone message on a key and then 
> doesn't see the tombstone for the key because it's deleted too quickly. So, a 
> tombstone is supposed to only be removed from a compacted topic after the 
> tombstone is part of the cleaned portion of the log after delete.retention.ms.
> Before KIP-33, deleteHorizonMs in LogCleaner is calculated based on the last 
> modified time, which is monotonically increasing from old to new segments. 
> With KIP-33, deleteHorizonMs is calculated based on the message timestamp, 
> which is not necessarily monotonically increasing.
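A hedged illustration of the difference described above (simplified, not the actual kafka.log.LogCleaner code): the delete horizon used to be derived from the last-modified time of the last cleanable segment, which grows from old to new segments, whereas after KIP-33 it is derived from message timestamps, which need not increase.

{code}
// Simplified illustration only; the real logic lives in kafka.log.LogCleaner (Scala).
final class DeleteHorizon {

    // Pre-KIP-33 style: based on the segment's last-modified time, which is
    // monotonically increasing from old segments to new ones.
    static long fromLastModified(long lastCleanableSegmentLastModifiedMs, long deleteRetentionMs) {
        return lastCleanableSegmentLastModifiedMs - deleteRetentionMs;
    }

    // Post-KIP-33 style: based on the largest message timestamp, which is NOT
    // guaranteed to be monotonic, so a tombstone can fall behind the horizon
    // earlier than intended.
    static long fromMessageTimestamp(long largestMessageTimestampMs, long deleteRetentionMs) {
        return largestMessageTimestampMs - deleteRetentionMs;
    }
}
{code}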



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request #2251: KAFKA-4529; Fix the issue that tombstone can be de...

2016-12-13 Thread becketqin
GitHub user becketqin opened a pull request:

https://github.com/apache/kafka/pull/2251

KAFKA-4529; Fix the issue that tombstone can be deleted too early.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/becketqin/kafka KAFKA-4529

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2251.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2251


commit 42f284bf0c1897899b4d620d18de9f66f65617f3
Author: Jiangjie Qin 
Date:   2016-12-13T19:02:08Z

KAFKA-4529; Fix the issue that tombstone can be deleted too early.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745919#comment-15745919
 ] 

Tom DeVoe commented on KAFKA-4477:
--

Sorry about that [~apurva], I attached the state change and controller logs 
from that period in the state_change_controller.tar.gz tarball.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to just itself; moments later we see other nodes having disconnects, 
> followed finally by application issues where producing to these partitions is 
> blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issues.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAITs and 
> open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are currently putting in an automated 
> process that tails the logs and matches the regex below; where the new_partitions 
> group is just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom DeVoe updated KAFKA-4477:
-
Attachment: state_change_controller.tar.gz

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to just itself; moments later we see other nodes having disconnects, 
> followed finally by application issues where producing to these partitions is 
> blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issues.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAITs and 
> open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are currently putting in an automated 
> process that tails the logs and matches the regex below; where the new_partitions 
> group is just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4390) Replace MessageSet usage with client-side equivalents

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745892#comment-15745892
 ] 

ASF GitHub Bot commented on KAFKA-4390:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2140


> Replace MessageSet usage with client-side equivalents
> -
>
> Key: KAFKA-4390
> URL: https://issues.apache.org/jira/browse/KAFKA-4390
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
> Fix For: 0.10.2.0
>
>
> Currently we have two separate implementations of Kafka's message format and 
> log structure, one on the client side and one on the server side. Once 
> KAFKA-2066 is merged, we will only be using the client side objects for 
> direct serialization/deserialization in the request APIs, but we we still be 
> using the server-side MessageSet objects everywhere else. Ideally, we can 
> update this code to use the client objects everywhere so that future message 
> format changes only need to be made in one place. This would eliminate the 
> potential for implementation differences and gives us a uniform API for 
> accessing the low-level log structure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request #2140: KAFKA-4390: Replace MessageSet usage with client-s...

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2140


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Resolved] (KAFKA-4390) Replace MessageSet usage with client-side equivalents

2016-12-13 Thread Jason Gustafson (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-4390.

   Resolution: Fixed
Fix Version/s: 0.10.2.0

Issue resolved by pull request 2140
[https://github.com/apache/kafka/pull/2140]

> Replace MessageSet usage with client-side equivalents
> -
>
> Key: KAFKA-4390
> URL: https://issues.apache.org/jira/browse/KAFKA-4390
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
> Fix For: 0.10.2.0
>
>
> Currently we have two separate implementations of Kafka's message format and 
> log structure, one on the client side and one on the server side. Once 
> KAFKA-2066 is merged, we will only be using the client side objects for 
> direct serialization/deserialization in the request APIs, but we we still be 
> using the server-side MessageSet objects everywhere else. Ideally, we can 
> update this code to use the client objects everywhere so that future message 
> format changes only need to be made in one place. This would eliminate the 
> potential for implementation differences and gives us a uniform API for 
> accessing the low-level log structure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[DISCUSS] KIP-90 Remove zkClient dependency from Streams

2016-12-13 Thread Hojjat Jafarpour
Hi all,

The following is a KIP for removing zkClient dependency from Streams.
Please check out the KIP page:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-90+-+Remove+zkClient+dependency+from+Streams

Thanks,
--Hojjat


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745817#comment-15745817
 ] 

Jun Rao commented on KAFKA-4477:


[~michael.andre.pearce], [~tdevoe],  was the following exception in the 
follower continuous after it started?
java.io.IOException: Connection to 1002 was disconnected before the response 
was read

Do you know the FollowerFetch request latency (reported by jmx) on the leader 
when the issue happened?

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to just itself; moments later we see other nodes having disconnects, 
> followed finally by application issues where producing to these partitions is 
> blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issues.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAITs and 
> open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are currently putting in an automated 
> process that tails the logs and matches the regex below; where the new_partitions 
> group is just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4269) Multiple KStream instances with at least one Regex source causes NPE when using multiple consumers

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745818#comment-15745818
 ] 

ASF GitHub Bot commented on KAFKA-4269:
---

Github user bbejeck closed the pull request at:

https://github.com/apache/kafka/pull/2090


> Multiple KStream instances with at least one Regex source causes NPE when 
> using multiple consumers
> --
>
> Key: KAFKA-4269
> URL: https://issues.apache.org/jira/browse/KAFKA-4269
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0
>Reporter: Bill Bejeck
>Assignee: Bill Bejeck
> Fix For: 0.10.1.1, 0.10.2.0
>
>
> I discovered this issue while doing testing for KAFKA-4114. 
> KAFKA-4131 fixed the issue of a _single_ KStream with a regex source on 
> partitioned topics across multiple consumers.
> //KAFKA-4131 fixed this case, assuming the "foo*" topics are partitioned
> KStream kstream = builder.source(Pattern.compile("foo.*"));
> KafkaStreams stream = new KafkaStreams(builder, props);
> stream.start();  
> This is a new issue where there are _multiple_
> KStream instances (and one has a regex source) within a single KafkaStreams 
> object. When running the second or "following"
> consumer there are NPE errors generated in the RecordQueue.addRawRecords 
> method when attempting to consume records. 
> For example:
> KStream kstream = builder.source(Pattern.compile("foo.*"));
> KStream kstream2 = builder.source(...); //can be a regex or named-topic 
> source
> KafkaStreams stream = new KafkaStreams(builder, props);
> stream.start();
> Adding an additional KStream instance like the above (whether a regex or named 
> topic) causes an NPE when run as "follower"
> From my initial debugging I can see the TopicPartition assignments being set 
> on the "follower" KafkaStreams instance, but need to track down why and where 
> all assignments aren't being set.
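For clarity, a hedged, compilable version of the fragments quoted above: two KStream instances, one built from a regex source, inside a single KafkaStreams application. The topic names, serdes, and configuration values are illustrative assumptions, using 0.10.1-era Streams APIs (KStreamBuilder):

{code}
import java.util.Properties;
import java.util.regex.Pattern;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

// Illustrative reproduction of the scenario described above, not the reporter's exact test code.
public class MultipleSourcesExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "regex-source-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        KStreamBuilder builder = new KStreamBuilder();
        // One source built from a regex pattern, one from a named topic.
        KStream<String, String> fromPattern = builder.stream(Pattern.compile("foo.*"));
        KStream<String, String> fromTopic = builder.stream("bar");

        fromPattern.print();
        fromTopic.print();

        KafkaStreams streams = new KafkaStreams(builder, props);
        streams.start();
    }
}
{code}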



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request #2090: KAFKA-4269: Follow up for 0.10.1 branch -update to...

2016-12-13 Thread bbejeck
Github user bbejeck closed the pull request at:

https://github.com/apache/kafka/pull/2090


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745813#comment-15745813
 ] 

Jun Rao commented on KAFKA-4477:


[~tdevoe], thanks for the clarification. Then, it looks similar to what's 
reported in the jira.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to just itself; moments later we see other nodes having disconnects, 
> followed finally by application issues where producing to these partitions is 
> blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issues.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAITs and 
> open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are currently putting in an automated 
> process that tails the logs and matches the regex below; where the new_partitions 
> group is just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4532) StateStores can be connected to the wrong source topic resulting in incorrect metadata returned from IQ

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745809#comment-15745809
 ] 

ASF GitHub Bot commented on KAFKA-4532:
---

GitHub user dguy opened a pull request:

https://github.com/apache/kafka/pull/2250

KAFKA-4532: StateStores can be connected to the wrong source topic 
resulting in incorrect metadata returned from Interactive Queries

When building a topology with tables and StateStores, the StateStores are 
mapped to the source topic names. This map is retrieved via 
TopologyBuilder.stateStoreNameToSourceTopics() and is used in Interactive 
Queries to find the source topics and partitions when resolving the partitions 
that particular keys will be in.
There is an issue whereby this mapping for a table that is originally 
created with builder.table("topic", "table");, and then is subsequently used in 
a join, is changed to the join topic. This is because the mapping is updated 
during the call to topology.connectProcessorAndStateStores(..).
In the case that the stateStoreNameToSourceTopics Map already has a value 
for the state store name it should not update the Map.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dguy/kafka kafka-4532

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2250.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2250


commit 45833be3ca1b6f8ac86f516cae4ff1b6571089e8
Author: Damian Guy 
Date:   2016-12-13T18:06:38Z

state store name to topic mapping incorrect




> StateStores can be connected to the wrong source topic resulting in incorrect 
> metadata returned from IQ
> ---
>
> Key: KAFKA-4532
> URL: https://issues.apache.org/jira/browse/KAFKA-4532
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
>Reporter: Damian Guy
>Assignee: Damian Guy
> Fix For: 0.10.2.0
>
>
> When building a topology with tables and StateStores, the StateStores are 
> mapped to the source topic names. This map is retrieved via 
> {{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
> Queries to find the source topics and partitions when resolving the 
> partitions that particular keys will be in.
> There is an issue whereby this mapping for a table that is originally 
> created with {{builder.table("topic", "table");}}, and then is subsequently 
> used in a join, is changed to the join topic. This is because the mapping is 
> updated during the call to {{topology.connectProcessorAndStateStores(..)}}. 
> In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
> for the state store name it should not update the Map.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request #2250: KAFKA-4532: StateStores can be connected to the wr...

2016-12-13 Thread dguy
GitHub user dguy opened a pull request:

https://github.com/apache/kafka/pull/2250

KAFKA-4532: StateStores can be connected to the wrong source topic 
resulting in incorrect metadata returned from Interactive Queries

When building a topology with tables and StateStores, the StateStores are 
mapped to the source topic names. This map is retrieved via 
TopologyBuilder.stateStoreNameToSourceTopics() and is used in Interactive 
Queries to find the source topics and partitions when resolving the partitions 
that particular keys will be in.
There is an issue whereby this mapping for a table that is originally 
created with builder.table("topic", "table");, and then is subsequently used in 
a join, is changed to the join topic. This is because the mapping is updated 
during the call to topology.connectProcessorAndStateStores(..).
In the case that the stateStoreNameToSourceTopics Map already has a value 
for the state store name it should not update the Map.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dguy/kafka kafka-4532

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2250.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2250


commit 45833be3ca1b6f8ac86f516cae4ff1b6571089e8
Author: Damian Guy 
Date:   2016-12-13T18:06:38Z

state store name to topic mapping incorrect




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745782#comment-15745782
 ] 

Jun Rao commented on KAFKA-4477:


[~michael.andre.pearce], the 0.10.1.1 release will need RC1 since we need to 
fix another critical bug.

Note that the cause of the leader dropping all followers out of ISR is 
potentially different from dropping just 1 follower. For your issue, it would 
be useful to also look at the controller/state-change log around that time to 
see if the followers have received any LeaderAndIsrRequests.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are the logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are currently putting in an 
> automated process that tails the logs for the regex below; where new_partitions 
> hits just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4532) StateStores can be connected to the wrong source topic resulting in incorrect metadata returned from IQ

2016-12-13 Thread Damian Guy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damian Guy updated KAFKA-4532:
--
Description: 
When building a topology with tables and StateStores, the StateStores are 
mapped to the source topic names. This map is retrieved via 
{{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
Queries to find the source topics and partitions when resolving the partitions 
that particular keys will be in.

There is an issue whereby this mapping for a table that is originally created 
with {{builder.table("topic", "table");}}, and then is subsequently used in a 
join, is changed to the join topic. This is because the mapping is updated 
during the call to {{topology.connectProcessorAndStateStores(..)}}. 

In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
for the state store name it should not update the Map.

  was:
When building a topology with tables and StateStores, the StateStores are 
mapped to the source topic names. This map is retrieved via 
{{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
Queries to find the source topics and partitions when resolving the partitions 
that particular keys will be in.

There is an issue whereby this mapping for a table that is originally created 
with {{builder.table("topic", "table");}}, and then is subsequently used in a 
join, is changed to the join topic. This is because the mapping is updated 
during the call to {{topology.connectProcessorAndStateStores(..)}}. 

In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
for the state store name it should not update the Map.

This also affects Interactive Queries. The metadata returned when trying to 
find the instance for a key from the source table is incorrect as the store 
name is mapped to the incorrect topic.


> StateStores can be connected to the wrong source topic resulting in incorrect 
> metadata returned from IQ
> ---
>
> Key: KAFKA-4532
> URL: https://issues.apache.org/jira/browse/KAFKA-4532
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
>Reporter: Damian Guy
>Assignee: Damian Guy
> Fix For: 0.10.2.0
>
>
> When building a topology with tables and StateStores, the StateStores are 
> mapped to the source topic names. This map is retrieved via 
> {{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
> Queries to find the source topics and partitions when resolving the 
> partitions that particular keys will be in.
> There is an issue whereby this mapping for a table that is originally 
> created with {{builder.table("topic", "table");}}, and then is subsequently 
> used in a join, is changed to the join topic. This is because the mapping is 
> updated during the call to {{topology.connectProcessorAndStateStores(..)}}. 
> In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
> for the state store name it should not update the Map.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745749#comment-15745749
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 12/13/16 5:49 PM:


It is worth noting we see the open file descriptors increase, as mentioned by 
someone else, if we leave the process in a sick mode (now that we restart 
quickly, we don't get to observe this).


was (Author: michael.andre.pearce):
It is worth noting we see the open file descriptors if we leave the process in 
a sick mode (now we restart quickly we don't get to observe this).

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are the logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are currently putting in an 
> automated process that tails the logs for the regex below; where new_partitions 
> hits just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Dev list subscribe

2016-12-13 Thread Rajini Sivaram
Sorry, that was a mail sent by mistake.

On Tue, Dec 13, 2016 at 5:39 PM, Guozhang Wang  wrote:

> Rajini,
>
> It's self-service :)
>
> https://kafka.apache.org/contact
>
> Guozhang
>
> On Tue, Dec 13, 2016 at 5:38 AM, Rajini Sivaram 
> wrote:
>
> > Please subscribe me to the Kafka dev list.
> >
> >
> > Thank you,
> >
> > Rajini
> >
>
>
>
> --
> -- Guozhang
>


[jira] [Resolved] (KAFKA-4497) log cleaner breaks on timeindex

2016-12-13 Thread Jun Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-4497.

Resolution: Fixed

[~roschumann], I merged Jiangjie's patch to the 0.10.1 branch. Do you think you 
could give it a try and see if it fixes your issue?

The fix for trunk will be included in KAFKA-4390. Closing this jira for now.

> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
> Fix For: 0.10.1.1
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with the latest Kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log cleaner we see 
> after startup that it tries to clean up a topic called xdc_listing-status-v2 
> [1]. The topic is set up with log compaction [2] and the Kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because it finds messages which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year; it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean it does not find any log segments which 
> can be cleaned up, or that the last timestamp of the last log segment is 
> somehow broken/missing?
> I’m also a bit wondering why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
> at kafka.log.LogSegment.append(LogSegment.scala:106)
> at kafka.log.Cleaner.cleanInto(LogCleaner.scala:518)
> at 
> kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:404)
> at 
> 

[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745749#comment-15745749
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

It is worth noting we see the open file descriptors increase if we leave the 
process in a sick mode (now that we restart quickly, we don't get to observe this).

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are the logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are currently putting in an 
> automated process that tails the logs for the regex below; where new_partitions 
> hits just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745742#comment-15745742
 ] 

ASF GitHub Bot commented on KAFKA-4497:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2243


> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
> Fix For: 0.10.1.1
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with the latest Kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log cleaner we see 
> after startup that it tries to clean up a topic called xdc_listing-status-v2 
> [1]. The topic is set up with log compaction [2] and the Kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because it finds messages which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year; it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean it does not find any log segments which 
> can be cleaned up, or that the last timestamp of the last log segment is 
> somehow broken/missing?
> I’m also a bit wondering why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
> at kafka.log.LogSegment.append(LogSegment.scala:106)
> at kafka.log.Cleaner.cleanInto(LogCleaner.scala:518)
> at 
> kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:404)
> at 
> kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:400)
> at scala.collection.immutable.List.foreach(List.scala:381)
> 

[GitHub] kafka pull request #2243: KAFKA-4497: LogCleaner appended the wrong offset t...

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2243


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] kafka pull request #2247: MINOR: Fix Streams examples in documentation

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2247


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745734#comment-15745734
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

Hi Jun,

The stack was taken by the automated restart script we've had to put in place, 
before it restarted the nodes; it picked up the issue 20 seconds after it 
started.

The broker during the period is not under high load. We do not see any GC 
issues, nor do we see any ZK issues.

The logs we are seeing are matching those of other people, we have had this 
occur 3 times further all having very similar logs aka nothing new is showing 
up.

On a side note, we are looking to upgrade to 0.10.1.1 as soon as it's released 
and we see it released by Confluent also. We do this as we expect some further 
sanity checks to have occurred, and we use this as a measure to check there are 
no critical issues.

We will aim to push to UAT quickly (where we also see this issue (weirdly we 
haven't had this occur in TEST or DEV)) to see if this is resolved. What is the 
expected timeline for this? Are we still expecting it to be released today? And 
when would Confluent likely complete their testing and release?

Cheers
Mike

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Dev list subscribe

2016-12-13 Thread Guozhang Wang
Rajini,

It's self-service :)

https://kafka.apache.org/contact

Guozhang

On Tue, Dec 13, 2016 at 5:38 AM, Rajini Sivaram  wrote:

> Please subscribe me to the Kafka dev list.
>
>
> Thank you,
>
> Rajini
>



-- 
-- Guozhang


Re: [DISCUSS] KIP-81: Max in-flight fetches

2016-12-13 Thread Rajini Sivaram
Coordinator starvation: For an implementation based on KIP-72, there will
be coordinator starvation without KAFKA-4137 since you would stop reading
from sockets when the memory pool is full (the fact that coordinator
messages are small doesn't help). I imagine you can work around this by
treating coordinator connections as special connections but that spills
over to common network code. The separate NetworkClient for the coordinator
proposed in KAFKA-4137 would be much better.
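
For readers following the thread, a minimal sketch of the consumer settings under
discussion is below. Note that `buffer.memory` is only the KIP-81 proposal, not an
existing consumer config, and the values are purely illustrative.

{code}
// Sketch only: memory-related consumer settings discussed in this thread.
// "buffer.memory" is the KIP-81 proposal; the other three settings already exist.
import java.util.Properties;

public class Kip81ConfigSketch {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("fetch.max.bytes", "52428800");           // 50 MB per fetch response (KIP-74)
        props.put("max.partition.fetch.bytes", "1048576");  // 1 MB per partition (soft limit)
        props.put("max.poll.records", "500");               // max records returned per poll()
        props.put("buffer.memory", "104857600");            // proposed: pool for raw fetched bytes
        return props;
    }
}
{code}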

On Tue, Dec 13, 2016 at 3:47 PM, Mickael Maison 
wrote:

> Thanks for all the feedback.
>
> I've updated the KIP with all the details.
> Below are a few of the main points:
>
> - Overall memory usage of the consumer:
> I made it clear the memory pool is only used to store the raw bytes
> from the network and that the decompressed/deserialized messages are
> not stored in it but as extra memory on the heap. In addition, the
> consumer also keeps track of other things (in flight requests,
> subscriptions, etc..) that account for extra memory as well. So this
> is not a hard memory bound, but it should still allow users to
> roughly size how much memory can be used.
>
> - Relation with the existing settings:
> There are already 2 settings that deal with memory usage of the
> consumer. I suggest we lower the priority of
> `max.partition.fetch.bytes` (I wonder if we should attempt to
> deprecate it or increase its default value so it's a constraint less
> likely to be hit) and have the new setting `buffer.memory` as High.
> I'm a bit unsure what's the best default value for `buffer.memory`, I
> suggested 100MB in the KIP (2 x `fetch.max.bytes`), but I'd appreciate
> feedback. It should always at least be equal to `max.fetch.bytes`.
>
> - Configuration name `buffer.memory`:
> I think it's the name that makes the most sense. It's aligned with the
> producer and as mentioned generic enough to allow future changes if
> needed.
>
> - Coordinator starvation:
> Yes this is a potential issue. I'd expect these requests to be small
> enough to not be affected too much. If that's the case KAFKA-4137
> suggests a possible fix.
>
>
>
> On Tue, Dec 13, 2016 at 9:31 AM, Ismael Juma  wrote:
> > Makes sense Jay.
> >
> > Mickael, in addition to how we can compute defaults of the other settings
> > from `buffer.memory`, it would be good to specify what is allowed and how
> > we handle the different cases (e.g. what do we do if
> > `max.partition.fetch.bytes`
> > is greater than `buffer.memory`, is that simply not allowed?).
> >
> > To summarise the gap between the ideal scenario (user specifies how much
> > memory the consumer can use) and what is being proposed:
> >
> > 1. We will decompress and deserialize the data for one or more partitions
> > in order to return them to the user and we don't account for the
> increased
> > memory usage resulting from that. This is likely to be significant on a
> per
> > record basis, but we try to do it for the minimal number of records
> > possible within the constraints of the system. Currently the constraints
> > are: we decompress and deserialize the data for a partition at a time
> > (default `max.partition.fetch.bytes` is 1MB, but this is a soft limit in
> > case there are oversized messages) until we have enough records to
> > satisfy `max.poll.records`
> > (default 500) or there are no more completed fetches. It seems like this
> > may be OK for a lot of cases, but some tuning will still be required in
> > others.
> >
> > 2. We don't account for bookkeeping data structures or intermediate
> objects
> > allocated during the general operation of the consumer. Probably
> something
> > we have to live with as the cost/benefit of fixing this doesn't seem
> worth
> > it.
> >
> > Ismael
> >
> > On Tue, Dec 13, 2016 at 8:34 AM, Jay Kreps  wrote:
> >
> >> Hey Ismael,
> >>
> >> Yeah I think we are both saying the same thing---removing only works if
> you
> >> have a truly optimal strategy. Actually even dynamically computing a
> >> reasonable default isn't totally obvious (do you set fetch.max.bytes to
> >> equal buffer.memory to try to queue up as much data in the network
> buffers?
> >> Do you try to limit it to your socket.receive.buffer size so that you
> can
> >> read it in a single shot?).
> >>
> >> Regarding what is being measured, my interpretation was the same as
> yours.
> >> I was just adding to the previous point that buffer.memory setting would
> >> not be a very close proxy for memory usage. Someone was pointing out
> that
> >> compression would make this true, and I was just adding that even
> without
> >> compression the object overhead would lead to a high expansion factor.
> >>
> >> -Jay
> >>
> >> On Mon, Dec 12, 2016 at 11:53 PM, Ismael Juma 
> wrote:
> >>
> >> > Hi Jay,
> >> >
> >> > About `max.partition.fetch.bytes`, yes it was an oversight not to
> lower
> >> its
> >> > priority as part of KIP-74 given the existence of 

[jira] [Commented] (KAFKA-4437) Incremental Batch Processing for Kafka Streams

2016-12-13 Thread Matthias J. Sax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745672#comment-15745672
 ] 

Matthias J. Sax commented on KAFKA-4437:


There is a mailing list thread... Just forgot to update the KIP Wiki page... 
Here it goes: 
http://search-hadoop.com/m/Kafka/uyzND1YI7Uf2hpKcc?subj=+DISCUSS+KIP+95+Incremental+Batch+Processing+for+Kafka+Streams

> Incremental Batch Processing for Kafka Streams
> --
>
> Key: KAFKA-4437
> URL: https://issues.apache.org/jira/browse/KAFKA-4437
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>
> We want to add an "auto stop" feature that terminates a stream application 
> when it has processed all the data that was newly available at the time the 
> application started (i.e., up to the then-current end of log / high watermark). 
> This allows chopping the (infinite) log into finite chunks, where each run of 
> the application processes one chunk. This feature allows for incremental 
> batch-like processing; think "start-process-stop-restart-process-stop-..."
> For details see KIP-95: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-95%3A+Incremental+Batch+Processing+for+Kafka+Streams
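
For illustration only (this is not the KIP's implementation), the basic idea can be
approximated today with the plain Java consumer: capture the end offsets at startup
and stop once every assigned partition has reached them. Topic, group, and broker
names below are placeholders.

{code}
// Sketch of "process up to the startup high watermark, then stop"; not KIP-95 code.
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class IncrementalBatchSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "incremental-batch-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("input-topic"));
            consumer.poll(0);  // join the group so an assignment exists
            // high watermarks at startup: this run stops once it has read up to them
            Map<TopicPartition, Long> stopOffsets = consumer.endOffsets(consumer.assignment());
            boolean done = stopOffsets.isEmpty();
            while (!done) {
                for (ConsumerRecord<String, String> record : consumer.poll(1000)) {
                    System.out.println(record.offset() + ": " + record.value());  // "processing"
                }
                done = stopOffsets.entrySet().stream()
                        .allMatch(e -> consumer.position(e.getKey()) >= e.getValue());
            }
        }
    }
}
{code}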



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745663#comment-15745663
 ] 

Tom DeVoe commented on KAFKA-4477:
--

[~junrao] I respectfully disagree, and this is why I originally was hesitant to 
post the extended logs - the shrinking ISR from 1003, 1001, 1002 happened after 
I restarted node 1002 (as is expected).

If we pay attention to the timestamps, the symptoms described in the ticket 
*exactly* match what I have seen. 

- In the node 1002 log, we see all ISRs reduced to itself at {{2016-11-28 
19:57:05}}.

- Approximately 10 seconds later at {{2016-11-28 19:57:16,003}} (as in the 
original issue description) the other two nodes (1001, 1003) both log 
{{java.io.IOException: Connection to 1002 was disconnected before the response 
was read}}.

- After this occurs, we also see an increasing amount of file descriptors 
opening on node 1002.

Checking the zookeeper logs does not indicate *any* sessions expired at the 
time this issue occurred.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are the logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are currently putting in an 
> automated process that tails the logs for the regex below; where new_partitions 
> hits just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745663#comment-15745663
 ] 

Tom DeVoe edited comment on KAFKA-4477 at 12/13/16 5:20 PM:


[~junrao] I respectfully disagree, and this is why I originally was hesitant to 
post the extended logs - the shrinking ISR from 1003, 1001, 1002 happened after 
I restarted node 1002 (as is expected).

If we pay attention to the timestamps, the symptoms described in the ticket 
*exactly* match what I have seen. 

- In the node 1002 log, we see all ISRs reduced to itself at {{2016-11-28 
19:57:05}}.

- Approximately 10 seconds later at {{2016-11-28 19:57:16,003}} (as in the 
original issue description) the other two nodes (1001, 1003) both log 
{{java.io.IOException: Connection to 1002 was disconnected before the response 
was read}}.

- After this occurs, we also see an increasing amount of file descriptors 
opening on node 1002.

Checking the zookeeper logs does not indicate any sessions expired at the time 
this issue occurred.


was (Author: tdevoe):
[~junrao] I respectfully disagree, and this is why I originally was hesitant to 
post the extended logs - the shrinking ISR from 1003, 1001, 1002 happened after 
I restarted node 1002 (as is expected).

If we pay attention to the timestamps, the symptoms described in the ticket 
*exactly* match what I have seen. 

- In the node 1002 log, we see all ISRs reduced to itself at {{2016-11-28 
19:57:05}}.

- Approximately 10 seconds later at {{2016-11-28 19:57:16,003}} (as in the 
original issue description) the other two nodes (1001, 1003) both log 
{{java.io.IOException: Connection to 1002 was disconnected before the response 
was read}}.

- After this occurs, we also see an increasing amount of file descriptors 
opening on node 1002.

Checking the zookeeper logs does not indicate *any* sessions expired at the 
time this issue occurred.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are the logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are currently putting in an 
> automated process that tails the logs for the regex below; where new_partitions 
> hits just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-83 - Allow multiple SASL PLAIN authenticated Java clients in a single JVM process

2016-12-13 Thread Edoardo Comar
Thanks for your review, Ismael.

First, I am no longer sure KIP-83 is worth keeping as a KIP; I created it 
just before Rajini's 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-85%3A+Dynamic+JAAS+configuration+for+Kafka+clients
With KIP-85 as presented, my proposal has become a simple JIRA; there are 
no interface changes on top of KIP-85.
So I'll have no objection if you want to retire it as part of your 
cleanup.
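
For concreteness, a per-client JAAS entry under KIP-85 looks roughly like the 
sketch below (placeholder broker address and credentials; shown only to 
illustrate why no extra interfaces are needed on top of KIP-85).

{code}
// Sketch only: per-client SASL/PLAIN setup via the sasl.jaas.config property
// introduced by KIP-85. Broker address and credentials are placeholders.
import java.util.Properties;

public class PerClientJaasSketch {
    public static Properties clientProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"client-1\" password=\"client-1-secret\";");
        return props;
    }
}
{code}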

As for your comments :
1) We can change the map to use the Password object as a key in the 
LoginManager cache, so logging its content won't leak the key.
Though I can't see why we would log the content of the cache.

2) If two clients use the same Jaas Config value, they will obtain the 
same LoginManager.
No new concurrency issue would arise as this happens today with any two 
clients (Producers/Consumers) in the same process.

3) Based on most jaas.config samples I have seen for kerberos and 
sasl/plain, the text used as key should be no larger than 0.5k.

Please let us know of any other concerns you may have, as 
IBM Message Hub is very eager to have the issue 
https://issues.apache.org/jira/browse/KAFKA-4180 merged in the next 
release (February timeframe: 0.10.2? 0.11?), 
so we're waiting for Rajini's 
https://issues.apache.org/jira/browse/KAFKA-4259, on which our changes are 
based.

thanks
Edo
--
Edoardo Comar
IBM MessageHub
eco...@uk.ibm.com
IBM UK Ltd, Hursley Park, SO21 2JN

IBM United Kingdom Limited Registered in England and Wales with number 
741598 Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 
3AU



From:   Ismael Juma 
To: dev@kafka.apache.org
Date:   13/12/2016 12:49
Subject:Re: [DISCUSS] KIP-83 - Allow multiple SASL PLAIN 
authenticated Java clients in a single JVM process
Sent by:isma...@gmail.com



Thanks for the KIP. A few comments:

1. The suggestion is to use the JAAS config value as the key to the map in
`LoginManager`. The config value can include passwords, so we could
potentially end up leaking them if we log the keys of `LoginManager`. This
seems a bit dangerous.

2. If someone uses the same JAAS config value in two clients, they'll get
the same `JaasConfig`, which seems fine, but worth mentioning (it means
that the `JaasConfig` has to be thread-safe).

3. How big can a JAAS config get? Is it an issue to use it as a map key?
Probably not given how this is used, but worth covering in the KIP as 
well.

Ismael

On Tue, Sep 27, 2016 at 10:15 AM, Edoardo Comar  wrote:

> Hi,
> I had a go at a KIP that addresses this JIRA
> https://issues.apache.org/jira/browse/KAFKA-4180
> "Shared authentification with multiple actives Kafka 
producers/consumers"
>
> which is a limitation of the current Java client that we (IBM 
MessageHub)
> get asked quite often lately.
>
> We will have a go at a PR soon, just as a proof of concept, but as it
> introduces new public interfaces it needs a KIP.
>
> I'll welcome your input.
>
> Edo
> --
> Edoardo Comar
> MQ Cloud Technologies
> eco...@uk.ibm.com
> +44 (0)1962 81 5576
> IBM UK Ltd, Hursley Park, SO21 2JN
>
> IBM United Kingdom Limited Registered in England and Wales with number
> 741598 Registered office: PO Box 41, North Harbour, Portsmouth, Hants. 
PO6
> 3AU
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 
3AU
>



Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


[jira] [Updated] (KAFKA-4532) StateStores can be connected to the wrong source topic resulting in incorrect metadata returned from IQ

2016-12-13 Thread Damian Guy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damian Guy updated KAFKA-4532:
--
Description: 
When building a topology with tables and StateStores, the StateStores are 
mapped to the source topic names. This map is retrieved via 
{{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
Queries to find the source topics and partitions when resolving the partitions 
that particular keys will be in.

There is an issue whereby this mapping for a table that is originally created 
with {{builder.table("topic", "table");}}, and then is subsequently used in a 
join, is changed to the join topic. This is because the mapping is updated 
during the call to {{topology.connectProcessorAndStateStores(..)}}. 

In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
for the state store name it should not update the Map.

This also affects Interactive Queries. The metadata returned when trying to 
find the instance for a key from the source table is incorrect as the store 
name is mapped to the incorrect topic.

  was:
When building a topology with tables and StateStores, the StateStores are 
mapped to the source topic names. This map is retrieved via 
{{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
Queries to find the source topics and partitions when resolving the partitions 
that particular keys will be in.

There is an issue whereby this mapping for a table that is originally created 
with {{builder.table("topic", "table");}}, and then is subsequently used in a 
join, is changed to the join topic. This is because the mapping is updated 
during the call to {{topology.connectProcessorAndStateStores(..)}}. 

In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
for the state store name it should not update the Map.


> StateStores can be connected to the wrong source topic resulting in incorrect 
> metadata returned from IQ
> ---
>
> Key: KAFKA-4532
> URL: https://issues.apache.org/jira/browse/KAFKA-4532
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
>Reporter: Damian Guy
>Assignee: Damian Guy
> Fix For: 0.10.2.0
>
>
> When building a topology with tables and StateStores, the StateStores are 
> mapped to the source topic names. This map is retrieved via 
> {{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
> Queries to find the source topics and partitions when resolving the 
> partitions that particular keys will be in.
> There is an issue whereby this mapping for a table that is originally 
> created with {{builder.table("topic", "table");}}, and then is subsequently 
> used in a join, is changed to the join topic. This is because the mapping is 
> updated during the call to {{topology.connectProcessorAndStateStores(..)}}. 
> In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
> for the state store name it should not update the Map.
> This also affects Interactive Queries. The metadata returned when trying 
> to find the instance for a key from the source table is incorrect as the 
> store name is mapped to the incorrect topic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4532) StateStores can be connected to the wrong source topic resulting in incorrect metadata returned from IQ

2016-12-13 Thread Damian Guy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damian Guy updated KAFKA-4532:
--
Summary: StateStores can be connected to the wrong source topic resulting 
in incorrect metadata returned from IQ  (was: StateStores can be connected to 
the wrong source topic)

> StateStores can be connected to the wrong source topic resulting in incorrect 
> metadata returned from IQ
> ---
>
> Key: KAFKA-4532
> URL: https://issues.apache.org/jira/browse/KAFKA-4532
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
>Reporter: Damian Guy
>Assignee: Damian Guy
> Fix For: 0.10.2.0
>
>
> When building a topology with tables and StateStores, the StateStores are 
> mapped to the source topic names. This map is retrieved via 
> {{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
> Queries to find the source topics and partitions when resolving the 
> partitions that particular keys will be in.
> There is an issue whereby this mapping for a table that is originally 
> created with {{builder.table("topic", "table");}}, and then is subsequently 
> used in a join, is changed to the join topic. This is because the mapping is 
> updated during the call to {{topology.connectProcessorAndStateStores(..)}}. 
> In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
> for the state store name it should not update the Map.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] KIP-87 - Add Compaction Tombstone Flag

2016-12-13 Thread Michael Pearce
Hi Ismael

Did you see our email this morning? What are your thoughts on the approach where, 
instead, we simply have a brand new policy?

Cheers
Mike


Sent using OWA for iPhone

From: isma...@gmail.com  on behalf of Ismael Juma 

Sent: Tuesday, December 13, 2016 11:30:05 AM
To: dev@kafka.apache.org
Subject: Re: [VOTE] KIP-87 - Add Compaction Tombstone Flag

Yes, this is actually tricky to do in a way where we both have the desired
semantics and maintain compatibility. When someone creates a
`ProducerRecord` with a `null` value today, the producer doesn't know if
it's meant to be a tombstone or not. For V3 messages, it's easy when the
constructor that takes a tombstone is used. However, if any other
constructor is used, it's not clear. A couple of options I can think of,
none of them particularly nice:

1. Have a third state where tombstone = unknown and the broker would set
the tombstone bit if the value was null and the topic was compacted. People
that wanted to pass a non-null value for the tombstone would have to use
the constructor that takes a tombstone. The drawbacks: third state for
tombstone in message format, message conversion at the broker for a common
case.

2. Extend MetadataResponse to optionally include topic configs, which would
make it possible for the producer to be smarter about setting the
tombstone. It would only do it if a tombstone was not passed explicitly,
the value was null and the topic was compacted. The main drawback is that
the producer would be getting a bit more data for each topic even though it
probably won't use it most of the time. Extending MetadataResponse to
return topic configs would be useful for other reasons as well, so that
part seems OK.
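To make option 1 concrete, here is a toy model using hypothetical classes
(not Kafka's actual ProducerRecord or broker code), where an unset flag is a
third "unknown" state that the broker resolves for compacted topics:

{code}
// Toy model only; class names and fields are illustrative, not Kafka APIs.
final class ToyRecord {
    final byte[] value;
    final Boolean tombstone;   // TRUE, FALSE, or null = unknown (not set by producer)

    ToyRecord(byte[] value, Boolean tombstone) {
        this.value = value;
        this.tombstone = tombstone;
    }
}

final class ToyBroker {
    // Resolve the tombstone flag when the producer did not set it explicitly.
    static boolean resolveTombstone(ToyRecord record, boolean topicIsCompacted) {
        if (record.tombstone != null) {
            return record.tombstone;   // producer stated its intent explicitly
        }
        // Unknown: preserve today's behaviour, i.e. a null value on a compacted
        // topic means delete; anywhere else it is just a null value.
        return record.value == null && topicIsCompacted;
    }
}
{code}

Option 2 would move the same decision to the producer, using topic configs
obtained from an extended MetadataResponse instead of the broker-side check.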

In addition, for both proposals, we could consider adding warnings to the
documentation that the behaviour of the constructors that don't take a
tombstone would change in the next major release so that tombstone = false.
Not sure if this would be worth it though.

Ismael

On Sun, Dec 11, 2016 at 11:15 PM, Ewen Cheslack-Postava 
wrote:

> Michael,
>
> It kind of depends on how you want to interpret the tombstone flag. If it's
> purely a producer-facing Kafka-level thing that we treat as internal to the
> broker and log cleaner once the record is sent, then your approach makes
> sense. You're just copying the null-indicates-delete behavior of the
> old constructor into the tombstone flag.
>
> However, if you want this change to more generally decouple the idea of
> deletion and null values, then you are sometimes converting what might be a
> completely valid null value that doesn't indicate deletion into a
> tombstone. Downstream applications could potentially handle these cases
> differently given the separation of deletion from value.
>
> I guess the question is if we want to try to support the latter even for
> topics where we have older produce requests. An example where this could
> come up is in something like a CDC Connector. If we try to support the
> semantic difference, a connector might write changes to Kafka using the
> tombstone flag to indicate when a row was truly deleted (vs an update that
> sets it to null but still present; this probably makes more sense for CDC
> from document stores or extracting single columns). There are various
> reasons we might want to maintain the full log and not turn compaction on
> (or just use a time-based retention policy), but downstream applications
> might care to know the difference between a delete and a null value. In
> fact both versions of the same log (compacted and time-retention) could be
> useful and I don't think it'll be uncommon to maintain both or use KIP-71
> to maintain a hybrid compacted/retention topic.
>
> -Ewen
>
> On Sun, Dec 11, 2016 at 1:18 PM, Michael Pearce 
> wrote:
>
> > Hi Jay,
> >
> > Why wouldn't that work? The tombstone value is only looked at by the
> > broker on a topic configured for compaction, so it is benign on
> > non-compacted topics. This is the same as sending a null value currently.
> >
> >
> > Regards
> > Mike
> >
> >
> >
> > Sent using OWA for iPhone
> > 
> > From: Jay Kreps 
> > Sent: Sunday, December 11, 2016 8:58:53 PM
> > To: dev@kafka.apache.org
> > Subject: Re: [VOTE] KIP-87 - Add Compaction Tombstone Flag
> >
> > Hey Michael,
> >
> > I'm not quite sure that works as that would translate ALL null values to
> > tombstones, even for non-compacted topics that use null as an acceptable
> > value sent by the producer and expected by the consumer.
> >
> > -Jay
> >
> > On Sun, Dec 11, 2016 at 3:26 AM, Michael Pearce 
> > wrote:
> >
> > > Hi Ewen,
> > >
> > > I think the easiest way to show this is with code.
> > >
> > > As you can see we keep the existing behaviour for code/binaries calling
> > > the pre-existing constructors, whereby 
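
For illustration, a minimal sketch of that kind of constructor layering, using
a hypothetical class and names rather than the actual KIP-87 code: the
pre-existing constructor shape keeps today's null-implies-tombstone behaviour,
while a new constructor takes the flag explicitly.

{code}
// Hypothetical sketch only, not the real ProducerRecord or the KIP-87 patch.
public class ToyProducerRecord<K, V> {
    private final String topic;
    private final K key;
    private final V value;
    private final boolean tombstone;

    // Pre-existing style constructor: a null value implies a tombstone, as today.
    public ToyProducerRecord(String topic, K key, V value) {
        this(topic, key, value, value == null);
    }

    // New constructor: the caller states its intent explicitly.
    public ToyProducerRecord(String topic, K key, V value, boolean tombstone) {
        this.topic = topic;
        this.key = key;
        this.value = value;
        this.tombstone = tombstone;
    }

    public boolean isTombstone() {
        return tombstone;
    }
}
{code}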

[jira] [Created] (KAFKA-4533) subscribe() then poll() on new topic is very slow when subscribed to many topics

2016-12-13 Thread Sergey Alaev (JIRA)
Sergey Alaev created KAFKA-4533:
---

 Summary: subscribe() then poll() on new topic is very slow when 
subscribed to many topics
 Key: KAFKA-4533
 URL: https://issues.apache.org/jira/browse/KAFKA-4533
 Project: Kafka
  Issue Type: Bug
  Components: clients
Affects Versions: 0.10.1.0
Reporter: Sergey Alaev


Given the following case:

consumer.subscribe(my_new_topic, [249 existing topics])
publisher.send(my_new_topic, key, value)
poll(10) until data from my_new_topic arrives

I see data from `my_new_topic` only after approx. 90 seconds.

If I subscribe only to my_new_topic, I get results within seconds.

Logs contain lots of lines like this:

19:28:07.972 [kafka-thread] DEBUG 
org.apache.kafka.clients.consumer.internals.Fetcher - Resetting offset for 
partition demo.com_recipient-2-0 to earliest offset.
19:28:08.247 [kafka-thread] DEBUG 
org.apache.kafka.clients.consumer.internals.Fetcher - Fetched {timestamp=-1, 
offset=0} for partition demo.com_recipient-2-0

These per-partition offset resets should probably be done in a batch.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

