[jira] [Created] (KAFKA-5142) KIP-145 - Expose Record Headers in Kafka Connect

2017-04-29 Thread Michael Andre Pearce (IG) (JIRA)
Michael Andre Pearce (IG) created KAFKA-5142:


 Summary: KIP-145 - Expose Record Headers in Kafka Connect
 Key: KAFKA-5142
 URL: https://issues.apache.org/jira/browse/KAFKA-5142
 Project: Kafka
  Issue Type: New Feature
  Components: clients
Reporter: Michael Andre Pearce (IG)


https://cwiki.apache.org/confluence/display/KAFKA/KIP-145+-+Expose+Record+Headers+in+Kafka+Connect

As KIP-82 introduced Headers into the core Kafka product, it would be 
advantageous to expose them in the Kafka Connect framework.
Connectors that replicate data between Kafka clusters, or between other messaging 
products and Kafka, would want to replicate the headers.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804713#comment-15804713
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/6/17 3:00 PM:
--

[~ijuma], [~becket_qin], [~junrao] - thanks and sorry again for your time, 
effort and patience. 



was (Author: michael.andre.pearce):
[~ijuma][~becket_qin][~junrao] - thanks and sorry again for your time, effort 
and patience. 


> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: com_ig_trade_v1_order_event--demo--compacted-7.tar.bz2, 
> com_ig_trade_v1_position_event--demo--compacted-14.tar.gz, 
> log-cleaner.log.1.zip, vrtstokf005_4thJan
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with the latest Kafka 0.10.1 and the log cleaner thread 
> with regard to the timeindex files. From the log of the log cleaner we see 
> after startup that it tries to clean up a topic called xdc_listing-status-v2 
> [1]. The topic is set up with log compaction [2] and the Kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because it finds messages which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year; it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner, “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970”, also 
> confuses me a bit. Does that mean it does not find any log segments which 
> can be cleaned up, or that the last timestamp of the last log segment is 
> somehow broken/missing?
> I’m also wondering why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect it to keep on cleaning up 
> other topics, but apparently it doesn’t do that; e.g. it’s not even cleaning 
> __consumer_offsets anymore.
> Does anybody have the same issue or can explain what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
> at kafka.log.LogSegment.append(LogSegment.scala:106)
> at kafka.log.Cleaner.cleanInto(LogCl

[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804713#comment-15804713
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

[~ijuma][~becket_qin][~junrao] - thanks and sorry again for your time, effort 
and patience. 



[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804705#comment-15804705
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

[~ijuma]
Yes we all work together. 

Re what tripped us up: regarding the un-versioned kafka.jar in the confluent 
package, is there a reason this is done? (I know this is the Apache JIRA, but 
it is worth asking.)

We are happy for this to be closed. We cannot see or reproduce the issue on 0.10.1.1.


[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804660#comment-15804660
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/6/17 2:46 PM:
--

[~junrao] Sorry about this, it does indeed seem to be our patching that caused 
the issue; now that we have removed the jar in question, all seems to be good.

We have been testing all morning and have not seen the issue reappear.

To explain what occurred - our patching of the confluent package was done by 
replacing all jars (sketched below):

download the latest release from the apache site (2.11 scala version)

expand and copy all jars from:
./kafka_2.11-0.10.1.1/libs

into
confluent-3.1.0/share/java/kafka

then remove all files in
confluent-3.1.0/share/java/kafka
matching *-0.10.1.0-cp1*.jar
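In shell terms, that amounts to roughly the following (a sketch only; paths as 
listed above, and the tarball name assumed from the standard Apache release naming):

{noformat}
# unpack the Apache 0.10.1.1 release (Scala 2.11 build)
tar -xzf kafka_2.11-0.10.1.1.tgz

# copy its libs over the Confluent package's kafka jars
cp kafka_2.11-0.10.1.1/libs/*.jar confluent-3.1.0/share/java/kafka/

# remove the old Confluent-versioned 0.10.1.0 jars
rm confluent-3.1.0/share/java/kafka/*-0.10.1.0-cp1*.jar
{noformat}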

What caught us out was that in the confluent bundle the jar 
"kafka_2.11-0.10.1.0.jar" seems to be renamed to "kafka.jar" without any 
versioning, unlike all the other Apache jars, which are versioned and 
suffixed with -cp1. So on doing the above we ended up with both

kafka_2.11-0.10.1.1.jar and kafka.jar

as kafka.jar was not matched by the removal.

We didn't easily spot this, as obviously we didn't see any 0.10.1.0-versioned 
jars on the class path, and likewise we did see 0.10.1.1 versions of all the 
expected jars.
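With hindsight, simply listing the kafka jars left in the package directory 
makes the stray file visible (an illustrative check; directory as above):

{noformat}
ls confluent-3.1.0/share/java/kafka/ | grep '^kafka'
# the leftover, unversioned kafka.jar (still 0.10.1.0) shows up alongside
# kafka_2.11-0.10.1.1.jar
{noformat}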

Whilst this is obviously the Apache JIRA, we're aware that the confluent bundle 
is maintained by yourselves: is there a reason why this jar is not versioned in 
the confluent bundle like all the others, as per the apache release? It has 
obviously tripped us up.



was (Author: michael.andre.pearce):
[~junrao] Sorry about this, it does indeed seem to be our patching that caused 
the issue, now we have removed the jar in question all seems to be good.

We have been testing all morning. We have not seen the issue exist again.

To explain what occurred - Our patching of the confluent package was done by 
replacing all jars. 

download latest from apache site. (2.11 scala version)

expand and copy all jars from:
./kafka_2.11.-0.10.1.1/libs

into 
confluent-3.1.0/share/java/kafka

then remove all files in
confluent-3.1.0/share/java/kafka
with *-0.10.1.0-cp1*.jar

What has caught us out was that in the confluent bundle the jar 
"kafka_2.11-0.10.1.0.jar" seems to be renamed "kafka.jar" without any 
versioning, unlike all the other Apache jars, which are versioned and are 
appended with -CP so on doing the above we ended up with both

kafka_2.11-0.10.1.1.jar and kafka.jar

as the kafka.jar was not removed.

We didn't easily spot this, as obviously we didn't see any 0.10.1.0 versioned 
jars on the class path, and like wise we did see 0.10.1.1 versions for all the 
expected jars.

Whilst this is obviously the apache jira, we're aware that confluent bundle is 
by yourselves, is there a reason why this is not versioned like all the other 
jars and as per the apache release in the confluent bundle? This obviously has 
tripped us up.



[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804660#comment-15804660
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/6/17 2:45 PM:
--

[~junrao] Sorry about this, it does indeed seem to be our patching that caused 
the issue, now we have removed the jar in question all seems to be good.

We have been testing all morning. We have not seen the issue exist again.

To explain what occurred - Our patching of the confluent package was done by 
replacing all jars. 

download latest from apache site. (2.11 scala version)

expand and copy all jars from:
./kafka_2.11.-0.10.1.1/libs

into 
confluent-3.1.0/share/java/kafka

then remove all files in
confluent-3.1.0/share/java/kafka
with *-0.10.1.0-cp1*.jar

What has caught us out was that in the confluent bundle the jar 
"kafka_2.11-0.10.1.0.jar" seems to be renamed "kafka.jar" without any 
versioning, unlike all the other Apache jars, which are versioned and are 
appended with -CP so on doing the above we ended up with both

kafka_2.11-0.10.1.1.jar and kafka.jar

as the kafka.jar was not removed.

We didn't easily spot this, as obviously we didn't see any 0.10.1.0 versioned 
jars on the class path, and like wise we did see 0.10.1.1 versions for all the 
expected jars.

Whilst this is obviously the apache jira, we're aware that confluent bundle is 
by yourselves, is there a reason why this is not versioned like all the other 
jars and as per the apache release in the confluent bundle? This obviously has 
tripped us up.



was (Author: michael.andre.pearce):
[~junrao] Sorry about this, it does indeed seem to be our patching that caused 
the issue, now we have removed the jar in question all seems to be good.

We have been testing all morning. We have not seen the issue exist again.

To explain what occurred - Our patching of the confluent package was done by 
replacing all jars. 

download latest from apache site. (2.11 scala version)

expand and copy all jars from:
./kafka_2.11.-0.10.1.1/libs

into 
confluent-3.1-2.1/share/java/kafka

then remove all files in
confluent-3.1-2.1/share/java/kafka
with *-0.10.1.0-cp2*.jar

What has caught us out was that in the confluent bundle the jar 
"kafka_2.11-0.10.1.0.jar" seems to be renamed "kafka.jar" without any 
versioning, unlike all the other Apache jars, which are versioned and are 
appended with -CP so on doing the above we ended up with both

kafka_2.11-0.10.1.1.jar and kafka.jar

as the kafka.jar was not removed.

We didn't easily spot this, as obviously we didn't see any 0.10.1.0 versioned 
jars on the class path, and like wise we did see 0.10.1.1 versions for all the 
expected jars.

Whilst this is obviously the apache jira, we're aware that confluent bundle is 
by yourselves, is there a reason why this is not versioned like all the other 
jars and as per the apache release in the confluent bundle? This obviously has 
tripped us up.



[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804660#comment-15804660
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

[~junrao] Sorry about this, it does indeed seem to be our patching that caused 
the issue, now we have removed the jar in question all seems to be good.

We have been testing all morning. We have not seen the issue exist again.

To explain what occurred - Our patching of the confluent package was done by 
replacing all jars. 

download latest from apache site. (2.11 scala version)

expand and copy all jars from:
./kafka_2.11.-0.10.1.1/libs

into 
confluent-3.1-2.1/share/java/kafka

then remove all files in
confluent-3.1-2.1/share/java/kafka
with *-0.10.1.0-cp2*.jar

What has caught us out was that in the confluent bundle the jar 
"kafka_2.11-0.10.1.0.jar" seems to be renamed "kafka.jar" without any 
versioning, unlike all the other Apache jars, which are versioned and are 
appended with -CP so on doing the above we ended up with both

kafka_2.11-0.10.1.1.jar and kafka.jar

as the kafka.jar was not removed.

We didn't easily spot this, as obviously we didn't see any 0.10.1.0 versioned 
jars on the class path, and like wise we did see 0.10.1.1 versions for all the 
expected jars.

Whilst this is obviously the apache jira, we're aware that confluent bundle is 
by yourselves, is there a reason why this is not versioned like all the other 
jars and as per the apache release in the confluent bundle? This obviously has 
tripped us up.



[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802030#comment-15802030
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/6/17 8:35 AM:
--

When we got the issue on 0.10.1.1 we were just as surprised, and as such that 
was the very first thing we checked.

Our internal mail at the time this occurred in UAT (the first env we saw it in 
post the 0.10.1.1 upgrade):

>

Date: Tuesday, 3 January 2017 at 17:07
Subject: Re: Kafka Issues fixed in 0.10.1.1 how we can patch until confluent 
release.

Yeah, I already double-checked that: all the cp1 jars are gone and the 
0.10.1.1 ones exist, pasted below - hope I've missed something...?
 
Class path: 
:/opt/projects/confluent/bin/../share/java/kafka/jetty-continuation-9.2.15.v20160210.jar:/opt/projects/confluent/bin/../share/java/kafka/aopalliance-repackaged-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/connect-file-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/argparse4j-0.5.0.jar:/opt/projects/confluent/bin/../share/java/kafka/connect-json-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/avro-1.7.7.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-beanutils-1.8.3.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-codec-1.9.jar:/opt/projects/confluent/bin/../share/java/kafka/jetty-http-9.2.15.v20160210.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-collections-3.2.1.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-compress-1.4.1.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-digester-1.8.1.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-lang3-3.1.jar:/opt/projects/confluent/bin/../share/java/kafka/log4j-1.2.17.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-logging-1.2.jar:/opt/projects/confluent/bin/../share/java/kafka/lz4-1.3.0.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-validator-1.4.1.jar:/opt/projects/confluent/bin/../share/java/kafka/metrics-core-2.2.0.jar:/opt/projects/confluent/bin/../share/java/kafka/jetty-io-9.2.15.v20160210.jar:/opt/projects/confluent/bin/../share/java/kafka/jetty-util-9.2.15.v20160210.jar:/opt/projects/confluent/bin/../share/java/kafka/jopt-simple-4.9.jar:/opt/projects/confluent/bin/../share/java/kafka/connect-runtime-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/guava-18.0.jar:/opt/projects/confluent/bin/../share/java/kafka/osgi-resource-locator-1.0.1.jar:/opt/projects/confluent/bin/../share/java/kafka/hk2-api-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/paranamer-2.3.jar:/opt/projects/confluent/bin/../share/java/kafka/hk2-locator-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/reflections-0.9.10.jar:/opt/projects/confluent/bin/../share/java/kafka/hk2-utils-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-clients-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/httpclient-4.5.1.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-streams-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/httpcore-4.4.3.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-tools-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/httpmime-4.5.1.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-annotations-2.6.0.jar:/opt/projects/confluent/bin/../share/java/kafka/rocksdbjni-4.9.0.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-core-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/scala-library-2.11.8.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-core-asl-1.9.13.jar:/opt/projects/confluent/bin/../share/java/kafka/slf4j-api-1.7.21.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-databind-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/slf4j-log4j12-1.7.21.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-jaxrs-base-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-jaxrs-json-provider-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-mapper-asl-1.9.13.jar:/opt/projects/confluent/bin/../share/java/kafka/jersey-media-jaxb-2.22.2.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-module-jaxb-annotations-2.6.3.jar:/opt/projects/
confluent/bin/../share/java/kafka/snappy-java-1.1.2.6.jar:/opt/projects/confluent/bin/../share/java/kafka/javassist-3.18.2-GA.jar:/opt/projects/confluent/bin/../share/java/kafka/support-metrics-client-3.1.0.jar:/opt/projects/confluent/bin/../share/java/kafka/javax.annotation-api-1.2.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka_2.11-0.10.1.1-test.jar:/opt/projects/confluent/bin/../share/java/kafka/javax.inject-1.jar:/opt/projects/confluent/bin/../share/java/kafka/support-metrics-common-3.1.0.jar:/opt/projects/confluent/bin/../share/java/kafka/javax.inject-2.4.0-b34.jar:/opt/projects/confluent/bin/../s
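As an aside, a class path dump like this is much easier to audit when split one 
entry per line; for example, assuming the dump has been saved to a hypothetical 
file broker-classpath.txt:

{noformat}
tr ':' '\n' < broker-classpath.txt | grep -i kafka | sort
{noformat}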

[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802030#comment-15802030
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/6/17 8:33 AM:
--

When we got the issue on 0.10.1.1 we were just as surprised and as such that 
was our very first thing we checked.

our internal mail at the time when this occured in UAT (first env we saw it 
post 0.10.1.1 upgrade):

>

From: Andrew Holford 
Date: Tuesday, 3 January 2017 at 17:07
Subject: Re: Kafka Issues fixed in 0.10.1.1 how we can patch until confluent 
release.

Yeah I already double checked that, all the cp1 jars are gone and the 
0.10.1.1’s exist, pasted below, hope ive missed something..?
 
Class path: 
:/opt/projects/confluent/bin/../share/java/kafka/jetty-continuation-9.2.15.v20160210.jar:/opt/projects/confluent/bin/../share/java/kafka/aopalliance-repackaged-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/connect-file-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/argparse4j-0.5.0.jar:/opt/projects/confluent/bin/../share/java/kafka/connect-json-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/avro-1.7.7.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-beanutils-1.8.3.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-codec-1.9.jar:/opt/projects/confluent/bin/../share/java/kafka/jetty-http-9.2.15.v20160210.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-collections-3.2.1.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-compress-1.4.1.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-digester-1.8.1.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-lang3-3.1.jar:/opt/projects/confluent/bin/../share/java/kafka/log4j-1.2.17.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-logging-1.2.jar:/opt/projects/confluent/bin/../share/java/kafka/lz4-1.3.0.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-validator-1.4.1.jar:/opt/projects/confluent/bin/../share/java/kafka/metrics-core-2.2.0.jar:/opt/projects/confluent/bin/../share/java/kafka/jetty-io-9.2.15.v20160210.jar:/opt/projects/confluent/bin/../share/java/kafka/jetty-util-9.2.15.v20160210.jar:/opt/projects/confluent/bin/../share/java/kafka/jopt-simple-4.9.jar:/opt/projects/confluent/bin/../share/java/kafka/connect-runtime-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/guava-18.0.jar:/opt/projects/confluent/bin/../share/java/kafka/osgi-resource-locator-1.0.1.jar:/opt/projects/confluent/bin/../share/java/kafka/hk2-api-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/paranamer-2.3.jar:/opt/projects/confluent/bin/../share/java/kafka/hk2-locator-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/reflections-0.9.10.jar:/opt/projects/confluent/bin/../share/java/kafka/hk2-utils-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-clients-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/httpclient-4.5.1.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-streams-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/httpcore-4.4.3.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-tools-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/httpmime-4.5.1.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-annotations-2.6.0.jar:/opt/projects/confluent/bin/../share/java/kafka/rocksdbjni-4.9.0.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-core-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/scala-library-2.11.8.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-core-asl-1.9.13.jar:/opt/projects/confluent/bin/../share/java/kafka/slf4j-api-1.7.21.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-databind-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/slf4j-log4j12-1.7.21.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-jaxrs-base-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-jaxrs-json-provider-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-mapper-asl-1.9.13.jar:/opt/projects/confluent/bin/../share/java/kafka/jersey-media-jaxb-2.22.2.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-module-jaxb-annotations-2.6.3.jar:/opt/projects/
confluent/bin/../share/java/kafka/snappy-java-1.1.2.6.jar:/opt/projects/confluent/bin/../share/java/kafka/javassist-3.18.2-GA.jar:/opt/projects/confluent/bin/../share/java/kafka/support-metrics-client-3.1.0.jar:/opt/projects/confluent/bin/../share/java/kafka/javax.annotation-api-1.2.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka_2.11-0.10.1.1-test.jar:/opt/projects/confluent/bin/../share/java/kafka/javax.inject-1.jar:/opt/projects/confluent/bin/../share/java/kafka/support-metrics-common-3.1.0.jar:/opt/projects/confluent/bin/../share/java/kafka/javax.inject-2.4.0-b34.jar:/opt/proje

[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803984#comment-15803984
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/6/17 8:31 AM:
--

[~junrao] As noted, we are patching the confluent 3.1.0 build with 0.10.1.1, 
just as we will deploy 3.1.2 when it's released (we assume you're 
ensuring all bugs really are fixed before releasing). To be sure, is there a 
SNAPSHOT build / package of 3.1.2 with kafka 0.10.1.1 that you have? That way 
we can avoid any possible bad patching or discrepancies on our side.


was (Author: michael.andre.pearce):
As noted, we are patching the confluent 3.1.0 build with 0.10.1.1, just as 
anyway we will deploy 3.1.2 when its released (we assume you're ensuring all 
bugs really are fixed before releasing) to be sure, is there an SNAPSHOT build 
/ package of the 3.1.2 with kafka 0.10.1.1 you have this way can avoid any 
possible bad patching or discrepancies by ourselves.


[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803984#comment-15803984
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/6/17 8:30 AM:
--

As noted, we are patching the confluent 3.1.0 build with 0.10.1.1, just as 
anyway we will deploy 3.1.2 when its released (we assume you're ensuring all 
bugs really are fixed before releasing) to be sure, is there an SNAPSHOT build 
/ package of the 3.1.2 with kafka 0.10.1.1 you have this way can avoid any 
possible bad patching or discrepancies by ourselves.


was (Author: michael.andre.pearce):
As noted, we are patching the confluent 3.1.0 build with 0.10.1.1, just as 
anyway we will deploy 3.1.2 when its released (we assume you're ensuring all 
bugs really are fixed before releasing) to be sure, is there an SNAPSHOT build 
/ package of the 3.1.2 you guys have this way can avoid any possible bad 
patching or discrepancies by ourselves.


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803984#comment-15803984
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

As noted, we are patching the confluent 3.1.0 build with 0.10.1.1, just as 
anyway we will deploy 3.1.2 when its released (we assume you're ensuring all 
bugs really are fixed before releasing) to be sure, is there an SNAPSHOT build 
/ package of the 3.1.2 you guys have this way can avoid any possible bad 
patching or discrepancies by ourselves.


[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803919#comment-15803919
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/6/17 8:17 AM:
--

We have found one discrepancy - an old unversioned jar on the class path. We are 
going to remove it and retest; if you hold off doing any further investigation, 
we will let you know once we have re-deployed and checked. This indeed might be 
our bad.


was (Author: michael.andre.pearce):
we have one discrepancy found, going to remove it and retest


[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803829#comment-15803829
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/6/17 8:03 AM:
--

[~junrao] We are fairly confident the jars are good based on the process classpath; see above, 
we double-checked the process classpath again when the issue occurred last night, 
after reducing min.cleanable.dirty.ratio to make the cleaner kick in. Note that 
all of the Kafka jars are 0.10.1.1.

We can offer to deploy a custom-built jar with further logging/debug in it 
for you if you wish (as we have this in a testing environment), or we can 
even host a WebEx or similar so you can look at the system yourself.
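
For completeness, the ratio was lowered with the usual topic-level override; a minimal sketch using the 0.10.x-era ZooKeeper admin API is below. The ZooKeeper connect string and topic name are placeholders, and kafka-configs.sh does the same job from the shell.

{code}
// Sketch only: lower min.cleanable.dirty.ratio on a compacted topic so the
// log cleaner picks it up sooner. Uses the 0.10.x-era AdminUtils/ZkUtils API;
// the ZooKeeper connect string and topic name below are placeholders.
import java.util.Properties

import kafka.admin.AdminUtils
import kafka.utils.ZkUtils

object LowerDirtyRatio {
  def main(args: Array[String]): Unit = {
    val zkUtils = ZkUtils("localhost:2181", 30000, 30000, isZkSecurityEnabled = false)
    try {
      val overrides = new Properties()
      overrides.put("min.cleanable.dirty.ratio", "0.01") // default is 0.5; lower => cleaner runs sooner
      AdminUtils.changeTopicConfig(zkUtils, "com_ig_trade_v1_position_event--demo--compacted", overrides)
    } finally {
      zkUtils.close()
    }
  }
}
{code}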

 


was (Author: michael.andre.pearce):
[~junrao] We are 99.9% jars are good, see above we double checked again the 
process class path on getting the issue last night, post reducing 
min.cleanable.dirty.ratio to make the cleaner kick in. If you note all jars are 
0.10.1.1 that are kafka.

We can offer to deploy a custom built jar, which can contain further 
logging/debug in for you if you wish (as we have this in a testing env) or we 
can offer to even host a webex or something so you can look at the system your 
self.

 

> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: com_ig_trade_v1_order_event--demo--compacted-7.tar.bz2, 
> com_ig_trade_v1_position_event--demo--compacted-14.tar.gz, 
> log-cleaner.log.1.zip, vrtstokf005_4thJan
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because he finds messages, which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure, where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean, it does not find any log segments which 
> can be cleaned up or the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering, why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580

[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-06 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803919#comment-15803919
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

We have found one discrepancy; we are going to remove it and retest.

> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: com_ig_trade_v1_order_event--demo--compacted-7.tar.bz2, 
> com_ig_trade_v1_position_event--demo--compacted-14.tar.gz, 
> log-cleaner.log.1.zip, vrtstokf005_4thJan
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because he finds messages, which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure, where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean, it does not find any log segments which 
> can be cleaned up or the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering, why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
> at kafka.log.LogSegment.append(LogSegment.scala:106)
> at kafka.log.Cleaner.cleanInto(LogCleaner.scala:518)
> at 
> k

[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803893#comment-15803893
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

We are going to MD5-checksum every file this morning.
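
A minimal sketch of that checksum pass is below (the jar directory is a placeholder; the point is simply to diff the sums against those of the official release artifacts).

{code}
// Sketch: print an MD5 sum for every jar under the Kafka lib directory so the
// values can be compared against the official release artifacts. The directory
// path is a placeholder.
import java.io.File
import java.nio.file.Files
import java.security.MessageDigest

object Md5Jars {
  def main(args: Array[String]): Unit = {
    val dir = new File("/opt/projects/confluent/share/java/kafka")
    val jars = Option(dir.listFiles()).getOrElse(Array.empty[File])
      .filter(_.getName.endsWith(".jar"))
    jars.sortBy(_.getName).foreach { jar =>
      val md5 = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(jar.toPath))
      println(md5.map("%02x".format(_)).mkString + "  " + jar.getName)
    }
  }
}
{code}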

> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: com_ig_trade_v1_order_event--demo--compacted-7.tar.bz2, 
> com_ig_trade_v1_position_event--demo--compacted-14.tar.gz, 
> log-cleaner.log.1.zip, vrtstokf005_4thJan
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because he finds messages, which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure, where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean, it does not find any log segments which 
> can be cleaned up or the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering, why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
> at kafka.log.LogSegment.append(LogSegment.scala:106)
> at kafka.log.Cleaner.cleanInto(LogCleaner.scala:518)
> at 
> ka

[jira] [Issue Comment Deleted] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Andre Pearce (IG) updated KAFKA-4497:
-
Comment: was deleted

(was: [~junrao] We are 99.9% jars are good, see above we double checked again 
the process class path on getting the issue last night, post reducing 
min.cleanable.dirty.ratio to make the cleaner kick in. If you note all jars are 
0.10.1.1 that are kafka.

We can offer to deploy a custom built jar, which can contain further 
logging/debug in for you if you wish (as we have this in a testing env) or we 
can offer to even host a webex or something so you can look at the system your 
self.

 )

> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: com_ig_trade_v1_order_event--demo--compacted-7.tar.bz2, 
> com_ig_trade_v1_position_event--demo--compacted-14.tar.gz, 
> log-cleaner.log.1.zip, vrtstokf005_4thJan
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because he finds messages, which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure, where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean, it does not find any log segments which 
> can be cleaned up or the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering, why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonf

[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803829#comment-15803829
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

[~junrao] We are 99.9% sure the jars are good; see above, we double-checked the 
process classpath again when the issue occurred last night, after reducing 
min.cleanable.dirty.ratio to make the cleaner kick in. Note that all of the 
Kafka jars are 0.10.1.1.

We can offer to deploy a custom-built jar with further logging/debug in it 
for you if you wish (as we have this in a testing environment), or we can 
even host a WebEx or similar so you can look at the system yourself.

 

> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: com_ig_trade_v1_order_event--demo--compacted-7.tar.bz2, 
> com_ig_trade_v1_position_event--demo--compacted-14.tar.gz, 
> log-cleaner.log.1.zip, vrtstokf005_4thJan
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because he finds messages, which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure, where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean, it does not find any log segments which 
> can be cleaned up or the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering, why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> k

[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803830#comment-15803830
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

[~junrao] We are 99.9% sure the jars are good; see above, we double-checked the 
process classpath again when the issue occurred last night, after reducing 
min.cleanable.dirty.ratio to make the cleaner kick in. Note that all of the 
Kafka jars are 0.10.1.1.

We can offer to deploy a custom-built jar with further logging/debug in it 
for you if you wish (as we have this in a testing environment), or we can 
even host a WebEx or similar so you can look at the system yourself.

 

> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: com_ig_trade_v1_order_event--demo--compacted-7.tar.bz2, 
> com_ig_trade_v1_position_event--demo--compacted-14.tar.gz, 
> log-cleaner.log.1.zip, vrtstokf005_4thJan
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because he finds messages, which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure, where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean, it does not find any log segments which 
> can be cleaned up or the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering, why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> k

[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802412#comment-15802412
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

Confirming the classpath via the running process: 

conflue+ 14435 14401  1 13:21 ?00:05:53 
/opt/projects/java/sun-jdk8-1.8.0.66/jre/bin/java -cp 
:/opt/projects/confluent/bin/../share/java/kafka/commons-beanutils-1.8.3.jar:/opt/projects/confluent/bin/../share/java/kafka/support-metrics-client-3.1.0.jar:/opt/projects/confluent/bin/../share/java/kafka/argparse4j-0.5.0.jar:/opt/projects/confluent/bin/../share/java/kafka/avro-1.7.7.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-codec-1.9.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka.jar:/opt/projects/confluent/bin/../share/java/kafka/support-metrics-common-3.1.0.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-collections-3.2.1.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-digester-1.8.1.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-lang3-3.1.jar:/opt/projects/confluent/bin/../share/java/kafka/lz4-1.3.0.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka_2.11-0.10.1.1-test.jar:/opt/projects/confluent/bin/../share/java/kafka/xz-1.0.jar:/opt/projects/confluent/bin/../share/java/kafka/connect-api-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/connect-runtime-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-log4j-appender-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/snappy-java-1.1.2.6.jar:/opt/projects/confluent/bin/../share/java/kafka/zkclient-0.9.jar:/opt/projects/confluent/bin/../share/java/kafka/httpclient-4.5.1.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka_2.11-0.10.1.1-scaladoc.jar:/opt/projects/confluent/bin/../share/java/kafka/httpcore-4.4.3.jar:/opt/projects/confluent/bin/../share/java/kafka/connect-file-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-clients-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-streams-examples-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka_2.11-0.10.1.1-sources.jar:/opt/projects/confluent/bin/../share/java/kafka/zookeeper-3.4.8.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-mapper-asl-1.9.13.jar:/opt/projects/confluent/bin/../share/java/kafka/httpmime-4.5.1.jar:/opt/projects/confluent/bin/../share/java/kafka/javax.annotation-api-1.2.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-tools-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/connect-json-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-streams-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka_2.11-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/javax.inject-1.jar:/opt/projects/confluent/bin/../share/java/kafka/rocksdbjni-4.9.0.jar:/opt/projects/confluent/bin/../share/java/kafka/javax.servlet-api-3.1.0.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-core-asl-1.9.13.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-logging-1.2.jar:/opt/projects/confluent/bin/../share/java/kafka/javax.ws.rs-api-2.0.1.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-validator-1.4.1.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka_2.11-0.10.1.1-javadoc.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka_2.11-0.10.1.1-test-sources.jar:/opt/projects/confluent/bin/../share/java/kafka/hk2-locator-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/hk2-utils-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/jetty-util-9.2.15.v20160210.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-databind-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-jaxrs-base-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-jaxrs-json-provider-2.6.3.jar:/opt/projects
/confluent/bin/../share/java/kafka/jackson-module-jaxb-annotations-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/javassist-3.18.2-GA.jar:/opt/projects/confluent/bin/../share/java/kafka/javax.inject-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/reflections-0.9.10.jar:/opt/projects/confluent/bin/../share/java/kafka/jersey-client-2.22.2.jar:/opt/projects/confluent/bin/../share/java/kafka/jersey-common-2.22.2.jar:/opt/projects/confluent/bin/../share/java/kafka/jopt-simple-4.9.jar:/opt/projects/confluent/bin/../share/java/kafka/jersey-container-servlet-2.22.2.jar:/opt/projects/confluent/bin/../share/java/kafka/jersey-container-servlet-core-2.22.2.jar:/opt/projects/confluent/bin/../share/java/kafka/jersey-guava-2.22.2.jar:/opt/projects/confluent/bin/../share/java/kafka/jersey-media-jaxb-2.22.2.jar:/opt/projects/confluent/bin/../share/java/kafka/scala-library-2.11.8.jar:/opt/projects/confluent/bin/../share/java/kafka/log4j-1.2.17.jar:/opt/projects/confluent/bin/../share/java/kafka/metr

[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802394#comment-15802394
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

[~junrao] [~becket_qin] Is there anything else you need? It has just happened again, so there is 
a better chance of capturing anything extra.

[2017-01-05 19:37:07,144] ERROR [kafka-log-cleaner-thread-0], Error due to  
(kafka.log.LogCleaner)
kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
20 no larger than the last offset appended (50346) to /v
ar/kafka/logs/com_ig_trade_v1_position_event--demo--compacted-14/.timeindex.cleaned.
at 
kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
at kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
at kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
at kafka.log.LogSegment.append(LogSegment.scala:106)
at kafka.log.Cleaner.cleanInto(LogCleaner.scala:518)
at 
kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:404)
at 
kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:400)
at scala.collection.immutable.List.foreach(List.scala:381)
at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:400)
at kafka.log.Cleaner$$anonfun$clean$4.apply(LogCleaner.scala:364)
at kafka.log.Cleaner$$anonfun$clean$4.apply(LogCleaner.scala:363)
at scala.collection.immutable.List.foreach(List.scala:381)
at kafka.log.Cleaner.clean(LogCleaner.scala:363)
at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:239)
at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:218)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2017-01-05 19:37:07,144] INFO [kafka-log-cleaner-thread-0], Stopped  
(kafka.log.LogCleaner)
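
To make the failure above concrete: the time index only accepts entries whose offset is strictly larger than the last offset appended, so an entry carrying offset -1 is rejected as soon as any entry exists. The snippet below is a simplified illustration of that invariant, not the actual kafka.log.TimeIndex code.

{code}
// Simplified illustration of the invariant behind the InvalidOffsetException
// above: every appended (timestamp, offset) entry must carry an offset strictly
// larger than the previous one. This is a toy model, not kafka.log.TimeIndex.
final class ToyTimeIndex {
  private var entries = Vector.empty[(Long, Long)] // (timestamp, offset)

  def maybeAppend(timestamp: Long, offset: Long): Unit = {
    entries.lastOption.foreach { case (_, lastOffset) =>
      if (offset <= lastOffset)
        throw new IllegalStateException(
          s"Attempt to append an offset ($offset) no larger than the last offset appended ($lastOffset)")
    }
    entries = entries :+ (timestamp -> offset)
  }
}

object ToyTimeIndexDemo {
  def main(args: Array[String]): Unit = {
    val idx = new ToyTimeIndex
    idx.maybeAppend(1483645027144L, 50346L) // fine
    idx.maybeAppend(0L, -1L)                // throws, mirroring the cleaner error above
  }
}
{code}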

> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: com_ig_trade_v1_order_event--demo--compacted-7.tar.bz2, 
> com_ig_trade_v1_position_event--demo--compacted-14.tar.gz, 
> log-cleaner.log.1.zip, vrtstokf005_4thJan
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because he finds messages, which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure, where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean, it does not find any log segments which 
> can be cleaned up or the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering, why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,9

[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802390#comment-15802390
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

log4j log-cleaner.log with TRACE enabled: 
https://issues.apache.org/jira/secure/attachment/12845849/log-cleaner.log.1.zip

> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: com_ig_trade_v1_order_event--demo--compacted-7.tar.bz2, 
> com_ig_trade_v1_position_event--demo--compacted-14.tar.gz, 
> log-cleaner.log.1.zip, vrtstokf005_4thJan
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because he finds messages, which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure, where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean, it does not find any log segments which 
> can be cleaned up or the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering, why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
> at kafka.log.LogSegment.append(LogSegment.scala:106)
> at kafka.log.Cl

[jira] [Updated] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Andre Pearce (IG) updated KAFKA-4497:
-
Attachment: log-cleaner.log.1.zip

Log cleaner log for the error on 0.10.1.1 at 2017-01-05 19:37:07,144

> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: com_ig_trade_v1_order_event--demo--compacted-7.tar.bz2, 
> com_ig_trade_v1_position_event--demo--compacted-14.tar.gz, 
> log-cleaner.log.1.zip, vrtstokf005_4thJan
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because he finds messages, which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure, where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean, it does not find any log segments which 
> can be cleaned up or the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering, why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
> at kafka.log.LogSegment.append(LogSegment.scala:106)
> at kafka.log.Cleaner.cleanInto(LogCleaner.scala:518)
> at 
> kafka.log.Cleaner$$

[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802365#comment-15802365
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

https://issues.apache.org/jira/secure/attachment/12845847/com_ig_trade_v1_position_event--demo--compacted-14.tar.gz

> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: com_ig_trade_v1_order_event--demo--compacted-7.tar.bz2, 
> com_ig_trade_v1_position_event--demo--compacted-14.tar.gz, vrtstokf005_4thJan
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because he finds messages, which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure, where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean, it does not find any log segments which 
> can be cleaned up or the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering, why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
> at kafka.log.LogSegment.append(LogSegment.scala:106)
> at kafka.log.Cleaner.cleanInto(LogCleaner

[jira] [Updated] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Andre Pearce (IG) updated KAFKA-4497:
-
Attachment: com_ig_trade_v1_position_event--demo--compacted-14.tar.gz

Partition/segment logs for the issue on 0.10.1.1 - 2017-01-05 19:37:07,144


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802326#comment-15802326
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

[2017-01-05 19:37:07,137] TRACE Inserting 88493 bytes at offset 49647 at 
position 4393661 with largest timestamp 1480518430821 at offset 49859 
(kafka.log.LogSegment)
[2017-01-05 19:37:07,138] INFO Cleaner 0: Cleaning segment 49860 in log 
com_ig_trade_v1_position_event--demo--compacted-14 (largest timestamp Wed Dec 
07 15:06:26 GMT 2016) into 0, retaining deletes. (kafka.log.LogCleaner)
[2017-01-05 19:37:07,138] TRACE Inserting 15951 bytes at offset 49863 at 
position 4482154 with largest timestamp 1481123186887 at offset 49971 
(kafka.log.LogSegment)
[2017-01-05 19:37:07,138] INFO Cleaner 0: Cleaning segment 49972 in log 
com_ig_trade_v1_position_event--demo--compacted-14 (largest timestamp Wed Dec 
14 14:22:00 GMT 2016) into 0, retaining deletes. (kafka.log.LogCleaner)
[2017-01-05 19:37:07,138] TRACE Inserting 7496 bytes at offset 49972 at 
position 4498105 with largest timestamp 1481722619382 at offset 50151 
(kafka.log.LogSegment)
[2017-01-05 19:37:07,138] INFO Cleaner 0: Cleaning segment 50155 in log 
com_ig_trade_v1_position_event--demo--compacted-14 (largest timestamp Wed Dec 
21 15:26:33 GMT 2016) into 0, retaining deletes. (kafka.log.LogCleaner)
[2017-01-05 19:37:07,138] TRACE Inserting 12632 bytes at offset 50158 at 
position 4505601 with largest timestamp 1482321739128 at offset 50346 
(kafka.log.LogSegment)
[2017-01-05 19:37:07,138] INFO Cleaner 0: Cleaning segment 50349 in log 
com_ig_trade_v1_position_event--demo--compacted-14 (largest timestamp Wed Dec 
28 16:07:21 GMT 2016) into 0, retaining deletes. (kafka.log.LogCleaner)
[2017-01-05 19:37:07,140] TRACE Inserting 11743 bytes at offset 50350 at 
position 4518233 with largest timestamp 1482933300393 at offset -1 
(kafka.log.LogSegment)
[2017-01-05 19:37:07,144] ERROR [kafka-log-cleaner-thread-0], Error due to  
(kafka.log.LogCleaner)
kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
20 no larger than the last offset appended (50346) to 
/var/kafka/logs/com_ig_trade_v1_position_event--demo--compacted-14/.timeindex.cleaned.
at 
kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
at kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
at kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
at kafka.log.LogSegment.append(LogSegment.scala:106)
at kafka.log.Cleaner.cleanInto(LogCleaner.scala:518)
at 
kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:404)
at 
kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:400)
at scala.collection.immutable.List.foreach(List.scala:381)
at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:400)
at kafka.log.Cleaner$$anonfun$clean$4.apply(LogCleaner.scala:364)
at kafka.log.Cleaner$$anonfun$clean$4.apply(LogCleaner.scala:363)
at scala.collection.immutable.List.foreach(List.scala:381)
at kafka.log.Cleaner.clean(LogCleaner.scala:363)
at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:239)
at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:218)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
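
For reference, the invariant this trace is reporting: the time index refuses to append an entry whose offset is not larger than the last offset already appended, so the largest-timestamp offset of -1 seen in the final TRACE line above can never be written. Below is a minimal, simplified sketch of that check, using made-up in-memory types; it is not the real kafka.log.TimeIndex, just an illustration of why the append fails and the cleaner thread errors out.

{noformat}
// Simplified sketch only -- illustrative types, not the actual Kafka classes.
case class TimeIndexEntry(timestamp: Long, offset: Long)

class InvalidOffsetException(msg: String) extends RuntimeException(msg)

class SimpleTimeIndex {
  private var entries = Vector.empty[TimeIndexEntry]

  def maybeAppend(timestamp: Long, offset: Long): Unit = {
    entries.lastOption.foreach { lastEntry =>
      // The invariant reported by the exception above: every new offset must be
      // strictly larger than the last offset already written to the index.
      if (offset <= lastEntry.offset)
        throw new InvalidOffsetException(
          s"Attempt to append an offset ($offset) no larger than the last offset appended (${lastEntry.offset})")
    }
    // Only add an entry when the timestamp actually advances (hence "maybe").
    if (entries.lastOption.forall(lastEntry => timestamp > lastEntry.timestamp))
      entries = entries :+ TimeIndexEntry(timestamp, offset)
  }
}

object TimeIndexDemo extends App {
  val idx = new SimpleTimeIndex
  idx.maybeAppend(timestamp = 1482321739128L, offset = 50346L)
  idx.maybeAppend(timestamp = 1482933300393L, offset = -1L) // fails, as in the ERROR above
}
{noformat}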


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802319#comment-15802319
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

Perfect, that triggered it to run, and now it has occurred again; just going to 
capture all the bits for you again.


[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802233#comment-15802233
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/5/17 7:14 PM:
--

[~junrao], we took it from the website, via the links. (Edited to remove the exact 
mirror link, as I'm actually not 100% sure we downloaded it from the website; I 
will confirm exactly which mirror tomorrow if it makes any difference.)

Will look to reduce that down then.


was (Author: michael.andre.pearce):
[~junrao] , we took it from the website, via the links, downloading from the 
first mirror site, 
http://apache.mirror.anlx.net/kafka/0.10.1.1/kafka_2.11-0.10.1.1.tgz

Will look to reduce that down then.


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802233#comment-15802233
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

[~junrao], we took it from the website, via the links, downloading from the 
first mirror site: 
http://apache.mirror.anlx.net/kafka/0.10.1.1/kafka_2.11-0.10.1.1.tgz

Will look to reduce that down then.


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802034#comment-15802034
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

Btw, since re-rolling 0.10.1.1 forward into a testing env again, we're still 
waiting for a re-occurrence.


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802030#comment-15802030
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

When we got the issue on 0.10.1.1 we were just as surprised, and as such that 
was the very first thing we checked.

Our internal mail at the time when this occurred in UAT (the first env we saw it 
in post the 0.10.1.1 upgrade):

>

From: Andrew Holford 
Date: Tuesday, 3 January 2017 at 17:07
To: Michael Pearce , Thomas Brown 
Cc: William Hargrove , Anton Goldenfarb 
, Neil Laurance , Gavin Sandie 

Subject: Re: Kafka Issues fixed in 0.10.1.1 how we can patch until confluent 
release.

Yeah, I already double checked that; all the cp1 jars are gone and the 
0.10.1.1 ones exist, pasted below. Hope I've missed something...?
 
Class path: 
:/opt/projects/confluent/bin/../share/java/kafka/jetty-continuation-9.2.15.v20160210.jar:/opt/projects/confluent/bin/../share/java/kafka/aopalliance-repackaged-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/connect-file-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/argparse4j-0.5.0.jar:/opt/projects/confluent/bin/../share/java/kafka/connect-json-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/avro-1.7.7.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-beanutils-1.8.3.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-codec-1.9.jar:/opt/projects/confluent/bin/../share/java/kafka/jetty-http-9.2.15.v20160210.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-collections-3.2.1.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-compress-1.4.1.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-digester-1.8.1.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-lang3-3.1.jar:/opt/projects/confluent/bin/../share/java/kafka/log4j-1.2.17.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-logging-1.2.jar:/opt/projects/confluent/bin/../share/java/kafka/lz4-1.3.0.jar:/opt/projects/confluent/bin/../share/java/kafka/commons-validator-1.4.1.jar:/opt/projects/confluent/bin/../share/java/kafka/metrics-core-2.2.0.jar:/opt/projects/confluent/bin/../share/java/kafka/jetty-io-9.2.15.v20160210.jar:/opt/projects/confluent/bin/../share/java/kafka/jetty-util-9.2.15.v20160210.jar:/opt/projects/confluent/bin/../share/java/kafka/jopt-simple-4.9.jar:/opt/projects/confluent/bin/../share/java/kafka/connect-runtime-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/guava-18.0.jar:/opt/projects/confluent/bin/../share/java/kafka/osgi-resource-locator-1.0.1.jar:/opt/projects/confluent/bin/../share/java/kafka/hk2-api-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/paranamer-2.3.jar:/opt/projects/confluent/bin/../share/java/kafka/hk2-locator-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/reflections-0.9.10.jar:/opt/projects/confluent/bin/../share/java/kafka/hk2-utils-2.4.0-b34.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-clients-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/httpclient-4.5.1.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-streams-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/httpcore-4.4.3.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka-tools-0.10.1.1.jar:/opt/projects/confluent/bin/../share/java/kafka/httpmime-4.5.1.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-annotations-2.6.0.jar:/opt/projects/confluent/bin/../share/java/kafka/rocksdbjni-4.9.0.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-core-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/scala-library-2.11.8.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-core-asl-1.9.13.jar:/opt/projects/confluent/bin/../share/java/kafka/slf4j-api-1.7.21.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-databind-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/slf4j-log4j12-1.7.21.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-jaxrs-base-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-jaxrs-json-provider-2.6.3.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-mapper-asl-1.9.13.jar:/opt/projects/confluent/bin/../share/java/kafka/jersey-media-jaxb-2.22.2.jar:/opt/projects/confluent/bin/../share/java/kafka/jackson-module-jaxb-annotations-2.6.3.jar:/opt/projects/
confluent/bin/../share/java/kafka/snappy-java-1.1.2.6.jar:/opt/projects/confluent/bin/../share/java/kafka/javassist-3.18.2-GA.jar:/opt/projects/confluent/bin/../share/java/kafka/support-metrics-client-3.1.0.jar:/opt/projects/confluent/bin/../share/java/kafka/javax.annotation-api-1.2.jar:/opt/projects/confluent/bin/../share/java/kafka/kafka_2.11-0.10.1.1-test.jar:/opt/projects/confluent/bin/../share/java/kafka/javax.inject-1.jar:/opt/projects/confluent/bin/../share/java/kafka/support-metrics-common-3.1.0.jar:/opt/projects/confluent
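
For anyone repeating the same verification, here is a quick illustrative sketch of that check, assuming you can run a small Scala program on the same classpath; the helper name and the expectedVersion constant are made up for illustration and are not Kafka or Confluent tooling.

{noformat}
// Hypothetical helper: list the kafka/connect jars on the JVM classpath and
// flag any that are not the expected 0.10.1.1 build.
object ClasspathCheck extends App {
  val expectedVersion = "0.10.1.1"

  val kafkaJars = sys.props.getOrElse("java.class.path", "")
    .split(java.io.File.pathSeparator)
    .map(path => new java.io.File(path).getName)          // keep only the jar file name
    .filter(name => name.endsWith(".jar"))
    .filter(name => name.contains("kafka") || name.contains("connect"))
    .distinct
    .sorted

  kafkaJars.foreach(println)

  val stale = kafkaJars.filterNot(_.contains(expectedVersion))
  if (stale.nonEmpty)
    println(s"WARNING: jars not on $expectedVersion: ${stale.mkString(", ")}")
  else
    println(s"All kafka/connect jars are on $expectedVersion")
}
{noformat}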

[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15801230#comment-15801230
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/5/17 12:37 PM:
---

[~becket_qin] we have managed to recover the log4j cleaner log from yesterday, 
from the same node we supplied the log segments from, when we had it occur on 
0.10.1.1.

We have now re-rolled that testing environment forward onto 0.10.1.1 again and 
are waiting for it to re-occur. (Bear in mind we had been running for 6 weeks 
before we saw the initial issue on the 0.10.1.0 version; whilst subsequently we 
have seen it re-occur much faster, we cannot tell how long we will have to wait.)


was (Author: michael.andre.pearce):
[~becket_qin] we have managed to recover the log4j cleaner log from yesterdays, 
from the same node we supplied the log segments, when we had it occur on 
0.10.1.1.

We have now re rolled back forward that testing environment back onto 0.10.1.1 
and waiting for it to re-occur. (bear in mind we had been running 6 weeks 
before we saw the initial issue on 0.10.1.0 issue, whilst subsequently we have 
seen it re-occur much faster, we cannot tell how long we will have to wait)


[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15801230#comment-15801230
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/5/17 12:36 PM:
---

[~becket_qin] we have managed to recover the log4j cleaner log from yesterdays, 
from the same node we supplied the log segments, when we had it occur on 
0.10.1.1.

We have now re rolled back forward that testing environment back onto 0.10.1.1 
and waiting for it to re-occur. (bear in mind we had been running 6 weeks 
before we saw the initial issue on 0.10.1.0 issue, whilst subsequently we have 
seen it re-occur much faster, we cannot tell how long we will have to wait)


was (Author: michael.andre.pearce):
[~becket_qin] we have managed to recover the log4j cleaner log from yesterdays, 
from the same node we supplied the log segments. 

We have now re rolled back forward that testing environment back onto 0.10.1.1 
and waiting for it to re-occur. (bear in mind we had been running 6 weeks 
before we saw the initial issue on 0.10.1.0 issue, whilst subsequently we have 
seen it re-occur much faster, we cannot tell how long we will have to wait)


[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15801230#comment-15801230
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/5/17 12:34 PM:
---

[~becket_qin] we have managed to recover the log4j cleaner log from yesterdays, 
from the same node we supplied the log segments. 

We have now re rolled back forward that testing environment back onto 0.10.1.1 
and waiting for it to re-occur. (bear in mind we had been running 6 weeks 
before we saw the initial issue on 0.10.1.0 issue, whilst subsequently we have 
seen it re-occur much faster, we cannot tell how long we will have to wait)


was (Author: michael.andre.pearce):
[~becket_qin] we have managed to recover the log4j cleaner log from yesterdays, 
from the same node we supplied the log segments. 

We have now re rolled back forward that testing environment back onto 0.10.1.1 
and waiting for it to re-occur. (bear in mind we had been running 6 weeks 
before we saw the initial issue on 0.10.1.0 issue, whilst subsequently we have 
seen it much faster, we cannot tell how long we will have to wait)


[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15801230#comment-15801230
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/5/17 12:33 PM:
---

[~becket_qin] we have managed to recover the log4j cleaner log from yesterdays, 
from the same node we supplied the log segments. 

We have now re rolled back forward that testing environment back onto 0.10.1.1 
and waiting for it to re-occur. (bear in mind we had been running 6 weeks 
before we saw the initial issue on 0.10.1.0 issue, whilst subsequently we have 
seen it much faster, we cannot tell how long we will have to wait)


was (Author: michael.andre.pearce):
[~becket_qin] we have managed to recover the log4j cleaner log from yesterdays, 
from the same node we supplied the log segments. 

We have now re rolled back forward that testing environment back onto 0.10.1.1 
and waiting for it to re-occur. (bear in mind we had been running 6 weeks 
before we saw the issue, we cannot tell how long we will have to wait)


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15801244#comment-15801244
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

https://issues.apache.org/jira/secure/attachment/12845787/vrtstokf005_4thJan


[jira] [Updated] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Andre Pearce (IG) updated KAFKA-4497:
-
Attachment: vrtstokf005_4thJan

[~becket_qin] we have managed to recover yesterday's log4j cleaner log from the 
same node that we supplied the log segments from. 

We have now rolled that testing environment forward onto 0.10.1.1 again and are 
waiting for the issue to re-occur. (Bear in mind we had been running for 6 weeks 
before we saw the issue, so we cannot tell how long we will have to wait.)


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15800869#comment-15800869
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

Re configs, I can give these now.

log.cleaner.backoff.ms = 15000
log.cleaner.dedupe.buffer.size = 134217728
log.cleaner.delete.retention.ms = 8640
log.cleaner.enable = true
log.cleaner.io.buffer.load.factor = 0.9
log.cleaner.io.buffer.size = 524288
log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
log.cleaner.min.cleanable.ratio = 0.5
log.cleaner.min.compaction.lag.ms = 0
log.cleaner.threads = 1
log.cleanup.policy = [delete]


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15800617#comment-15800617
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

Hi [~becket_qin], sorry, we thought [~junrao] only asked for the segment logs, so 
we just packaged up that partition's logs and not the log4j output as well. We've 
now rolled back that environment. 

I will discuss today whether we can roll that environment forward again and try to 
re-create the issue on 0.10.1.1. So we're 100% clear: if we do, we will probably 
roll back as soon as we detect the issue again, so can you list precisely what else 
you will need or want, so that we capture everything?

Cheers
Mike


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799732#comment-15799732
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 1/4/17 11:44 PM:
---

We haven't seen this re-occur, though we have only been running 0.10.1.1 for 
<1 week, and only in testing and uat envs.


was (Author: michael.andre.pearce):
We haven't seen this re-occur, though running <1 week only running in testing 
and uat envs.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: 2016_12_15.zip, issue_node_1001.log, 
> issue_node_1001_ext.log, issue_node_1002.log, issue_node_1002_ext.log, 
> issue_node_1003.log, issue_node_1003_ext.log, kafka.jstack, 
> state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty work around to keep service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to just itself; moments later we see other nodes having disconnects, 
> followed finally by app issues, where producing to these partitions is 
> blocked.
> It seems that only restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service, we are currently putting in an automated 
> process that tails and regexes for the pattern below, and where new_partitions 
> hits just the node itself we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"
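
For reference, below is a rough, illustrative sketch of the kind of watchdog 
described above. It is not the actual script: the regex (and its group names), the 
broker id argument and the println standing in for a restart are all assumptions. 
It simply reads a tail of the broker log on stdin and flags when an ISR shrinks to 
just the local broker.

{noformat}
// Illustrative sketch only, assuming log lines like the excerpt above.
import scala.io.Source

object IsrShrinkWatchdog {
  // e.g. [2016-12-01 07:01:28,112] INFO Partition [topic,10] on broker 7:
  //      Shrinking ISR for partition [topic,10] from 1,2,7 to 7 (kafka.cluster.Partition)
  private val ShrinkLine =
    """\[(.+?)\] INFO Partition \[(.+?)\] on broker (\d+): Shrinking ISR for partition \[.+?\] from (.+?) to (.+?) \(kafka\.cluster\.Partition\)""".r

  def main(args: Array[String]): Unit = {
    val localBrokerId = args.headOption.getOrElse("7") // assumed to be passed in
    for (line <- Source.stdin.getLines()) line match {
      case ShrinkLine(date, partition, broker, _, newIsr)
          if broker == localBrokerId && newIsr.trim == localBrokerId =>
        // ISR shrank to just this broker: the condition the workaround restarts on.
        println(s"[$date] ISR for $partition shrank to broker $localBrokerId only - restart broker")
      case _ => // ignore other log lines
    }
  }
}
{noformat}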



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799732#comment-15799732
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

We haven't seen this re-occur, though we have only been running for <1 week, and 
only in testing and uat envs.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799599#comment-15799599
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

Btw, we normally actually use the Confluent build, though we decided to patch 
3.1.x with 0.10.1.1 manually to see whether it solved the issue in our testing 
and uat envs; that is how we found it didn't resolve the issue. 

If you find the issue and have a fix, would you expect to get it patched into the 
upcoming Confluent build based on 0.10.1.1, so we wouldn't have to hot patch that 
immediately?


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799594#comment-15799594
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

[~junrao] just attached, with everything. As it's a testing env it's all non-real 
data, so we can share.


[jira] [Updated] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Andre Pearce (IG) updated KAFKA-4497:
-
Attachment: com_ig_trade_v1_order_event--demo--compacted-7.tar.bz2

[~junrao] files (logs and everything) for the partition from the testing env, 
where we had it occur at 15:52 today (4th Jan 2017).


[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799490#comment-15799490
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/4/17 10:09 PM:
---

[~junrao] We have now had this occur in our testing env also (so 4 envs are 
affected now), so we can get the log segments from there. We are currently just 
sorting out another environment with our operations team tonight, trying to handle 
the fact that disks are filling rapidly; once that is done we will try to get the 
segments out of the testing env and onto this ticket tonight, so you have them to 
look at during your day/our night.

RE our patch: apart from it being naive, and obviously a plaster rather than a 
cure, and given the time index code is new and I assume you have better knowledge 
of that code, would you foresee any knock-on issues with applying it? As far as we 
can tell it doesn't affect anything else, but we don't use the time index feature 
yet. 


was (Author: michael.andre.pearce):
[~junrao]We had this occur now in our testing env also (so now 4 envs 
affected), so we can get the log segments from there. Currently just sorting 
out another environment with our operations team tonight trying to handle the 
fact disks are filling rapidly, once done we will go try get them out of 
testing env, onto this ticket tonight, so you have them to look at during your 
day/our night.


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799490#comment-15799490
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

[~junrao] We have now had this occur in our testing env also (so 4 envs are 
affected now), so we can get the log segments from there. We are currently just 
sorting out another environment with our operations team tonight, trying to handle 
the fact that disks are filling rapidly; once that is done we will try to get the 
segments out of the testing env and onto this ticket tonight, so you have them to 
look at during your day/our night.


[jira] [Comment Edited] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15798847#comment-15798847
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4497 at 1/4/17 5:43 PM:
--

[~junrao] This is causing us a production issue on the current version; we are 
currently also looking at rolling back to 0.10.0.1, as eBay has, unless we get 
this fixed asap. This is a terminal issue.

Whilst we agree, in the purist view, that we should fix the root cause, and we are 
happy to add logging later to help find it, is there any reason not to accept this 
PR to fix/treat the symptom and avoid the production issue?

I would imagine others will start hitting the issue too; we're not alone, so this 
would be a blocker for anyone upgrading to 0.10.1.0/0.10.1.1 now.


was (Author: michael.andre.pearce):
[~junrao] This is causing us a production issue on the current version, 
currently also looking now at rolling back like eBay has to 0.10.0.1, unless we 
get this fixed asap.

Whilst we agree in the purist view that we should fix the root cause, and happy 
to add logging later to further find that, is there any reason not to accept 
this PR to fix/treating the symptom to avoid production issue.

I would imagine others also will start hitting the issue too, we're not alone, 
as such would cause blocker to anyone upgrading to 0.10.1.0/0.10.1.1 now.


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15798847#comment-15798847
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

[~junrao] This is causing us a production issue on the current version; we are 
currently also looking at rolling back to 0.10.0.1, as eBay has, unless we get 
this fixed asap.

Whilst we agree, in the purist view, that we should fix the root cause, and we are 
happy to add logging later to help find it, is there any reason not to accept this 
PR to fix/treat the symptom and avoid the production issue?

I would imagine others will start hitting the issue too; we're not alone, so this 
would be a blocker for anyone upgrading to 0.10.1.0/0.10.1.1 now.


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15798007#comment-15798007
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

[~ijuma] It seems exactly the same. We note that it is blowing up because -1 is 
still being sent to the TimeIndex; as noted above, -1 means there is no offset for 
the message set (which, as noted above, can be empty), and as noted earlier a 
naive way to work around it is to simply avoid throwing the exception so the 
compaction thread isn't killed. 

As such, can you review and comment on our PR, which simply makes the maybeAppend 
method do nothing if the offset is -1 (special case)? Whilst naive, it seems this 
avoids the issue. Can we confirm there is nothing adversely missed with this 
approach?
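
For illustration only, here is a minimal, self-contained sketch of the kind of 
guard being proposed. It is not the actual Kafka code or the PR: the entry 
storage, the exception type and the maybeAppend signature are simplified 
assumptions based on the stack trace above. The idea is simply to treat offset -1 
as "nothing to index" and return early, rather than letting the append throw and 
kill the cleaner thread.

{noformat}
// Illustrative sketch only - NOT the actual kafka.log.TimeIndex or the PR.
object TimeIndexGuardSketch {
  final case class TimeIndexEntry(timestamp: Long, offset: Long)

  class NaiveTimeIndex {
    private var entries = Vector.empty[TimeIndexEntry]

    def lastOffset: Long = entries.lastOption.map(_.offset).getOrElse(-1L)

    def maybeAppend(timestamp: Long, offset: Long): Unit = {
      // Special case described above: offset -1 means the message set produced
      // no offset (it can be empty), so skip the entry instead of throwing and
      // killing the log cleaner thread.
      if (offset < 0) return

      if (offset <= lastOffset)
        throw new IllegalArgumentException(
          s"Attempt to append an offset ($offset) no larger than the last offset appended ($lastOffset)")

      entries :+= TimeIndexEntry(timestamp, offset)
    }
  }

  def main(args: Array[String]): Unit = {
    val idx = new NaiveTimeIndex
    idx.maybeAppend(timestamp = 1480689350000L, offset = 11832L)
    idx.maybeAppend(timestamp = 1480689351000L, offset = -1L) // skipped, no exception
    println(idx.lastOffset) // prints 11832
  }
}
{noformat}

As the comments above already acknowledge, this only treats the symptom; the root 
cause of -1 reaching the time index still needs to be found.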




[jira] [Issue Comment Deleted] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Andre Pearce (IG) updated KAFKA-4497:
-
Comment: was deleted

(was: [~ijuma] it seems exactly the same, we note that its blowing up because 
-1 is still being sent to the TimeIndex, as noted above when -1 it means there 
is no offset for the messageset, as noted above it can be empty,  also as noted 
earlier a naive approach to solve is to simply avoid it throwing the exception 
so the compaction thread isn't killed. 

As such can you review and comment on our PR which is simply in the maybeAppend 
method to not do anything if offset is -1 (special case). Whilst naive it seems 
this avoids the issue. Can we confirm there is nothing adversely missed with 
this approach.

)

> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because he finds messages, which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure, where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean, it does not find any log segments which 
> can be cleaned up or the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering, why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at 
> kafka

[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15798006#comment-15798006
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4497:
--

[~ijuma] It seems exactly the same. We note that it's blowing up because -1 is 
still being sent to the TimeIndex; as noted above, -1 means there is no offset 
for the message set (which can be empty), and as noted earlier a naive way to 
solve this is simply to avoid throwing the exception so the compaction thread 
isn't killed. 

As such, can you review and comment on our PR, which simply makes the maybeAppend 
method do nothing if the offset is -1 (special case)? Whilst naive, it seems 
this avoids the issue. Can we confirm there is nothing adversely missed with 
this approach?



> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because he finds messages, which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure, where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean, it does not find any log segments which 
> can be cleaned up or the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering, why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:1

[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-16 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755064#comment-15755064
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

Agreed, this one indeed does seem very similar to that.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: 2016_12_15.zip, issue_node_1001.log, 
> issue_node_1001_ext.log, issue_node_1002.log, issue_node_1002_ext.log, 
> issue_node_1003.log, issue_node_1003_ext.log, kafka.jstack, 
> state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-15 Thread Michael Andre Pearce (IG) (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Andre Pearce (IG) updated KAFKA-4477:
-
Comment: was deleted

(was: IG ISR issue of 2016-12-15 04:27 (this time we see deadlock) attached.)

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: 2016_12_15.zip, issue_node_1001.log, 
> issue_node_1001_ext.log, issue_node_1002.log, issue_node_1002_ext.log, 
> issue_node_1003.log, issue_node_1003_ext.log, kafka.jstack, 
> state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-15 Thread Michael Andre Pearce (IG) (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Andre Pearce (IG) updated KAFKA-4477:
-
Attachment: 2016_12_15.zip

IG ISR issue of 2016-12-15 04:27 (this time we see deadlock) attached.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: 2016_12_15.zip, issue_node_1001.log, 
> issue_node_1001_ext.log, issue_node_1002.log, issue_node_1002_ext.log, 
> issue_node_1003.log, issue_node_1003_ext.log, kafka.jstack, 
> state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-15 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15751019#comment-15751019
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

Hi [~junrao]

We had a similar issue this morning; this time, though, we DO see a deadlock.

I've attached all the logs and the stacks we gathered from all the processes. The logs 
are in CSV; the original log line is in the _raw column of the CSV.

I've also attached screenshots of our monitoring graphs of the follower stats. We do 
see a spike, but this seems to be post restart of the process (I think we should 
expect that?).

All are in 2016_15_12.zip.

Cheers
Mike



> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-15 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750994#comment-15750994
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

Do we have a timeline on RC1? 

It would be good to have this available ASAP.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747271#comment-15747271
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 12/14/16 4:59 AM:


Hi [~apurva],

While I await the issue occurring again to provide some further logs for you, just 
reading the above comment, a query on this.

Whilst by the sounds of it there is a possible deadlock causing the ISR not to 
re-expand (though some stacks we have captured don't show this), the question is why 
the ISRs are shrinking in the first place?

Re the 0.10.1.1 RC: unfortunately, in the environments where we see this, we will only 
be able to deploy it once 0.10.1.1 is GA/tagged, as they're UAT and PROD 
environments. 

Maybe it's worth pushing to get 0.10.1.1 tagged and released now, without waiting for 
additional fixes; from what I understand this version is just fixes anyhow, and if 
further issues are detected we get a 0.10.1.2 with further hot fixes.

On a note, according to others 0.10.0.0 doesn't seem to contain this issue (we can 
only confirm 0.9.0.1 doesn't; we didn't run for a long period on 0.10.0.0 before 
upgrading some brokers to 0.10.1.0). Is there any possible way to downgrade from 
0.10.1.0 to 0.10.0.0, and is there a doc for this? Obviously all docs are for upgrade 
paths, not downgrades.

Cheers
Mike


was (Author: michael.andre.pearce):
Hi [~apurva],

Whilst i await the issue to occur again to provide some further logs for you.

Just reading the above comment, and a query on this. 

Whilst obviously theres by the sounds of it a possible deadlock causing the ISR 
not to re-expand (though some stacks we have captured don't show this). The 
question in the first place is why even are the ISR's shrinking in the first 
place? 

Re 0.10.1.1 RC unfortunately in the environments we see it in, we will only be 
able to deploy it once 0.10.1.1 is GA/Tagged as they're UAT and PROD 
environments. 

Maybe its worth we push for getting 0.10.1.1 tagged and released now, without 
waiting for additional fixes, as from what i understand this version is just 
fixes anyhow, then if still issues detected we get a 0.10.1.2 with further hot 
fixes.

On a note it seems 0.10.0.0 doesn't seem according to others to contain this 
issue (we can only confirm 0.9.0.1 doesnt), is there any possible way to 
downgrade from 0.10.1.0 to 0.10.0.0 , is there a doc for this? Obviously all 
docs are for upgrade paths not downgrade.

Cheers
Mike

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> Al

[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747271#comment-15747271
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 12/14/16 4:58 AM:


Hi [~apurva],

Whilst i await the issue to occur again to provide some further logs for you.

Just reading the above comment, and a query on this. 

Whilst obviously theres by the sounds of it a possible deadlock causing the ISR 
not to re-expand (though some stacks we have captured don't show this). The 
question in the first place is why even are the ISR's shrinking in the first 
place? 

Re 0.10.1.1 RC unfortunately in the environments we see it in, we will only be 
able to deploy it once 0.10.1.1 is GA/Tagged as they're UAT and PROD 
environments. 

Maybe its worth we push for getting 0.10.1.1 tagged and released now, without 
waiting for additional fixes, as from what i understand this version is just 
fixes anyhow, then if still issues detected we get a 0.10.1.2 with further hot 
fixes.

On a note it seems 0.10.0.0 doesn't seem according to others to contain this 
issue (we can only confirm 0.9.0.1 doesnt), is there any possible way to 
downgrade from 0.10.1.0 to 0.10.0.0 , is there a doc for this? Obviously all 
docs are for upgrade paths not downgrade.

Cheers
Mike


was (Author: michael.andre.pearce):
Hi [~apurva],

Whilst i await the issue to occur again to provide some further logs for you.

Just reading the above comment, and a query on this. 

Whilst obviously theres by the sounds of it a possible deadlock causing the ISR 
not to re-expand (though some stacks we have captured don't show this). The 
question in the first place is why even are the ISR's shrinking in the first 
place? 

Re 0.10.1.1 RC unfortunately in the environments we see it in, we will only be 
able to deploy it once 0.10.1.1 is GA/Tagged as they're UAT and PROD 
environments.

On a note it seems 0.10.0.0 doesn't seem according to others to contain this 
issue (we can only confirm 0.9.0.1 doesnt), is there any possible way to 
downgrade from 0.10.1.0 to 0.10.0.0 , is there a doc for this? Obviously all 
docs are for upgrade paths not downgrade.

Cheers
Mike

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are curr

[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747271#comment-15747271
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 12/14/16 4:55 AM:


Hi [~apurva],

Whilst i await the issue to occur again to provide some further logs for you.

Just reading the above comment, and a query on this. 

Whilst obviously theres by the sounds of it a possible deadlock causing the ISR 
not to re-expand (though some stacks we have captured don't show this). The 
question in the first place is why even are the ISR's shrinking in the first 
place? 

Re 0.10.1.1 RC unfortunately in the environments we see it in, we will only be 
able to deploy it once 0.10.1.1 is GA/Tagged as they're UAT and PROD 
environments.

On a note it seems 0.10.0.0 doesn't seem according to others to contain this 
issue (we can only confirm 0.9.0.1 doesnt), is there any possible way to 
downgrade from 0.10.1.0 to 0.10.0.0 , is there a doc for this? Obviously all 
docs are for upgrade paths not downgrade.

Cheers
Mike


was (Author: michael.andre.pearce):
Hi Apurva,

Whilst i await the issue to occur again to provide some further logs for you.

Just reading the above comment, and a query on this. 

Whilst obviously theres by the sounds of it a possible deadlock causing the ISR 
not to re-expand (though some stacks we have captured don't show this). The 
question in the first place is why even are the ISR's shrinking in the first 
place? 

Re 0.10.1.1 RC unfortunately in the environments we see it in, we will only be 
able to deploy it once 0.10.1.1 is GA/Tagged as they're UAT and PROD 
environments.

On a note it seems 0.10.0.0 doesn't seem according to others to contain this 
issue (we can only confirm 0.9.0.1 doesnt), is there any possible way to 
downgrade from 0.10.1.0 to 0.10.0.0 , is there a doc for this? Obviously all 
docs are for upgrade paths not downgrade.

Cheers
Mike

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(

[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747271#comment-15747271
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

Hi Apurva,

While I await the issue occurring again to provide some further logs for you, just 
reading the above comment, a query on this.

Whilst by the sounds of it there is a possible deadlock causing the ISR not to 
re-expand (though some stacks we have captured don't show this), the question is why 
the ISRs are shrinking in the first place?

Re the 0.10.1.1 RC: unfortunately, in the environments where we see this, we will only 
be able to deploy it once 0.10.1.1 is GA/tagged, as they're UAT and PROD 
environments.

On a note, according to others 0.10.0.0 doesn't seem to contain this issue (we can 
only confirm 0.9.0.1 doesn't). Is there any possible way to downgrade from 0.10.1.0 to 
0.10.0.0, and is there a doc for this? Obviously all docs are for upgrade paths, not 
downgrades.

Cheers
Mike

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745749#comment-15745749
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 12/13/16 5:49 PM:


It is worth noting that, as mentioned by someone else, we see the open file descriptors 
increase if we leave the process in a sick mode (now that we restart quickly we don't 
get to observe this).


was (Author: michael.andre.pearce):
It is worth noting we see the open file descriptors if we leave the process in 
a sick mode (now we restart quickly we don't get to observe this).

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745749#comment-15745749
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

It is worth noting we see the open file descriptors if we leave the process in 
a sick mode (now we restart quickly we don't get to observe this).

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occured in different 
> physical environments. We haven't worked out what is going on. We do though 
> have a nasty work around to keep service alive. 
> We do have not had this issue on clusters still running 0.9.01.
> We have noticed a node randomly shrinking for the partitions it owns the 
> ISR's down to itself, moments later we see other nodes having disconnects, 
> followed by finally app issues, where producing to these partitions is 
> blocked.
> It seems only by restarting the kafka instance java process resolves the 
> issues.
> We have had this occur multiple times and from all network and machine 
> monitoring the machine never left the network, or had any other glitches.
> Below are seen logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> amount of close_waits and file descriptors.
> As a work around to keep service we are currently putting in an automated 
> process that tails and regex's for: and where new_partitions hit just itself 
> we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745734#comment-15745734
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

Hi Jun,

The stack was taken by the automated restart script we've had to put in place, before 
it restarted the nodes; the script picked up the issue 20 seconds after it started.

The broker during that period is not under high load. We do not see any GC issues, nor 
do we see any ZK issues.

The logs we are seeing match those of other people. We have had this occur a further 3 
times, all with very similar logs, i.e. nothing new is showing up.

On a side note, we are looking to upgrade to 0.10.1.1 as soon as it's released and we 
see it released by Confluent also. We do this as we expect some further sanity checks 
to have occurred, and we use it as a measure that there are no critical issues.

We will aim to push to UAT quickly (where we also see this issue; weirdly we haven't 
had it occur in TEST or DEV) to see if this is resolved. What is the expected timeline 
for this? Are we still expecting it to be released today? And when would Confluent be 
likely to complete their testing and release?

Cheers
Mike

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to itself; moments later we see other nodes having disconnects, followed 
> finally by app issues, where producing to these partitions is blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are the logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of close_waits 
> and file descriptors on the sick machine.
> As a workaround to keep service alive, we are currently putting in an automated 
> process that tails the logs and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-05 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15721875#comment-15721875
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

This occurred again in a prod environment just after 2am on Saturday. 

I've attached the stack trace that was captured before the node was restarted by 
our platform operations team. Looking at the stack trace, there are no 
deadlocks, unlike in the JIRA ticket you mentioned.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Priority: Critical
>  Labels: reliability
> Attachments: kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to itself; moments later we see other nodes having disconnects, followed 
> finally by app issues, where producing to these partitions is blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are the logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of close_waits 
> and file descriptors on the sick machine.
> As a workaround to keep service alive, we are currently putting in an automated 
> process that tails the logs and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-05 Thread Michael Andre Pearce (IG) (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Andre Pearce (IG) updated KAFKA-4477:
-
Attachment: kafka.jstack

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Priority: Critical
>  Labels: reliability
> Attachments: kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to itself; moments later we see other nodes having disconnects, followed 
> finally by app issues, where producing to these partitions is blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are the logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of close_waits 
> and file descriptors on the sick machine.
> As a workaround to keep service alive, we are currently putting in an automated 
> process that tails the logs and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-02 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15715205#comment-15715205
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

What we see, as per the logs, is that the node that goes ill first reduces its 
ISRs, then we see disconnects from everywhere else. It does seem similar, but 
we never see it re-expand the ISRs, unlike in the log files they've submitted.

We have had the issue in 3 environments, 5 times in the last week and a half. 
We are trying to push through automated stack-trace gathering. We unfortunately 
don't use our APM tools to auto-instrument Kafka (note: one reason for KIP-82).

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Priority: Critical
>  Labels: reliability
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to itself; moments later we see other nodes having disconnects, followed 
> finally by app issues, where producing to these partitions is blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are the logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of close_waits 
> and file descriptors on the sick machine.
> As a workaround to keep service alive, we are currently putting in an automated 
> process that tails the logs and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-02 Thread Michael Andre Pearce (IG) (JIRA)
Michael Andre Pearce (IG) created KAFKA-4477:


 Summary: Node reduces its ISR to itself, and doesn't recover. 
Other nodes do not take leadership, cluster remains sick until node is 
restarted.
 Key: KAFKA-4477
 URL: https://issues.apache.org/jira/browse/KAFKA-4477
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 0.10.1.0
 Environment: RHEL7

java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
Reporter: Michael Andre Pearce (IG)
Priority: Critical


We have encountered a critical issue that has re-occurred in different physical 
environments. We haven't worked out what is going on. We do, though, have a nasty 
workaround to keep the service alive. 

We have not had this issue on clusters still running 0.9.0.1.

We have noticed a node randomly shrinking the ISRs for the partitions it owns 
down to itself; moments later we see other nodes having disconnects, followed 
finally by app issues, where producing to these partitions is blocked.

It seems that only restarting the Kafka instance's Java process resolves the issue.

We have had this occur multiple times, and from all network and machine 
monitoring the machine never left the network or had any other glitches.

Below are the logs seen during the issue.

Node 7:
[2016-12-01 07:01:28,112] INFO Partition 
[com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking ISR 
for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 1,2,7 
to 7 (kafka.cluster.Partition)

All other nodes:
[2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
(kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 7 was disconnected before the response was 
read

All clients:
java.util.concurrent.ExecutionException: 
org.apache.kafka.common.errors.NetworkException: The server disconnected before 
a response was received.


After this occurs, we then suddenly see an increasing number of close_waits and 
file descriptors on the sick machine.

As a workaround to keep service alive, we are currently putting in an automated 
process that tails the logs and matches the regex below; where new_partitions is 
just the node itself, we restart the node. 

"\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
partition \[.*\] from (?P.+) to (?P.+) 
\(kafka.cluster.Partition\)"
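
A minimal sketch of this watchdog idea is below. It is not our actual production 
script; the log path, broker id and restart command are illustrative 
assumptions, and the pattern is written as a Java-flavoured regex rather than 
the Python-style one quoted above. It tails the broker log and triggers a 
restart when a "Shrinking ISR" line collapses the ISR to the local broker alone:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IsrShrinkWatchdog {
    // Matches the broker's "Shrinking ISR" log line and captures the old and new ISR lists.
    private static final Pattern SHRINK = Pattern.compile(
        "Shrinking ISR for partition \\[.*\\] from (?<from>.+) to (?<to>.+) \\(kafka.cluster.Partition\\)");

    public static void main(String[] args) throws IOException, InterruptedException {
        String brokerId = "7";  // assumption: id of the broker this watchdog runs on
        BufferedReader log = new BufferedReader(new FileReader("/var/log/kafka/server.log"));
        while (true) {
            String line = log.readLine();
            if (line == null) { Thread.sleep(500); continue; }  // simple tail -f; log rotation not handled
            Matcher m = SHRINK.matcher(line);
            if (m.find() && m.group("to").trim().equals(brokerId)) {
                // ISR shrank to just this broker: restart the Kafka process (restart hook is illustrative)
                new ProcessBuilder("systemctl", "restart", "kafka").inheritIO().start();
                return;
            }
        }
    }
}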



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691817#comment-15691817
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/24/16 1:03 AM:


Regarding the link you reference on virtual calls:

This is much more about monomorphic versus polymorphic calls. Making a class 
that implements an interface final, where the method invocation is via the 
interface's methods, does not change this.

It is more to do with the number of classes loaded/invoked that implement the 
interface. 

So in the case of a single implementation being used and loaded in your JVM, you 
have a monomorphic case for the interface, and the JVM will inline it (final or 
not).

If you happen to have two implementations being used and loaded, the JVM will 
still be able to inline, but will create a branch; the second-loaded 
implementation will be slower when invoked, due to the branch.

If you have more than two implementations loaded, the JVM will, on loading 
these, do on-stack replacement of the previously inlined code and move to using 
virtual (jump) tables.


You'll see this occur if you turn on -XX:+PrintCompilation.

A classical implementation/test and write-up showing this is:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note that taking the code in the blog and running it with or without 
final implementations makes no difference.

Also, I've taken this test from the above blog for your final and non-final 
cases (attached to this JIRA); note that I've uploaded two versions, one with 
the final implementation being declared and loaded by the JVM first, and vice 
versa. As you'll note, in both cases the implementation loaded first is more 
performant, due to the inlined branch.

On checking your original test case, we noted that the FinalByteArraySerializer 
version runs first (due to the alphabetical order the tests are run in); as such 
it would always be first in the inline branch and benefit from this, which would 
explain why final always seemed negligibly faster when running your benchmark 
test case.
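
Below is a minimal sketch (not the attached FinalTest.java; the class and method 
names are illustrative) of the effect described above: the cost of a call 
through an interface depends on how many implementations the JVM has seen at 
that call site, not on whether those implementations are final. Run it with 
-XX:+PrintCompilation (and optionally -XX:+PrintInlining) to watch the call site 
go monomorphic, then bimorphic, then megamorphic as further implementations are 
exercised:

public class CallSiteDemo {
    interface Codec { int encode(int v); }
    static final class A implements Codec { public int encode(int v) { return v + 1; } }
    static final class B implements Codec { public int encode(int v) { return v + 2; } }
    static final class C implements Codec { public int encode(int v) { return v + 3; } }

    // A single interface call site, exercised with whichever implementation is passed in.
    static long drive(Codec codec, int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) sum += codec.encode(i);
        return sum;
    }

    public static void main(String[] args) {
        long s = 0;
        s += drive(new A(), 10_000_000);  // monomorphic: fully inlined, final or not
        s += drive(new B(), 10_000_000);  // bimorphic: still inlined, but behind a guard/branch
        s += drive(new C(), 10_000_000);  // megamorphic: falls back to virtual (jump) table dispatch
        System.out.println(s);
    }
}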



was (Author: michael.andre.pearce):
The link you reference re virtual calls.

This is much more about monomorphic call or polymorphic calls. Making a class 
that implements an interface final, where the method invocation is by interface 
methods, does not change this.

This is more to do with the number of class's loaded/invoked that implement the 
interface. 

So in case of single implementation being used and loaded your jvm you have a 
monomorphic case for the interface, the JVM will inline this (final or not).

If you happen to have two implementations being used and loaded the jvm will 
still be able to inline but will create a branch case, the second loaded 
implementation will be slower if invoked due to the branch.

If you have more than two implementations loaded the JVM will on loading these 
do on stack replacement of the previously loaded inlined, and move to using 
virtual tables.


You'll see this occur if you turn on -XX:+PrintCompilation

A classical implementation/test and write up showing this is:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note taking the code in the blog, and running it with or without final 
implementations makes no difference.

Also i've taken this test from the above blog, for your final and non final 
cases (i've attached to this jira), if you note I've uploaded two versions, one 
with the final being declared and loaded by the JVM first and vice versa. As 
you note in both the implementation loaded first due to the inlined branch will 
be more performant.

On checking your original test case we noted that the FinalByteArraySerializer 
version runs first (due to alphabetic ordering that test are run in) , as such 
it would be always the first in the inline branch benefitting from this, this 
would explain why it seems always final was negligible faster when running your 
benchmark test case.


> Make serializer classes final
> -
>
> Key: KAFKA-4424
> URL: https://issues.apache.org/jira/browse/KAFKA-4424
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias Bechtold
>Priority: Minor
> Attachments: FinalTest.java, FinalTestReversed.java
>
>
> Implementations of simple serializers / deserializers should be final to 
> prevent JVM method call overhead.
> See also:
> https://wiki.openjdk.java.net/display/HotSpot/VirtualCalls
> This breaks the API slightly, inheritors must change to generic interfaces 
> Serializer / Deserializer. But architecture-wise final serialization classes 
> make the most sense to me.
> So what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691817#comment-15691817
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/24/16 12:59 AM:
-

The link you reference re virtual calls.

This is much more about monomorphic call or polymorphic calls. Making a class 
that implements an interface final, where the method invocation is by interface 
methods, does not change this.

This is more to do with the number of class's loaded/invoked that implement the 
interface. 

So in case of single implementation being used and loaded your jvm you have a 
monomorphic case for the interface, the JVM will inline this (final or not).

If you happen to have two implementations being used and loaded the jvm will 
still be able to inline but will create a branch case, the second loaded 
implementation will be slower if invoked due to the branch.

If you have more than two implementations loaded the JVM will on loading these 
do on stack replacement of the previously loaded inlined, and move to using 
virtual tables.


You'll see this occur if you turn on -XX:+PrintCompilation

A classical implementation/test and write up showing this is:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note taking the code in the blog, and running it with or without final 
implementations makes no difference.

Also i've taken this test from the above blog, for your final and non final 
cases (i've attached to this jira), if you note I've uploaded two versions, one 
with the final being declared and loaded by the JVM first and vice versa. As 
you note in both the implementation loaded first due to the inlined branch will 
be more performant.

On checking your original test case we noted that the FinalByteArraySerializer 
version runs first (due to alphabetic ordering that test are run in) , as such 
it would be always the first in the inline branch benefitting from this, this 
would explain why it seems always final was negligible faster when running your 
benchmark test case.



was (Author: michael.andre.pearce):
The link you reference re virtual calls.

This is much more about monomorphic call or polymorphic calls. Making a class 
that implements an interface final, where the method invocation is by interface 
methods, does not change this.

This is more to do with the number of class's loaded/invoked that implement the 
interface. 

So in case of single implementation being used and loaded your jvm you have a 
monomorphic case for the interface, the JVM will inline this (final or not).

If you happen to have two implementations being used and loaded the jvm will 
still be able to inline but will create a branch case, the second loaded 
implementation will be slower if invoked due to the branch.

If you have more than two implementations loaded the JVM will on loading these 
do on stack replacement of the previously loaded inlined, and move to using 
virtual tables.


You'll see this occur if you turn on -XX:+PrintCompilation

A classical implementation and write up showing this is:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note taking the code, and running it with or without final 
implementations makes no difference.

Also i've taken this classical test, for your final and non final cases 
(attached to this jira), if you note I've loaded two versions, one with the 
final being declared and loaded by the JVM first and vice versa. As you note in 
both the implementation loaded first due to the inlined branch will be more 
performant.

On checking your original test case we noted that the FinalByteArraySerializer 
version runs first (due to alphabetic ordering that test are run in) , as such 
it would be always the first in the inline branch benefitting from this, this 
would explain why it seems always final was negligible faster when running your 
benchmark test case.


> Make serializer classes final
> -
>
> Key: KAFKA-4424
> URL: https://issues.apache.org/jira/browse/KAFKA-4424
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias Bechtold
>Priority: Minor
> Attachments: FinalTest.java, FinalTestReversed.java
>
>
> Implementations of simple serializers / deserializers should be final to 
> prevent JVM method call overhead.
> See also:
> https://wiki.openjdk.java.net/display/HotSpot/VirtualCalls
> This breaks the API slightly, inheritors must change to generic interfaces 
> Serializer / Deserializer. But architecture-wise final serialization classes 
> make the most sense to me.
> So what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691817#comment-15691817
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/24/16 12:57 AM:
-

The link you reference re virtual calls.

This is much more about monomorphic call or polymorphic calls. Making a class 
that implements an interface final, where the method invocation is by interface 
methods, does not change this.

This is more to do with the number of class's loaded/invoked that implement the 
interface. 

So in case of single implementation being used and loaded your jvm you have a 
monomorphic case for the interface, the JVM will inline this (final or not).

If you happen to have two implementations being used and loaded the jvm will 
still be able to inline but will create a branch case, the second loaded 
implementation will be slower if invoked due to the branch.

If you have more than two implementations loaded the JVM will on loading these 
do on stack replacement of the previously loaded inlined, and move to using 
virtual tables.


You'll see this occur if you turn on -XX:+PrintCompilation

A classical implementation and write up showing this is:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note taking the code, and running it with or without final 
implementations makes no difference.

Also i've taken this classical test, for your final and non final cases 
(attached to this jira), if you note I've loaded two versions, one with the 
final being declared and loaded by the JVM first and vice versa. As you note in 
both the implementation loaded first due to the inlined branch will be more 
performant.

On checking your original test case we noted that the FinalByteArraySerializer 
version runs first (due to alphabetic ordering that test are run in) , as such 
it would be always the first in the inline branch benefitting from this, this 
would explain why it seems always final was negligible faster when running your 
benchmark test case.



was (Author: michael.andre.pearce):
The link you reference re virtual calls.

This is much more about monomorphic call or polymorphic calls. Making a class 
that implements an interface final, where the method invocation is by interface 
methods, does not change this.

This is more to do with the number of class's loaded that implement the 
interface. 

So in case of single implementation being used and loaded your jvm you have a 
monomorphic case for the interface, the JVM will inline this (final or not).

If you happen to have two implementations being used and loaded the jvm will 
still be able to inline but will create a branch case, the second loaded 
implementation will be slower if invoked due to the branch.

If you have more than two implementations loaded the JVM will on loading these 
do on stack replacement of the previously loaded inlined, and move to using 
virtual tables.


You'll see this occur if you turn on -XX:+PrintCompilation

A classical implementation and write up showing this is:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note taking the code, and running it with or without final 
implementations makes no difference.

Also i've taken this classical test, for your final and non final cases 
(attached to this jira), if you note I've loaded two versions, one with the 
final being declared and loaded by the JVM first and vice versa. As you note in 
both the implementation loaded first due to the inlined branch will be more 
performant.

On checking your original test case we noted that the FinalByteArraySerializer 
version runs first (due to alphabetic ordering that test are run in) , as such 
it would be always the first in the inline branch benefitting from this, this 
would explain why it seems always final was negligible faster when running your 
benchmark test case.


> Make serializer classes final
> -
>
> Key: KAFKA-4424
> URL: https://issues.apache.org/jira/browse/KAFKA-4424
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias Bechtold
>Priority: Minor
> Attachments: FinalTest.java, FinalTestReversed.java
>
>
> Implementations of simple serializers / deserializers should be final to 
> prevent JVM method call overhead.
> See also:
> https://wiki.openjdk.java.net/display/HotSpot/VirtualCalls
> This breaks the API slightly, inheritors must change to generic interfaces 
> Serializer / Deserializer. But architecture-wise final serialization classes 
> make the most sense to me.
> So what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691817#comment-15691817
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/24/16 12:55 AM:
-

The link you reference re virtual calls.

This is much more about monomorphic call or polymorphic calls. Making a class 
that implements an interface final, where the method invocation is by interface 
methods, does not change this.

This is more to do with the number of class's loaded that implement the 
interface. 

So in case of single implementation being used and loaded your jvm you have a 
monomorphic case for the interface, the JVM will inline this (final or not).

If you happen to have two implementations being used and loaded the jvm will 
still be able to inline but will create a branch case, the second loaded 
implementation will be slower if invoked due to the branch.

If you have more than two implementations loaded the JVM will on loading these 
do on stack replacement of the previously loaded inlined, and move to using 
virtual tables.


You'll see this occur if you turn on -XX:+PrintCompilation

A classical implementation and write up showing this is:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note taking the code, and running it with or without final 
implementations makes no difference.

Also i've taken this classical test, for your final and non final cases 
(attached to this jira), if you note I've loaded two versions, one with the 
final being declared and loaded by the JVM first and vice versa. As you note in 
both the implementation loaded first due to the inlined branch will be more 
performant.

On checking your original test case we noted that the FinalByteArraySerializer 
version runs first (due to alphabetic ordering that test are run in) , as such 
it would be always the first in the inline branch benefitting from this, this 
would explain why it seems always final was negligible faster when running your 
benchmark test case.



was (Author: michael.andre.pearce):
The link you reference re virtual calls.

This is much more about monomorphic call or polymorphic calls. Making a class 
that implements an interface final, where the method invocation is by interface 
methods, does not change this.

This is more to do with the number of class's loaded that implement the 
interface. 

So in case of single implementation being used and loaded your jvm you have a 
monomorphic case for the interface, the JVM will inline this (final or not).

If you happen to have two implementations being used and loaded the jvm will 
still be able to inline but will create a branch case, the second loaded 
implementation will be slower if invoked due to the branch.

If you have more than two implementations loaded the JVM will on loading these 
do on stack replacement of the previously loaded inlined, and move to using 
virtual tables.


You'll see this occur if you turn on -XX:+PrintCompilation

A classical implementation and write up showing this is:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note taking the code, and running it with or without final 
implementations makes no difference.

Also i've taken this classical test, for your final and non final cases 
(attached to this jira), if you note I've loaded two versions, one with the 
final being declared and loaded by the JVM first and vice versa. As you not in 
both the implementation loaded first due to the inlined branch will be more 
performant.

On checking your original test case we noted that the FinalByteArraySerializer 
version runs first (due to alphabetic ordering that test are run in) , as such 
it would be always the first in the inline branch benefitting from this, this 
would explain why it seems always final was negligible faster when running your 
benchmark test case.


> Make serializer classes final
> -
>
> Key: KAFKA-4424
> URL: https://issues.apache.org/jira/browse/KAFKA-4424
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias Bechtold
>Priority: Minor
> Attachments: FinalTest.java, FinalTestReversed.java
>
>
> Implementations of simple serializers / deserializers should be final to 
> prevent JVM method call overhead.
> See also:
> https://wiki.openjdk.java.net/display/HotSpot/VirtualCalls
> This breaks the API slightly, inheritors must change to generic interfaces 
> Serializer / Deserializer. But architecture-wise final serialization classes 
> make the most sense to me.
> So what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691817#comment-15691817
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/24/16 12:54 AM:
-

The link you reference re virtual calls.

This is much more about monomorphic call or polymorphic calls. Making a class 
that implements an interface final, where the method invocation is by interface 
methods, does not change this.

This is more to do with the number of class's loaded that implement the 
interface. 

So in case of single implementation being used and loaded your jvm you have a 
monomorphic case for the interface, the JVM will inline this (final or not).

If you happen to have two implementations being used and loaded the jvm will 
still be able to inline but will create a branch case, the second loaded 
implementation will be slower if invoked due to the branch.

If you have more than two implementations loaded the JVM will on loading these 
do on stack replacement of the previously loaded inlined, and move to using 
virtual tables.


You'll see this occur if you turn on -XX:+PrintCompilation

A classical implementation and write up showing this is:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note taking the code, and running it with or without final 
implementations makes no difference.

Also i've taken this classical test, for your final and non final cases 
(attached to this jira), if you note I've loaded two versions, one with the 
final being declared and loaded by the JVM first and vice versa. As you not in 
both the implementation loaded first due to the inlined branch will be more 
performant.

On checking your original test case we noted that the FinalByteArraySerializer 
version runs first (due to alphabetic ordering that test are run in) , as such 
it would be always the first in the inline branch benefitting from this, this 
would explain why it seems always final was negligible faster when running your 
benchmark test case.



was (Author: michael.andre.pearce):
The link you reference re virtual calls.

This is much more about monomorphic call or polymorphic calls. Making a class 
that implements an interface final, where the method invocation is by interface 
methods, does not change this.

This is more to do with the number of class's loaded that implement the 
interface. 

So in case of single implementation being used and loaded your jvm you have a 
monomorphic case for the interface, the JVM will inline this (final or not).

If you happen to have two implementations being used and loaded the jvm will 
still be able to inline but will create a branch case, the second loaded 
implementation will be slower if invoked due to the branch.

If you have more than two implementations loaded the JVM will on loading these 
do on stack replacement of the previously loaded inlined, and more to using 
visual tables.


You'll see this occur if you turn on -XX:+PrintCompilation

A classical implementation and write up showing this is:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note taking the code, and running it with or without final 
implementations makes no difference.

Also i've taken this classical test, for your final and non final cases 
(attached to this jira), if you note I've loaded two versions, one with the 
final being declared and loaded by the JVM first and vice versa. As you not in 
both the implementation loaded first due to the inlined branch will be more 
performant.

On checking your original test case we noted that the FinalByteArraySerializer 
version runs first (due to alphabetic ordering that test are run in) , as such 
it would be always the first in the inline branch benefitting from this, this 
would explain why it seems always final was negligible faster when running your 
benchmark test case.


> Make serializer classes final
> -
>
> Key: KAFKA-4424
> URL: https://issues.apache.org/jira/browse/KAFKA-4424
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias Bechtold
>Priority: Minor
> Attachments: FinalTest.java, FinalTestReversed.java
>
>
> Implementations of simple serializers / deserializers should be final to 
> prevent JVM method call overhead.
> See also:
> https://wiki.openjdk.java.net/display/HotSpot/VirtualCalls
> This breaks the API slightly, inheritors must change to generic interfaces 
> Serializer / Deserializer. But architecture-wise final serialization classes 
> make the most sense to me.
> So what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691817#comment-15691817
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4424:
--

The link you reference re virtual calls.

This is much more about monomorphic call or polymorphic calls. Making a class 
that implements an interface final, where the method invocation is by interface 
methods, does not change this.

This is more to do with the number of class's loaded that implement the 
interface. 

So in case of single implementation being used and loaded your jvm you have a 
monomorphic case for the interface, the JVM will inline this (final or not).

If you happen to have two implementations being used and loaded the jvm will 
still be able to inline but will create a branch case, the second loaded 
implementation will be slower if invoked due to the branch.

If you have more than two implementations loaded the JVM will on loading these 
do on stack replacement of the previously loaded inlined, and more to using 
visual tables.


You'll see this occur if you turn on -XX:+PrintCompilation

A classical implementation and write up showing this is:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note taking the code, and running it with or without final 
implementations makes no difference.

Also i've taken this classical test, for your final and non final cases 
(attached to this jira), if you note I've loaded two versions, one with the 
final being declared and loaded by the JVM first and vice versa. As you not in 
both the implementation loaded first due to the inlined branch will be more 
performant.

On checking your original test case we noted that the FinalByteArraySerializer 
version runs first (due to alphabetic ordering that test are run in) , as such 
it would be always the first in the inline branch benefitting from this, this 
would explain why it seems always final was negligible faster when running your 
benchmark test case.


> Make serializer classes final
> -
>
> Key: KAFKA-4424
> URL: https://issues.apache.org/jira/browse/KAFKA-4424
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias Bechtold
>Priority: Minor
> Attachments: FinalTest.java, FinalTestReversed.java
>
>
> Implementations of simple serializers / deserializers should be final to 
> prevent JVM method call overhead.
> See also:
> https://wiki.openjdk.java.net/display/HotSpot/VirtualCalls
> This breaks the API slightly, inheritors must change to generic interfaces 
> Serializer / Deserializer. But architecture-wise final serialization classes 
> make the most sense to me.
> So what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Andre Pearce (IG) updated KAFKA-4424:
-
Attachment: FinalTestReversed.java
FinalTest.java

> Make serializer classes final
> -
>
> Key: KAFKA-4424
> URL: https://issues.apache.org/jira/browse/KAFKA-4424
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias Bechtold
>Priority: Minor
> Attachments: FinalTest.java, FinalTestReversed.java
>
>
> Implementations of simple serializers / deserializers should be final to 
> prevent JVM method call overhead.
> See also:
> https://wiki.openjdk.java.net/display/HotSpot/VirtualCalls
> This breaks the API slightly, inheritors must change to generic interfaces 
> Serializer / Deserializer. But architecture-wise final serialization classes 
> make the most sense to me.
> So what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Andre Pearce (IG) updated KAFKA-4424:
-
Comment: was deleted

(was: Just reading the link you reference.

Virtual calls and jump tables is monomorphic vs polymorphic calls.

If only one implementation(this implementation being final or not) of an 
interface is loaded then you will get a monomorphic implementation and it can 
be fully inlined. 

If you load another, your method is inlined but with a branch, once you load 
further hotspot will do on stack replacement and this is when jump tables are 
needed.

You can see this occurring with on stack replacement with 
-XX:+PrintCompilation, as classes are loaded.

A classical test case showing this:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note that making the class's final or not in the test, makes no 
difference to the outcome.

In case of serialisers/deserialisers this is our case as we have an interface 
that is implemented, as such making these implementations final doesn't help, 
as all reference by calling classes are invoking methods on the interface, as 
such its performance is about how many implementations are loaded.)

> Make serializer classes final
> -
>
> Key: KAFKA-4424
> URL: https://issues.apache.org/jira/browse/KAFKA-4424
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias Bechtold
>Priority: Minor
>
> Implementations of simple serializers / deserializers should be final to 
> prevent JVM method call overhead.
> See also:
> https://wiki.openjdk.java.net/display/HotSpot/VirtualCalls
> This breaks the API slightly, inheritors must change to generic interfaces 
> Serializer / Deserializer. But architecture-wise final serialization classes 
> make the most sense to me.
> So what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691736#comment-15691736
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/24/16 12:12 AM:
-

Just reading the link you reference.

Virtual calls and jump tables is monomorphic vs polymorphic calls.

If only one implementation(this implementation being final or not) of an 
interface is loaded then you will get a monomorphic implementation and it can 
be fully inlined. 

If you load another, your method is inlined but with a branch, once you load 
further hotspot will do on stack replacement and this is when jump tables are 
needed.

You can see this occurring with on stack replacement with 
-XX:+PrintCompilation, as classes are loaded.

A classical test case showing this:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note that making the class's final or not in the test, makes no 
difference to the outcome.

In case of serialisers/deserialisers this is our case as we have an interface 
that is implemented, as such making these implementations final doesn't help, 
as all reference by calling classes are invoking methods on the interface, as 
such its performance is about how many implementations are loaded.


was (Author: michael.andre.pearce):
Just reading the link you reference.

Virtual calls and jump tables is monomorphic vs polymorphic calls.

If only one implementation(this implementation being final or not) of an 
interface is loaded then you will get a monomorphic implementation and it can 
be fully inlined. 

If you load another, your method is inlined but with a branch, once you load 
further hotspot will get on stack replacement and this is when jump tables are 
needed.

You can see this occurring with on stack replacement with 
-XX:+PrintCompilation, as classes are loaded.

A classical test case showing this:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note that making the class's final or not in the test, makes no 
difference to the outcome.

In case of serialisers/deserialisers this is our case as we have an interface 
that is implemented, as such making these implementations final doesn't help, 
as all reference by calling classes are invoking methods on the interface, as 
such its performance is about how many implementations are loaded.

> Make serializer classes final
> -
>
> Key: KAFKA-4424
> URL: https://issues.apache.org/jira/browse/KAFKA-4424
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias Bechtold
>Priority: Minor
>
> Implementations of simple serializers / deserializers should be final to 
> prevent JVM method call overhead.
> See also:
> https://wiki.openjdk.java.net/display/HotSpot/VirtualCalls
> This breaks the API slightly, inheritors must change to generic interfaces 
> Serializer / Deserializer. But architecture-wise final serialization classes 
> make the most sense to me.
> So what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691736#comment-15691736
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/24/16 12:12 AM:
-

Just reading the link you reference.

Virtual calls and jump tables is monomorphic vs polymorphic calls.

If only one implementation(this implementation being final or not) of an 
interface is loaded then you will get a monomorphic implementation and it can 
be fully inlined. 

If you load another, your method is inlined but with a branch, once you load 
further hotspot will get on stack replacement and this is when jump tables are 
needed.

You can see this occurring with on stack replacement with 
-XX:+PrintCompilation, as classes are loaded.

A classical test case showing this:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note that making the class's final or not in the test, makes no 
difference to the outcome.

In case of serialisers/deserialisers this is our case as we have an interface 
that is implemented, as such making these implementations final doesn't help, 
as all reference by calling classes are invoking methods on the interface, as 
such its performance is about how many implementations are loaded.


was (Author: michael.andre.pearce):
Just reading the link you reference.

Virtual calls and jump tables is monomorphic vs polymorphic calls.

If only one implementation(this implementation being final or not) of an 
interface is loaded then you will get a monomorphic implementation and it can 
be fully inlined. 

If you load another, your method is inlined but with a branch, once you load 
further hotspot will get on stack replacement and this is when jump tables are 
needed.

You can see this occurring with on stack replacement with 
-XX:+PrintCompilation, as classes are loaded.

A classical test case showing this:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note that making the class's final or not in the test, makes no 
difference to the outcome.

In case of serialisers/deserialisers this is our case as we have an interface 
that is implemented, as such making these implementations final doesn't do much.

> Make serializer classes final
> -
>
> Key: KAFKA-4424
> URL: https://issues.apache.org/jira/browse/KAFKA-4424
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias Bechtold
>Priority: Minor
>
> Implementations of simple serializers / deserializers should be final to 
> prevent JVM method call overhead.
> See also:
> https://wiki.openjdk.java.net/display/HotSpot/VirtualCalls
> This breaks the API slightly, inheritors must change to generic interfaces 
> Serializer / Deserializer. But architecture-wise final serialization classes 
> make the most sense to me.
> So what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691736#comment-15691736
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4424:
--

Just reading the link you reference.

Virtual calls and jump tables is monomorphic vs polymorphic calls.

If only one implementation(this implementation being final or not) of an 
interface is loaded then you will get a monomorphic implementation and it can 
be fully inlined. 

If you load another, your method is inlined but with a branch, once you load 
further hotspot will get on stack replacement and this is when jump tables are 
needed.

You can see this occurring with on stack replacement with 
-XX:+PrintCompilation, as classes are loaded.

A classical test case showing this:
http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html

You'll note that making the class's final or not in the test, makes no 
difference to the outcome.

In case of serialisers/deserialisers this is our case as we have an interface 
that is implemented, as such making these implementations final doesn't do much.

> Make serializer classes final
> -
>
> Key: KAFKA-4424
> URL: https://issues.apache.org/jira/browse/KAFKA-4424
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias Bechtold
>Priority: Minor
>
> Implementations of simple serializers / deserializers should be final to 
> prevent JVM method call overhead.
> See also:
> https://wiki.openjdk.java.net/display/HotSpot/VirtualCalls
> This breaks the API slightly, inheritors must change to generic interfaces 
> Serializer / Deserializer. But architecture-wise final serialization classes 
> make the most sense to me.
> So what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691571#comment-15691571
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/23/16 11:30 PM:
-

Hi Matthias,

Thanks for putting together some initial tests.

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark                                   Mode   Cnt        Score      Error  Units
KafkaBenchmark.testFinalSerializer          thrpt  1000     9378.179 ±   26.059  ops/s
KafkaBenchmark.testFinalSerializerNoFlush   thrpt  1000  1283796.450 ± 4976.711  ops/s
KafkaBenchmark.testSerializer               thrpt  1000     9325.273 ±   26.581  ops/s
KafkaBenchmark.testSerializerNoFlush        thrpt  1000  1289296.549 ± 5127.774  ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1.0 version, on JDK 
1.8.0_112, VM 25.112-b16 and kafka setup locally (as per we understand you have 
done), the macbook was on El Capitan, and the cisco blade and nutanix vm are 
running RHEL7.

The above stats copied are from running in particular on a laptop (but are 
inline with what we've seen also in our server environment) just easier to copy 
from as our server environments are protected.

MacBook Pro (Retina, 15-inch, Mid 2015)
2.2 GHz Intel Core i7
16 GB 1600 MHz DDR3


This is what we have come to expect, essentially making a class final doesn't 
give as much a perfomance boost as people come to think, as the modern jvm 
compilers like hotspot (there is another commercial one which I'm sure we all 
know of ;) but as its propriety/commercial shall avoid it for this discussion 
), they really do a lot of magic under the hood for us.

Please read:
http://www.oracle.com/technetwork/java/whitepaper-135217.html#impact
http://www.oracle.com/technetwork/java/whitepaper-135217.html#optimizations

This article by Brian Goetz (his credentials rather speak for themselves) is from 
way back and is now dated with regard to precise JVM internals, as a lot has moved 
on since:
http://www.ibm.com/developerworks/java/library/j-jtp1029/index.html

Still, it contains one core takeaway I think is important, and one that matters 
more than performance when deciding whether or not to use final.

"
final classes and methods can be a significant inconvenience when programming 
-- they limit your options for reusing existing code and extending the 
functionality of existing classes. While sometimes a class is made final for a 
good reason, such as to enforce immutability, the benefits of using final 
should outweigh the inconvenience. Performance enhancement is almost always a 
bad reason to compromise good object-oriented design principles, and when the 
performance enhancement is small or nonexistent, this is a bad trade-off indeed.
"

Since this is an API-level change that makes the API more restrictive, I think it 
really needs a KIP, and no doubt some further tests and arguments in the KIP 
discussion around the pros and cons.

It's also worth noting that very recently, on current master, ProducerRecord and 
ConsumerRecord were made non-final for extensibility reasons.
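
As a rough illustration of how a throughput comparison like the one above can be 
structured, here is a minimal JMH sketch; it assumes JMH is on the classpath, and 
the class, method and payload names are illustrative rather than the actual 
benchmark discussed in this thread.

import java.nio.charset.StandardCharsets;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 10)        // let the JIT warm up before measuring
@Measurement(iterations = 1000) // roughly matches the sample count shown above
public class SerializerBenchmark {

    // Non-final implementation: the call site below is virtual in bytecode, but
    // HotSpot can still devirtualise it while the site remains monomorphic.
    static class OpenStringSerializer {
        byte[] serialize(String data) {
            return data.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Final implementation: rules out subclasses, which is the effect the
    // original proposal hopes to exploit.
    static final class FinalStringSerializer {
        byte[] serialize(String data) {
            return data.getBytes(StandardCharsets.UTF_8);
        }
    }

    private final OpenStringSerializer open = new OpenStringSerializer();
    private final FinalStringSerializer sealed = new FinalStringSerializer();

    @Benchmark
    public byte[] testSerializer() {
        return open.serialize("payload");
    }

    @Benchmark
    public byte[] testFinalSerializer() {
        return sealed.serialize("payload");
    }
}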


was (Author: michael.andre.pearce):
Hi Matthias,

Thanks for putting together some initial tests.

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark   Mode   CntScore  Error  
Units
KafkaBenchmark.testFinalSerializer thrpt  1000 9378.179 ±   26.059  
ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  
ops/s
KafkaBenchmark.testSerializer  thrpt  1000 9325.273 ±   26.581  
ops/s
KafkaBenchmark.testSerializerNoFlush   thrpt  1000  1289296.549 ± 5127.774  
ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1

[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691571#comment-15691571
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/23/16 11:28 PM:
-

Hi Matthias,

Thanks for putting together some initial tests.

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark   Mode   CntScore  Error  
Units
KafkaBenchmark.testFinalSerializer thrpt  1000 9378.179 ±   26.059  
ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  
ops/s
KafkaBenchmark.testSerializer  thrpt  1000 9325.273 ±   26.581  
ops/s
KafkaBenchmark.testSerializerNoFlush   thrpt  1000  1289296.549 ± 5127.774  
ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1.0 version, on JDK 
1.8.0_112, VM 25.112-b16 and kafka setup locally (as per we understand you have 
done), the macbook was on El Capitan, and the cisco blade and nutanix vm are 
running RHEL7.

The above stats copied are from running in particular on a laptop (but are 
inline with what we've seen also in our server environment) just easier to copy 
from as our server environments are protected.

MacBook Pro (Retina, 15-inch, Mid 2015)
2.2 GHz Intel Core i7
16 GB 1600 MHz DDR3


This is what we have come to expect, essentially making a class final doesn't 
give as much a perfomance boost as people come to think, as the modern jvm 
compilers like hotspot (there is another commercial one which I'm sure we all 
know of ;) but as its propriety/commercial shall avoid it for this discussion 
), they really do a lot of magic under the hood for us.

Please read:
http://www.oracle.com/technetwork/java/whitepaper-135217.html#impact

Whilst this article (by  Brian Goetz - his credentials kinda speak for them 
selves) from way back when is very dated and now not relevant in regards to 
precise jvm internals as a lot has moved on. 
http://www.ibm.com/developerworks/java/library/j-jtp1029/index.html

There is one core take away i think is important, and is more important when 
making a decision to use final or not.

"
final classes and methods can be a significant inconvenience when programming 
-- they limit your options for reusing existing code and extending the 
functionality of existing classes. While sometimes a class is made final for a 
good reason, such as to enforce immutability, the benefits of using final 
should outweigh the inconvenience. Performance enhancement is almost always a 
bad reason to compromise good object-oriented design principles, and when the 
performance enhancement is small or nonexistent, this is a bad trade-off indeed.
"

This is a change to API level that becomes more restrictive, then i think 
really this should need a KIP, and no doubt some further tests and arguments on 
a KIP discussion for the pros and con's. 

Its worth noting also very recently the ProducerRecord and ConsumerRecord, for 
extensibility reasons were made non-final, if you take the current master.


was (Author: michael.andre.pearce):
Hi Matthias,

Thanks for putting together some initial tests.

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark   Mode   CntScore  Error  
Units
KafkaBenchmark.testFinalSerializer thrpt  1000 9378.179 ±   26.059  
ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  
ops/s
KafkaBenchmark.testSerializer  thrpt  1000 9325.273 ±   26.581  
ops/s
KafkaBenchmark.testSerializerNoFlush   thrpt  1000  1289296.549 ± 5127.774  
ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1.0 version, on JDK 
1.8.0_112, VM 25.112-b16 and kafka setup locally (as per

[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691571#comment-15691571
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/23/16 11:23 PM:
-

Hi Matthias,

Thanks for putting together some initial tests.

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark   Mode   CntScore  Error  
Units
KafkaBenchmark.testFinalSerializer thrpt  1000 9378.179 ±   26.059  
ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  
ops/s
KafkaBenchmark.testSerializer  thrpt  1000 9325.273 ±   26.581  
ops/s
KafkaBenchmark.testSerializerNoFlush   thrpt  1000  1289296.549 ± 5127.774  
ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1.0 version, on JDK 
1.8.0_112, VM 25.112-b16 and kafka setup locally (as per we understand you have 
done), the macbook was on El Capitan, and the cisco blade and nutanix vm are 
running RHEL7.

The above stats copied are from running in particular on a laptop (but are 
inline with what we've seen also in our server environment) just easier to copy 
from as our server environments are protected.

MacBook Pro (Retina, 15-inch, Mid 2015)
2.2 GHz Intel Core i7
16 GB 1600 MHz DDR3


This is what we have come to expect, essentially making a class final doesn't 
give as much a perfomance boost as people come to think, as the modern jvm 
compilers like hotspot (there is another commercial one which I'm sure we all 
know of ;) but as its propriety/commercial shall avoid it for this discussion 
), they really do a lot of magic under the hood for us.

Whilst this article (by  Brian Goetz - his credentials kinda speak for them 
selves) from way back when is very dated and now not relevant in regards to 
precise jvm internals as a lot has moved on. 
http://www.ibm.com/developerworks/java/library/j-jtp1029/index.html

There is one core take away i think is important, and is more important when 
making a decision to use final or not.

"
final classes and methods can be a significant inconvenience when programming 
-- they limit your options for reusing existing code and extending the 
functionality of existing classes. While sometimes a class is made final for a 
good reason, such as to enforce immutability, the benefits of using final 
should outweigh the inconvenience. Performance enhancement is almost always a 
bad reason to compromise good object-oriented design principles, and when the 
performance enhancement is small or nonexistent, this is a bad trade-off indeed.
"

This is a change to API level that becomes more restrictive, then i think 
really this should need a KIP, and no doubt some further tests and arguments on 
a KIP discussion for the pros and con's. 

Its worth noting also very recently the ProducerRecord and ConsumerRecord, for 
extensibility reasons were made non-final, if you take the current master.


was (Author: michael.andre.pearce):
Hi Matthias,

Thanks for putting together some initial tests.

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark   Mode   CntScore  Error  
Units
KafkaBenchmark.testFinalSerializer thrpt  1000 9378.179 ±   26.059  
ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  
ops/s
KafkaBenchmark.testSerializer  thrpt  1000 9325.273 ±   26.581  
ops/s
KafkaBenchmark.testSerializerNoFlush   thrpt  1000  1289296.549 ± 5127.774  
ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1.0 version, on JDK 
1.8.0_112, VM 25.112-b16 and kafka setup locally (as per we understand you have 
done), the macbook was on El Capitan, and the cisco blade 

[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691571#comment-15691571
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/23/16 11:22 PM:
-

Hi Matthias,

Thanks for putting together some initial tests.

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark   Mode   CntScore  Error  
Units
KafkaBenchmark.testFinalSerializer thrpt  1000 9378.179 ±   26.059  
ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  
ops/s
KafkaBenchmark.testSerializer  thrpt  1000 9325.273 ±   26.581  
ops/s
KafkaBenchmark.testSerializerNoFlush   thrpt  1000  1289296.549 ± 5127.774  
ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1.0 version, on JDK 
1.8.0_112, VM 25.112-b16 and kafka setup locally (as per we understand you have 
done), the macbook was on El Capitan, and the cisco blade and nutanix vm are 
running RHEL7.

The above stats copied are from running in particular on a laptop (but are 
inline with what we've seen also in our server environment) just easier to copy 
from as our server environments are protected.

MacBook Pro (Retina, 15-inch, Mid 2015)
2.2 GHz Intel Core i7
16 GB 1600 MHz DDR3


This is what we have come to expect, essentially making a class final doesn't 
give as much a perfomance boost as people come to think, as the modern jvm 
compilers like hotspot (there is another commercial one which I'm sure we all 
know of ;) but as its propriety/commercial shall avoid it for this discussion 
), they really do a lot of magic under the hood for us.

Whilst this article (by  Brian Goetz - his credentials kinda speak for them 
selves) from way back when is very dated and now not relevant in regards to 
precise jvm internals as a lot has moved on. 
http://www.ibm.com/developerworks/java/library/j-jtp1029/index.html

There is one core take away i think is important, and is more important when 
making a decision to use final or not.

"
final classes and methods can be a significant inconvenience when programming 
-- they limit your options for reusing existing code and extending the 
functionality of existing classes. While sometimes a class is made final for a 
good reason, such as to enforce immutability, the benefits of using final 
should outweigh the inconvenience. Performance enhancement is almost always a 
bad reason to compromise good object-oriented design principles, and when the 
performance enhancement is small or nonexistent, this is a bad trade-off indeed.
"

This is a change to API level, then i think really this should need a KIP, and 
no doubt some further tests and arguments on a KIP discussion for the pros and 
con's. 

Its worth noting also very recently the ProducerRecord and ConsumerRecord, for 
extensibility reasons were made non-final, if you take the current master.


was (Author: michael.andre.pearce):
Hi Matthias,

Thanks for putting together some initial tests.

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark   Mode   CntScore  Error  
Units
KafkaBenchmark.testFinalSerializer thrpt  1000 9378.179 ±   26.059  
ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  
ops/s
KafkaBenchmark.testSerializer  thrpt  1000 9325.273 ±   26.581  
ops/s
KafkaBenchmark.testSerializerNoFlush   thrpt  1000  1289296.549 ± 5127.774  
ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1.0 version, on JDK 
1.8.0_112, VM 25.112-b16 and kafka setup locally (as per we understand you have 
done), the macbook was on El Capitan, and the cisco blade and nutanix vm are 
running RH

[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691571#comment-15691571
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/23/16 11:12 PM:
-

Hi Matthias,

Thanks for putting together some initial tests.

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark   Mode   CntScore  Error  
Units
KafkaBenchmark.testFinalSerializer thrpt  1000 9378.179 ±   26.059  
ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  
ops/s
KafkaBenchmark.testSerializer  thrpt  1000 9325.273 ±   26.581  
ops/s
KafkaBenchmark.testSerializerNoFlush   thrpt  1000  1289296.549 ± 5127.774  
ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1.0 version, on JDK 
1.8.0_112, VM 25.112-b16 and kafka setup locally (as per we understand you have 
done), the macbook was on El Capitan, and the cisco blade and nutanix vm are 
running RHEL7.

The above stats copied are from running in particular on a laptop (but are 
inline with what we've seen also in our server environment) just easier to copy 
from as our server environments are protected.

MacBook Pro (Retina, 15-inch, Mid 2015)
2.2 GHz Intel Core i7
16 GB 1600 MHz DDR3


This is what we have come to expect, essentially making a class final doesn't 
give as much a perfomance boost as people come to think, as the modern jvm 
compilers like hotspot (there is another commercial one which I'm sure we all 
know of ;) but as its propriety/commercial shall avoid it for this discussion 
), they really do a lot of magic under the hood for us.

Whilst this article (by  Brian Goetz - his credentials kinda speak for them 
selves) from way back when is very dated and now not relevant in regards to 
precise jvm internals as a lot has moved on. 
http://www.ibm.com/developerworks/java/library/j-jtp1029/index.html

There is one core take away i think is important, and is more important when 
making a decision to use final or not.

"
final classes and methods can be a significant inconvenience when programming 
-- they limit your options for reusing existing code and extending the 
functionality of existing classes. While sometimes a class is made final for a 
good reason, such as to enforce immutability, the benefits of using final 
should outweigh the inconvenience. Performance enhancement is almost always a 
bad reason to compromise good object-oriented design principles, and when the 
performance enhancement is small or nonexistent, this is a bad trade-off indeed.
"

As stated, this is a change to API level, then i think really this should need 
a KIP, and no doubt some further tests and arguments on a KIP discussion for 
the pros and con's.

Its worth noting also very recently the ProducerRecord and ConsumerRecord, for 
extensibility reasons were made non-final, if you take the current master.


was (Author: michael.andre.pearce):
Hi Matthias,

Thanks for putting together some initial tests.

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark   Mode   CntScore  Error  
Units
KafkaBenchmark.testFinalSerializer thrpt  1000 9378.179 ±   26.059  
ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  
ops/s
KafkaBenchmark.testSerializer  thrpt  1000 9325.273 ±   26.581  
ops/s
KafkaBenchmark.testSerializerNoFlush   thrpt  1000  1289296.549 ± 5127.774  
ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1.0 version, on JDK 
1.8.0_112, VM 25.112-b16 and kafka setup locally (as per we understand you have 
done), the macbook was on El Capitan, and the cisco blade and nutanix vm are 

[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691571#comment-15691571
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/23/16 11:01 PM:
-

Hi Matthias,

Thanks for putting together some initial tests.

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark   Mode   CntScore  Error  
Units
KafkaBenchmark.testFinalSerializer thrpt  1000 9378.179 ±   26.059  
ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  
ops/s
KafkaBenchmark.testSerializer  thrpt  1000 9325.273 ±   26.581  
ops/s
KafkaBenchmark.testSerializerNoFlush   thrpt  1000  1289296.549 ± 5127.774  
ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1.0 version, on JDK 
1.8.0_112, VM 25.112-b16 and kafka setup locally (as per we understand you have 
done), the macbook was on El Capitan, and the cisco blade and nutanix vm are 
running RHEL7.

The above stats copied are from running in particular on a laptop (but are 
inline with what we've seen also in our server environment) just easier to copy 
from as our server environments are protected.

MacBook Pro (Retina, 15-inch, Mid 2015)
2.2 GHz Intel Core i7
16 GB 1600 MHz DDR3


This is what we have come to expect, essentially making a class final doesn't 
give as much a perfomance boost as people come to think, as the modern jvm 
compilers like hotspot (there is another commercial one which I'm sure we all 
know of ;) but as its propriety/commercial shall avoid it for this discussion 
), they really do a lot of magic under the hood for us.

Whilst this article from way back when is very dated and now not relevant in 
regards to precise jvm internals as a lot has moved on. 
http://www.ibm.com/developerworks/java/library/j-jtp1029/index.html

There is one core take away i think is important, and is more important when 
making a decision to use final or not.

"
final classes and methods can be a significant inconvenience when programming 
-- they limit your options for reusing existing code and extending the 
functionality of existing classes. While sometimes a class is made final for a 
good reason, such as to enforce immutability, the benefits of using final 
should outweigh the inconvenience. Performance enhancement is almost always a 
bad reason to compromise good object-oriented design principles, and when the 
performance enhancement is small or nonexistent, this is a bad trade-off indeed.
"

As stated, this is a change to API level, then i think really this should need 
a KIP, and no doubt some further tests and arguments on a KIP discussion for 
the pros and con's.

Its worth noting also very recently the ProducerRecord and ConsumerRecord, for 
extensibility reasons were made non-final, if you take the current master.


was (Author: michael.andre.pearce):
Hi Matthias,

Thanks for putting together some initial tests.

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark   Mode   CntScore  Error  
Units
KafkaBenchmark.testFinalSerializer thrpt  1000 9378.179 ±   26.059  
ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  
ops/s
KafkaBenchmark.testSerializer  thrpt  1000 9325.273 ±   26.581  
ops/s
KafkaBenchmark.testSerializerNoFlush   thrpt  1000  1289296.549 ± 5127.774  
ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1.0 version, on JDK 
1.8.0_112, VM 25.112-b16 and kafka setup locally (as per we understand you have 
done), the macbook was on El Capitan, and the cisco blade and nutanix vm are 
running RHEL7.

The above stats copied are from running in partic

[jira] [Comment Edited] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691571#comment-15691571
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4424 at 11/23/16 10:57 PM:
-

Hi Matthias,

Thanks for putting together some initial tests.

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark   Mode   CntScore  Error  
Units
KafkaBenchmark.testFinalSerializer thrpt  1000 9378.179 ±   26.059  
ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  
ops/s
KafkaBenchmark.testSerializer  thrpt  1000 9325.273 ±   26.581  
ops/s
KafkaBenchmark.testSerializerNoFlush   thrpt  1000  1289296.549 ± 5127.774  
ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1.0 version, on JDK 
1.8.0_112, VM 25.112-b16 and kafka setup locally (as per we understand you have 
done), the macbook was on El Capitan, and the cisco blade and nutanix vm are 
running RHEL7.

The above stats copied are from running in particular on a laptop (but are 
inline with what we've seen also in our server environment) just easier to copy 
from as our server environments are protected.

MacBook Pro (Retina, 15-inch, Mid 2015)
2.2 GHz Intel Core i7
16 GB 1600 MHz DDR3


This is what we have come to expect, essentially making a class final doesn't 
give as much a speed boost as people come to think, as the modern jvm compilers 
like hotspot (there is another commercial one which I'm sure we all know of ;) 
but as its propriety/commercial shall avoid it for this discussion ), they 
really do a lot of magic under the hood for us.

Whilst this article from way back when is very dated and now not relevant in 
regards to precise jvm internals as a lot has moved on. 
http://www.ibm.com/developerworks/java/library/j-jtp1029/index.html

There is one core take away i think is important, and is more important when 
making a decision to use final or not.

"
final classes and methods can be a significant inconvenience when programming 
-- they limit your options for reusing existing code and extending the 
functionality of existing classes. While sometimes a class is made final for a 
good reason, such as to enforce immutability, the benefits of using final 
should outweigh the inconvenience. Performance enhancement is almost always a 
bad reason to compromise good object-oriented design principles, and when the 
performance enhancement is small or nonexistent, this is a bad trade-off indeed.
"

As stated, this is a change to API level, then i think really this should need 
a KIP, and no doubt some further tests and arguments on a KIP discussion for 
the pros and con's.

Its worth noting also very recently the ProducerRecord and ConsumerRecord, for 
extensibility reasons were made non-final, if you take the current master.


was (Author: michael.andre.pearce):
Hi Matthias,

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark   Mode   CntScore  Error  
Units
KafkaBenchmark.testFinalSerializer thrpt  1000 9378.179 ±   26.059  
ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  
ops/s
KafkaBenchmark.testSerializer  thrpt  1000 9325.273 ±   26.581  
ops/s
KafkaBenchmark.testSerializerNoFlush   thrpt  1000  1289296.549 ± 5127.774  
ops/s

The performance difference we are seeing is very negligible at best. We have 
run this across a few machines (1 macbook, 1 cisco blade server, 1 nutanix vm) 
within our company and get similar results between final and non-final. (we 
actually had one result come in from the linux vm running on our nutanix 
clusters where the non-final was negligible faster, we repeated and it reversed 
to have the other negligible faster, but it does show that this seems to be 
negligible).

We have run on all the machines using the latest Kakfa 0.10.1.0 version, on JDK 
1.8.0_112, VM 25.112-b16 and kafka setup locally (as per we understand you have 
done), the macbook was on El Capitan, and the cisco blade and nutanix vm are 
running RHEL7.

The above stats copied are from running in particular on a laptop (but are 
inline with what we've seen

[jira] [Commented] (KAFKA-4424) Make serializer classes final

2016-11-23 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691571#comment-15691571
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4424:
--

Hi Matthias,

When we took your test and ran it this evening (with the same parameters) we 
get:

# Run complete. Total time: 01:40:11

Benchmark                                   Mode   Cnt        Score      Error  Units
KafkaBenchmark.testFinalSerializer         thrpt  1000     9378.179 ±   26.059  ops/s
KafkaBenchmark.testFinalSerializerNoFlush  thrpt  1000  1283796.450 ± 4976.711  ops/s
KafkaBenchmark.testSerializer              thrpt  1000     9325.273 ±   26.581  ops/s
KafkaBenchmark.testSerializerNoFlush       thrpt  1000  1289296.549 ± 5127.774  ops/s

The performance difference we are seeing is negligible at best. We have run this 
across a few machines (one MacBook, one Cisco blade server, one Nutanix VM) within 
our company and get similar results between final and non-final. (We actually had 
one result from the Linux VM on our Nutanix cluster where the non-final version was 
negligibly faster; we repeated the run and it reversed, with the other negligibly 
faster, which again suggests the difference is negligible.)

All machines ran the latest Kafka 0.10.1.0 release on JDK 1.8.0_112 (VM 25.112-b16), 
with Kafka set up locally (as we understand you have done). The MacBook was on 
El Capitan; the Cisco blade and the Nutanix VM run RHEL7.

The stats above were copied from the laptop run (they are in line with what we have 
seen in our server environment; the laptop is simply easier to copy from, as our 
server environments are locked down).

MacBook Pro (Retina, 15-inch, Mid 2015)
2.2 GHz Intel Core i7
16 GB 1600 MHz DDR3


This is what we have come to expect: making a class final doesn't give as much of a 
speed boost as people tend to think, because modern JVM compilers such as HotSpot 
(there is another commercial one which I'm sure we all know of ;) but as it's 
proprietary/commercial I shall avoid it for this discussion) really do a lot of 
magic under the hood for us.

This article from way back is now dated with regard to precise JVM internals, as a 
lot has moved on since:
http://www.ibm.com/developerworks/java/library/j-jtp1029/index.html

Still, it contains one core takeaway I think is important, and one that matters 
more than performance when deciding whether or not to use final.

"
final classes and methods can be a significant inconvenience when programming 
-- they limit your options for reusing existing code and extending the 
functionality of existing classes. While sometimes a class is made final for a 
good reason, such as to enforce immutability, the benefits of using final 
should outweigh the inconvenience. Performance enhancement is almost always a 
bad reason to compromise good object-oriented design principles, and when the 
performance enhancement is small or nonexistent, this is a bad trade-off indeed.
"

As stated, since this is an API-level change, I think it really needs a KIP, and no 
doubt some further tests and arguments in the KIP discussion around the pros and 
cons.

It's also worth noting that very recently, on current master, ProducerRecord and 
ConsumerRecord were made non-final for extensibility reasons.
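
To make the devirtualisation point above concrete, here is a small, self-contained 
sketch (class names are made up for the example): with only one implementation of 
the interface loaded, HotSpot treats the call site as monomorphic and can inline it 
whether or not the class is final. Running it with 
-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining should show the serialize() call 
reported as inlined once the loop is hot.

import java.nio.charset.StandardCharsets;

public class Devirtualization {

    interface Serializer {
        byte[] serialize(String data);
    }

    // Deliberately NOT final: the JIT can still devirtualise the call below
    // as long as no other implementation has been loaded.
    static class Utf8Serializer implements Serializer {
        public byte[] serialize(String data) {
            return data.getBytes(StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        Serializer s = new Utf8Serializer();
        long total = 0;
        // enough iterations for the method to be JIT-compiled
        for (int i = 0; i < 1_000_000; i++) {
            total += s.serialize("payload-" + i).length;
        }
        System.out.println(total);
    }
}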

> Make serializer classes final
> -
>
> Key: KAFKA-4424
> URL: https://issues.apache.org/jira/browse/KAFKA-4424
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias Bechtold
>Priority: Minor
>
> Implementations of simple serializers / deserializers should be final to 
> prevent JVM method call overhead.
> See also:
> https://wiki.openjdk.java.net/display/HotSpot/VirtualCalls
> This breaks the API slightly, inheritors must change to generic interfaces 
> Serializer / Deserializer. But architecture-wise final serialization classes 
> make the most sense to me.
> So what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4424) Make serializer classes final

2016-11-20 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15681447#comment-15681447
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4424:
--

Is there a supporting test that proves a significant performance improvement in an 
end-to-end setup after JIT/JVM warm-up?

The obvious reason for asking is that this would break any code in organisations 
that has extended these classes, since they are part of the client API; to justify 
the change we need to demonstrate a significant percentage performance improvement 
that warrants breaking them.

> Make serializer classes final
> -
>
> Key: KAFKA-4424
> URL: https://issues.apache.org/jira/browse/KAFKA-4424
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Matthias Bechtold
>Priority: Minor
>
> Implementations of simple serializers / deserializers should be final to 
> prevent JVM method call overhead.
> See also:
> https://wiki.openjdk.java.net/display/HotSpot/VirtualCalls
> This breaks the API slightly, inheritors must change to generic interfaces 
> Serializer / Deserializer. But architecture-wise final serialization classes 
> make the most sense to me.
> So what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4341) KIP-82 - Add Compaction Tombstone Flag

2016-10-25 Thread Michael Andre Pearce (IG) (JIRA)
Michael Andre Pearce (IG) created KAFKA-4341:


 Summary: KIP-82 - Add Compaction Tombstone Flag
 Key: KAFKA-4341
 URL: https://issues.apache.org/jira/browse/KAFKA-4341
 Project: Kafka
  Issue Type: Improvement
  Components: clients, core
Affects Versions: 0.10.1.0
Reporter: Michael Andre Pearce (IG)


Currently compaction works based on a null value. There are some use cases where 
this is ineffective or forces unclean workarounds (as discussed in the KIP-82 
discussion). This issue is to add a tombstone flag so that compaction also works 
where the value is not null, with the producer signifying the deletion to the 
broker.

This JIRA is related to KIP found here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-87+-+Add+Compaction+Tombstone+Flag
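
For context, this is how a delete marker has to be expressed today: a record with a 
null value. The sketch below assumes a local broker; the topic and key names are 
placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TombstoneToday {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // On a log-compacted topic, a null value is currently the only way to
            // mark the key for deletion; the record cannot carry any payload.
            producer.send(new ProducerRecord<>("compacted-topic", "user-42", null));
        }
    }
}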



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4341) KIP-87 - Add Compaction Tombstone Flag

2016-10-25 Thread Michael Andre Pearce (IG) (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Andre Pearce (IG) updated KAFKA-4341:
-
Summary: KIP-87 - Add Compaction Tombstone Flag  (was: KIP-82 - Add 
Compaction Tombstone Flag)

> KIP-87 - Add Compaction Tombstone Flag
> --
>
> Key: KAFKA-4341
> URL: https://issues.apache.org/jira/browse/KAFKA-4341
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, core
>Affects Versions: 0.10.1.0
>Reporter: Michael Andre Pearce (IG)
>
> Currently compaction works based on a null value. There are some use cases where 
> this is ineffective or forces unclean workarounds (as discussed in the KIP-82 
> discussion). This issue is to add a tombstone flag so that compaction also works 
> where the value is not null, with the producer signifying the deletion to the 
> broker.
> This JIRA is related to KIP found here:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-87+-+Add+Compaction+Tombstone+Flag



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4208) Add Record Headers

2016-09-22 Thread Michael Andre Pearce (IG) (JIRA)
Michael Andre Pearce (IG) created KAFKA-4208:


 Summary: Add Record Headers
 Key: KAFKA-4208
 URL: https://issues.apache.org/jira/browse/KAFKA-4208
 Project: Kafka
  Issue Type: New Feature
  Components: clients, core
Reporter: Michael Andre Pearce (IG)
Priority: Critical


Currently headers are not natively supported, unlike in many transport and 
messaging platforms and standards; this issue is to add header support to Kafka.

This JIRA is related to KIP found here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-82+-+Add+Record+Headers
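
As a rough sketch of the producer-side shape this eventually took (headers were 
added to the record classes by KIP-82 in Kafka 0.11.0), something along these lines 
attaches a header to a record; the topic, key, value and header name are 
placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HeadersExample {
    public static ProducerRecord<String, String> withTraceHeader(String topic) {
        ProducerRecord<String, String> record =
                new ProducerRecord<>(topic, "order-123", "{\"state\":\"FILLED\"}");
        // Headers are simple key -> byte[] pairs carried alongside the payload.
        record.headers().add("trace-id", "abc-123".getBytes(StandardCharsets.UTF_8));
        return record;
    }
}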




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-3846) Connect record types should include timestamps

2016-08-24 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435842#comment-15435842
 ] 

Michael Andre Pearce (IG) commented on KAFKA-3846:
--

This looks great. Have any documents been updated to show how we configure pieces 
such as the TimeBasedPartitioner mentioned here 
http://docs.confluent.io/3.0.0/connect/connect-hdfs/docs/configuration_options.html
to use this?

Also, is it possible to store this timestamp (and other metadata) in the HDFS/Hive 
record? From the config options in Connect it seems we can only store the payload.

> Connect record types should include timestamps
> --
>
> Key: KAFKA-3846
> URL: https://issues.apache.org/jira/browse/KAFKA-3846
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Affects Versions: 0.10.0.0
>Reporter: Ewen Cheslack-Postava
>Assignee: Shikhar Bhushan
>  Labels: needs-kip
> Fix For: 0.10.1.0
>
>
> Timestamps were added to records in the previous release, however this does 
> not get propagated automatically to Connect because it uses custom wrappers  
> to add fields and rename some for clarity.
> The addition of timestamps should be trivial, but can be really useful (e.g. 
> in sink connectors that would like to include timestamp info if available but 
> when it is not stored in the value).
> This is public API so it will need a KIP despite being very uncontentious.
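
As a small sketch of what the addition makes possible on the sink side (assuming 
the timestamp accessor this issue adds to Connect's record wrappers; the class and 
method names below are illustrative):

import org.apache.kafka.connect.sink.SinkRecord;

public class SinkTimestamps {
    // record.timestamp() is the field this issue propagates into Connect; it may
    // be null when the upstream message predates timestamps.
    static String describe(SinkRecord record) {
        return String.format("topic=%s offset=%d timestamp=%s",
                record.topic(), record.kafkaOffset(), record.timestamp());
    }
}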



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2260) Allow specifying expected offset on produce

2016-06-02 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311853#comment-15311853
 ] 

Michael Andre Pearce (IG) commented on KAFKA-2260:
--

Is this still active? It would be a real shame if this were left to die, and a 
shame it didn't make it into 0.10; the concept and the neatness of the solution are 
very useful. We would have many uses for this.

> Allow specifying expected offset on produce
> ---
>
> Key: KAFKA-2260
> URL: https://issues.apache.org/jira/browse/KAFKA-2260
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ben Kirwin
>Assignee: Ewen Cheslack-Postava
>Priority: Minor
> Attachments: KAFKA-2260.patch, expected-offsets.patch
>
>
> I'd like to propose a change that adds a simple CAS-like mechanism to the 
> Kafka producer. This update has a small footprint, but enables a bunch of 
> interesting uses in stream processing or as a commit log for process state.
> h4. Proposed Change
> In short:
> - Allow the user to attach a specific offset to each message produced.
> - The server assigns offsets to messages in the usual way. However, if the 
> expected offset doesn't match the actual offset, the server should fail the 
> produce request instead of completing the write.
> This is a form of optimistic concurrency control, like the ubiquitous 
> check-and-set -- but instead of checking the current value of some state, it 
> checks the current offset of the log.
> h4. Motivation
> Much like check-and-set, this feature is only useful when there's very low 
> contention. Happily, when Kafka is used as a commit log or as a 
> stream-processing transport, it's common to have just one producer (or a 
> small number) for a given partition -- and in many of these cases, predicting 
> offsets turns out to be quite useful.
> - We get the same benefits as the 'idempotent producer' proposal: a producer 
> can retry a write indefinitely and be sure that at most one of those attempts 
> will succeed; and if two producers accidentally write to the end of the 
> partition at once, we can be certain that at least one of them will fail.
> - It's possible to 'bulk load' Kafka this way -- you can write a list of n 
> messages consecutively to a partition, even if the list is much larger than 
> the buffer size or the producer has to be restarted.
> - If a process is using Kafka as a commit log -- reading from a partition to 
> bootstrap, then writing any updates to that same partition -- it can be sure 
> that it's seen all of the messages in that partition at the moment it does 
> its first (successful) write.
> There's a bunch of other similar use-cases here, but they all have roughly 
> the same flavour.
> h4. Implementation
> The major advantage of this proposal over other suggested transaction / 
> idempotency mechanisms is its minimality: it gives the 'obvious' meaning to a 
> currently-unused field, adds no new APIs, and requires very little new code 
> or additional work from the server.
> - Produced messages already carry an offset field, which is currently ignored 
> by the server. This field could be used for the 'expected offset', with a 
> sigil value for the current behaviour. (-1 is a natural choice, since it's 
> already used to mean 'next available offset'.)
> - We'd need a new error and error code for a 'CAS failure'.
> - The server assigns offsets to produced messages in 
> {{ByteBufferMessageSet.validateMessagesAndAssignOffsets}}. After this 
> changed, this method would assign offsets in the same way -- but if they 
> don't match the offset in the message, we'd return an error instead of 
> completing the write.
> - To avoid breaking existing clients, this behaviour would need to live 
> behind some config flag. (Possibly global, but probably more useful 
> per-topic?)
> I understand all this is unsolicited and possibly strange: happy to answer 
> questions, and if this seems interesting, I'd be glad to flesh this out into 
> a full KIP or patch. (And apologies if this is the wrong venue for this sort 
> of thing!)
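
As a language-level illustration of the check-and-set semantics described above (no 
Kafka APIs involved; all names are made up for the example), the append only 
succeeds when the caller's expected offset matches the offset the log would assign 
next:

import java.util.ArrayList;
import java.util.List;

public class ExpectedOffsetLog {
    private final List<String> messages = new ArrayList<>();

    // Appends 'message' only if 'expectedOffset' equals the offset the log would
    // assign next; returns the assigned offset, or -1 on a CAS failure (analogous
    // to the proposed new error code).
    public synchronized long appendIfAt(long expectedOffset, String message) {
        long nextOffset = messages.size();
        if (expectedOffset != nextOffset) {
            return -1;
        }
        messages.add(message);
        return nextOffset;
    }
}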



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)