[ https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitay Kufert updated KAFKA-9824:
--------------------------------
    Description: 
Hello,
 around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 2.3.1), and we started noticing a troubling behavior that we didn't see before:
  
 For no apparent reason, a specific partition on a specific consumer loses its offset and starts re-consuming the entire partition from the beginning (as far back as the retention goes).
  
 Messages appearing on the consumer (client):
{quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer 
clientId=consumer-fireAttributionConsumerGroup4-2, 
groupId=fireAttributionConsumerGroup4] Resetting offset for partition 
trackingSolutionAttribution-48 to offset 1216430527.
{quote}
{quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer 
clientId=consumer-fireAttributionConsumerGroup4-2, 
groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of range 
for partition trackingSolutionAttribution-48
{quote}
These are the broker logs from the same time (searched for "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4"):
{quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 
1222791065
  
 Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
  
 Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 in 
0 ms.
  
 Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
groups
  
 Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
groups
{quote}
Another way to see the same thing, from our monitoring (Datadog) of the partition offset:

!image-2020-04-06-13-14-47-014.png|width=530,height=152!

The recovery you can see is from when I manually reset the partition offset (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 --reset-offsets --to-datetime 'SOME DATE')
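For completeness, here is a minimal Java sketch (not what we actually ran; bootstrap address, partition and timestamp are placeholders) of the same timestamp-based reset done programmatically with KafkaConsumer#offsetsForTimes plus an explicit commit. Note that kafka-consumer-groups.sh only applies the reset when --execute is passed (otherwise it prints a dry-run plan), and, like the CLI, the commit only takes effect if the group has no active members:
{code:java}
import java.time.Instant;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ResetOffsetToTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fireAttributionConsumerGroup4");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // we commit explicitly below
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("trackingSolutionAttribution", 57);
        long resetTo = Instant.parse("2020-04-05T12:00:00Z").toEpochMilli(); // placeholder date

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            // Look up the earliest offset whose record timestamp is >= resetTo ...
            Map<TopicPartition, OffsetAndTimestamp> offsets =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, resetTo));
            OffsetAndTimestamp found = offsets.get(tp);
            if (found != null) {
                // ... and commit it for the group. The broker accepts this only
                // while the group is inactive, the same precondition the CLI has.
                consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(found.offset())));
            }
        }
    }
}
{code}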
  
 Any idea what could be causing this? It has happened to us at least 5 times since the upgrade, and before that I don't remember it ever happening to us.
  
 Topic config is left at the defaults, except the retention, which is manually set to 43200000 ms (12 hours).
 The topic has 60 partitions & a replication factor of 2.
  
 Consumer config:
{code:java}
ConsumerConfig values:
        allow.auto.create.topics = true
        auto.commit.interval.ms = 5000
        auto.offset.reset = earliest
        bootstrap.servers = [..........]
        check.crcs = true
        client.dns.lookup = default
        client.id =
        client.rack =
        connections.max.idle.ms = 540000
        default.api.timeout.ms = 60000
        enable.auto.commit = true
        exclude.internal.topics = true
        fetch.max.bytes = 52428800
        fetch.max.wait.ms = 500
        fetch.min.bytes = 1
        group.id = fireAttributionConsumerGroup4
        group.instance.id = null
        heartbeat.interval.ms = 10000
        interceptor.classes = []
        internal.leave.group.on.close = true
        isolation.level = read_uncommitted
        key.deserializer = class 
org.apache.kafka.common.serialization.ByteArrayDeserializer
        max.partition.fetch.bytes = 1048576
        max.poll.interval.ms = 300000
        max.poll.records = 500
        metadata.max.age.ms = 300000
        metric.reporters = []
        metrics.num.samples = 2
        metrics.recording.level = INFO
        metrics.sample.window.ms = 30000
        partition.assignment.strategy = [class 
org.apache.kafka.clients.consumer.RangeAssignor]
        receive.buffer.bytes = 65536
        reconnect.backoff.max.ms = 1000
        reconnect.backoff.ms = 50
        request.timeout.ms = 30000
        retry.backoff.ms = 100
        sasl.client.callback.handler.class = null
        sasl.jaas.config = null
        sasl.kerberos.kinit.cmd = /usr/bin/kinit
        sasl.kerberos.min.time.before.relogin = 60000
        sasl.kerberos.service.name = null
        sasl.kerberos.ticket.renew.jitter = 0.05
        sasl.kerberos.ticket.renew.window.factor = 0.8
        sasl.login.callback.handler.class = null
        sasl.login.class = null
        sasl.login.refresh.buffer.seconds = 300
        sasl.login.refresh.min.period.seconds = 60
        sasl.login.refresh.window.factor = 0.8
        sasl.login.refresh.window.jitter = 0.05
        sasl.mechanism = GSSAPI
        security.protocol = PLAINTEXT
        security.providers = null
        send.buffer.bytes = 131072
        session.timeout.ms = 30000
        ssl.cipher.suites = null
        ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
        ssl.endpoint.identification.algorithm = https
        ssl.key.password = null
        ssl.keymanager.algorithm = SunX509
        ssl.keystore.location = null
        ssl.keystore.password = null
        ssl.keystore.type = JKS
        ssl.protocol = TLS
        ssl.provider = null
        ssl.secure.random.implementation = null
        ssl.trustmanager.algorithm = PKIX
        ssl.truststore.location = null
        ssl.truststore.password = null
        ssl.truststore.type = JKS
        value.deserializer = class 
org.apache.kafka.common.serialization.StringDeserializer
{code}
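For context, the two settings above that interact in the client logs are enable.auto.commit=true (offsets committed every 5 seconds) and auto.offset.reset=earliest: when a fetch at the committed offset comes back "out of range", the consumer rewinds to the log start instead of failing, which matches the "Resetting offset for partition trackingSolutionAttribution-48 to offset 1216430527" line. A minimal sketch of how the consumer is built (simplified from our actual code; the bootstrap address is a placeholder):
{code:java}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FireAttributionConsumerSketch {
    public static KafkaConsumer<byte[], String> build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fireAttributionConsumerGroup4");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true"); // commit every auto.commit.interval.ms (5s)
        // "earliest": on an out-of-range fetch the consumer rewinds to the log start.
        // "none" would throw OffsetOutOfRangeException instead of silently rewinding.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<byte[], String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("trackingSolutionAttribution"));
        return consumer;
    }
}
{code}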
 
 Broker config (server.properties) - We have 9 brokers:
{code:java}
num.io.threads=4
num.network.threads=64
socket.send.buffer.bytes=33554432
socket.receive.buffer.bytes=33554432
socket.request.max.bytes=104857600
num.partitions=2
log.retention.hours=1
log.segment.bytes=536870912
log.retention.check.interval.ms=60000
zookeeper.connection.timeout.ms=1000000
controlled.shutdown.enable=true
auto.leader.rebalance.enable=true
log.cleaner.enable=true
log.cleaner.min.cleanable.ratio=0.1
log.cleaner.threads=2
log.cleanup.policy=delete
log.cleaner.delete.retention.ms=86400000
log.cleaner.io.max.bytes.per.second=1.7976931348623157E308
num.recovery.threads.per.data.dir=1
log.flush.interval.messages=9223372036854775807
message.max.bytes=10000000
replica.fetch.max.bytes=10000000
default.replication.factor=2
delete.topic.enable=true
unclean.leader.election.enable=false
compression.type=snappy
inter.broker.protocol.version=2.4
{code}
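If it helps, here is a small AdminClient sketch (again simplified, with a placeholder bootstrap address) that dumps the effective topic configuration, including whether a value such as retention.ms comes from the topic override or the broker default:
{code:java}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class ShowTopicConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "trackingSolutionAttribution");
            Config cfg = admin.describeConfigs(Collections.singleton(topic))
                              .all().get().get(topic);
            // Print every entry plus its source (topic override vs broker default).
            cfg.entries().forEach(e ->
                    System.out.printf("%s = %s (%s)%n", e.name(), e.value(), e.source()));
        }
    }
}
{code}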
 
 Let me know if you need more data from me.
  
 Thanks

> Consumer loses partition offset and resets post 2.4.1 version upgrade
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-9824
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9824
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>            Reporter: Nitay Kufert
>            Priority: Major
>         Attachments: image-2020-04-06-13-14-47-014.png
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
