Joel Koshy created KAFKA-2664:
---------------------------------

             Summary: Adding a new metric with several pre-existing metrics is 
very expensive
                 Key: KAFKA-2664
                 URL: https://issues.apache.org/jira/browse/KAFKA-2664
             Project: Kafka
          Issue Type: Bug
            Reporter: Joel Koshy
             Fix For: 0.9.0.1


I know the summary sounds expected, but we recently ran into a socket server 
request queue backup that I suspect was caused by a combination of improperly 
implemented applications that reconnect with a different (random) client-id 
each time, and the fact that for quotas we now register a new quota metric-set 
for each client-id.
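
For context, a rough sketch of what per-client-id registration against the 
client metrics library could look like; the sensor name, group, tag, and stat 
below are illustrative assumptions, not what the quota manager actually 
registers:

import java.util.Collections;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;

// Rough sketch only: per-client-id quota sensor registration. The names and
// the Rate stat are illustrative, not the broker's actual quota metrics.
public class PerClientQuotaMetricsSketch {
    private final Metrics metrics = new Metrics();

    public Sensor getOrCreateQuotaSensor(String clientId) {
        Sensor sensor = metrics.getSensor("produce-throttle-" + clientId);
        if (sensor == null) {
            // Every previously unseen client-id adds new sensors and metrics
            // to the shared metrics registry.
            sensor = metrics.sensor("produce-throttle-" + clientId);
            sensor.add(new MetricName("byte-rate", "produce-quota",
                    "bytes/sec for client " + clientId,
                    Collections.singletonMap("client-id", clientId)),
                    new Rate());
        }
        return sensor;
    }
}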

So here is what happened: a broker went down and a handful of other brokers 
started seeing queue times go up significantly. This caused the request queue 
to back up, which caused socket timeouts and a further deluge of reconnects. 
The only way we could get out of this was to fire-wall the broker and downgrade 
to a version without quotas (although I think it would also have worked to just 
restart the broker).

My guess is that there were a ton of pre-existing client-id metrics. I don’t 
know for sure, but I’m basing that on the fact that there were several new 
unique client-ids showing up in the public access logs and request local times 
for fetches started going up inexplicably. (It would have been useful to have a 
metric for the number of metrics.) It turns out that in the above scenario 
(with, say, 50k pre-existing client-ids), the average local time for a fetch 
can go up to the order of 50-100 ms (at least in tests on a Linux box), largely 
due to the time taken to create new metrics; that’s because we use a 
copy-on-write map underneath, so every insertion copies the entire map. If 
enough clients (say, hundreds) reconnect at the same time with new client-ids, 
that can cause the request queues to start backing up and the overall queuing 
system to become unstable; the line starts to spill out of the building.
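
For what it’s worth, here is a minimal, self-contained sketch of that cost 
profile. It uses a simplified stand-in for the copy-on-write map (every put 
copies the full backing map); the 50k pre-existing entries and the 10 metrics 
per new metric-set are illustrative assumptions, not measurements from the 
incident:

import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a copy-on-write map: every put() copies the whole
// backing map, so an insert with N pre-existing entries costs O(N).
public class CopyOnWriteMapCostSketch {
    private volatile Map<String, Object> map = new HashMap<>();

    public synchronized void put(String key, Object value) {
        Map<String, Object> copy = new HashMap<>(map); // full copy on every write
        copy.put(key, value);
        map = copy;
    }

    public static void main(String[] args) {
        CopyOnWriteMapCostSketch registry = new CopyOnWriteMapCostSketch();
        // Assumption: 50k pre-existing per-client-id metrics.
        for (int i = 0; i < 50_000; i++) {
            registry.put("client-" + i + "/byte-rate", new Object());
        }
        // Assumption: one new client-id registers a metric-set of 10 metrics.
        long start = System.nanoTime();
        for (int i = 0; i < 10; i++) {
            registry.put("new-client/metric-" + i, new Object());
        }
        System.out.printf("registering one new metric-set took %.1f ms%n",
                (System.nanoTime() - start) / 1e6);
    }
}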

I think this is a fairly new scenario with quotas - i.e., I don’t think the 
creation rate of our older per-X metrics (per-topic, for example) would ever 
come close to this.

To be clear, the clients are doing the wrong thing here, but I think the 
broker can and should protect itself adequately against such rogue scenarios.


