Lucas Bradstreet created KAFKA-9393:
---------------------------------------

             Summary: DeleteRecords triggers extreme lock contention for large 
partition directories
                 Key: KAFKA-9393
                 URL: https://issues.apache.org/jira/browse/KAFKA-9393
             Project: Kafka
          Issue Type: Improvement
    Affects Versions: 2.4.0, 2.3.0, 2.2.0
            Reporter: Lucas Bradstreet


DeleteRecords, which is frequently used by KStreams, triggers a 
Log.maybeIncrementLogStartOffset call. That call invokes 
kafka.log.ProducerStateManager.listSnapshotFiles, which in turn calls 
java.io.File.listFiles on the partition dir. Listing this directory can take 
multiple seconds for partitions with many small segments (e.g. 20000). 
This causes lock contention on the log, and if produce requests are also 
arriving for the same log, a majority of request handler threads can become 
blocked waiting for the DeleteRecords call to finish.
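
To make the cost concrete, here is a minimal standalone sketch (not Kafka code) of a directory listing performed while a lock is held. The class name, the helper names, the ".snapshot" suffix filter and the 20-digit file naming are assumptions made for illustration. The point is only that File.listFiles has to walk every entry in the directory, so the time spent holding the lock grows with the number of segment files.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not the Kafka implementation.
final class SnapshotScanSketch {

    // Stand-in for a snapshot listing: scans the whole directory and keeps
    // only names ending in ".snapshot" (suffix assumed for illustration).
    static File[] listSnapshotFiles(File partitionDir) {
        File[] found = partitionDir.listFiles((dir, name) -> name.endsWith(".snapshot"));
        return found == null ? new File[0] : found;
    }

    // Builds a temp directory with two files per "segment" plus one snapshot,
    // so the listing has many unrelated entries to walk through.
    static File makePartitionDir(int segments) {
        try {
            Path dir = Files.createTempDirectory("partition-sketch");
            for (int i = 0; i < segments; i++) {
                String base = String.format("%020d", i * 100L);
                Files.createFile(dir.resolve(base + ".log"));
                Files.createFile(dir.resolve(base + ".index"));
            }
            Files.createFile(dir.resolve(String.format("%020d", 0L) + ".snapshot"));
            return dir.toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Times one scan while holding logLock; any other thread that needs the
    // same lock (e.g. an append) would be blocked for this whole duration.
    static long timeScanUnderLockNanos(File partitionDir, Object logLock) {
        synchronized (logLock) {
            long start = System.nanoTime();
            listSnapshotFiles(partitionDir);
            return System.nanoTime() - start;
        }
    }
}
```

Calling timeScanUnderLockNanos on directories of increasing size (for example makePartitionDir(1000) versus makePartitionDir(20000)) gives a rough feel for how long the lock would be held. Actual numbers depend heavily on the filesystem and page cache.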

I believe this is a problem going back to the initial implementation of the 
transactional producer, but I need to confirm how far back it goes.

One possible solution is to maintain a producer state snapshot aligned to the 
log segment, and simply delete it whenever we delete a segment. This would 
ensure that we never have to perform a directory scan.
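
A rough sketch of that idea, under stated assumptions: the class and method names below are hypothetical, and the in-memory map keyed by segment base offset is just one way to track segment-aligned snapshots. It is not the existing ProducerStateManager API. Deleting a segment becomes a single map removal that yields the snapshot file to delete, with no directory listing involved.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch, not the Kafka implementation.
final class SegmentAlignedSnapshots {

    // Segment base offset -> snapshot file name (naming assumed for illustration).
    private final ConcurrentSkipListMap<Long, String> snapshots = new ConcurrentSkipListMap<>();

    // Record a snapshot aligned to a newly rolled segment.
    void onSegmentRolled(long baseOffset) {
        snapshots.put(baseOffset, String.format("%020d.snapshot", baseOffset));
    }

    // When a segment is deleted, look up its snapshot directly by base offset.
    // Returns the snapshot file name to delete, or null if none was tracked.
    String onSegmentDeleted(long baseOffset) {
        return snapshots.remove(baseOffset);
    }

    int size() {
        return snapshots.size();
    }
}
```

The caller would delete the returned file itself. The lookup cost depends on the number of tracked snapshots (logarithmic for a skip list), not on how many files sit in the partition dir.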



