Joel Koshy created KAFKA-1911:
---------------------------------
Summary: Log deletion on stopping replicas should be async
Key: KAFKA-1911
URL: https://issues.apache.org/jira/browse/KAFKA-1911
Project: Kafka
Issue Type: Bug
Components: log, replication
Reporter: Joel Koshy
Assignee: Jay Kreps
Fix For: 0.8.3
If a StopReplicaRequest sets delete=true, we do a file.delete on the file
message sets while handling the request. I was under the impression that this
is fast, but that does not seem to be the case.
During a partition reassignment in our cluster, the local time for a
StopReplicaRequest was nearly 30 seconds (localTime:29190 in the request log):
{noformat}
Completed request:Name: StopReplicaRequest; Version: 0; CorrelationId: 467; ClientId: ; DeletePartitions: true; ControllerId: 1212; ControllerEpoch: 53 from client/...:45964;totalTime:29191,requestQueueTime:1,localTime:29190,remoteTime:0,responseQueueTime:0,sendTime:0
{noformat}
This ties up one API thread for the duration of the request.
In our case, the queue times for other requests also went up. Producers to the
partition that had just been deleted on the old leader took a while to refresh
their metadata (see KAFKA-1303) and eventually ran out of retries on some
messages, leading to data loss.
I think the log deletion in this case should be fully asynchronous. However, we
need to handle the case where a broker responds immediately to the
StopReplicaRequest but then goes down after deleting only some of the log
segments.
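
One possible shape for this, as a rough sketch only and not a proposal for the
actual patch: mark the partition directory for deletion with a single rename on
the request thread, delete the files on a background thread, and on broker
startup finish any deletion that a crash interrupted. Everything named here
(AsyncLogDeleter, the ".delete" suffix, scheduleDelete, recover) is hypothetical
and is not existing Kafka code. The sketch also assumes the rename stays on the
same filesystem, so that it can be atomic.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of Kafka's log manager. */
public class AsyncLogDeleter implements AutoCloseable {
    // Assumed marker suffix for directories that are pending deletion.
    static final String DELETE_SUFFIX = ".delete";

    // Single daemon thread that performs the slow file deletions.
    private final ExecutorService deleter = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "async-log-deleter");
        t.setDaemon(true);
        return t;
    });

    /** Fast path: one rename on the request thread, deletion in the background. */
    public Future<?> scheduleDelete(Path logDir) throws IOException {
        Path marked = logDir.resolveSibling(logDir.getFileName() + DELETE_SUFFIX);
        Files.move(logDir, marked, StandardCopyOption.ATOMIC_MOVE);
        return deleter.submit(() -> deleteRecursively(marked));
    }

    /** Startup path: finish deletions that were interrupted by a crash. */
    public void recover(Path dataDir) throws IOException {
        try (DirectoryStream<Path> marked =
                 Files.newDirectoryStream(dataDir, "*" + DELETE_SUFFIX)) {
            for (Path dir : marked) {
                deleteRecursively(dir);
            }
        }
    }

    static void deleteRecursively(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order visits children before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        deleter.shutdown();
    }
}
```

With something like this, the stop-replica handler would only pay for the
rename before responding. If the broker dies after deleting only some segments,
the directory still carries the marker suffix, so recover() can remove the rest
on restart instead of the partial log being loaded as a live partition.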