[ 
https://issues.apache.org/jira/browse/KAFKA-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5829:
----------------------------
    Priority: Critical  (was: Major)

> Speedup broker startup after unclean shutdown by reducing unnecessary 
> snapshot files deletion
> ---------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5829
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5829
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Dong Lin
>            Assignee: Dong Lin
>            Priority: Critical
>             Fix For: 1.0.0
>
>
> The current Kafka implementation causes slow startup after an unclean 
> shutdown. Loading a partition can take 10X or more the time it actually 
> needs. Here is an explanation with an example:
> - Say we have a partition of 20 segments, each holding 250 messages, with 
> offsets starting at 0. Each message is 1 MB.
> - The broker experiences a hard kill and the index file of the first segment 
> is corrupted.
> - When the broker starts up and loads the first segment, it detects that the 
> index of the first segment is corrupted, so it calls 
> `log.recoverSegment(...)` to recover this segment. This method calls 
> `stateManager.truncateAndReload(...)`, which deletes the snapshot files whose 
> offset is larger than the base offset of the first segment. Thus all snapshot 
> files are deleted.
> - To rebuild the snapshot files, `log.loadSegmentFiles(...)` has to read 
> every message in this partition, even in segments whose log and index files 
> are not corrupted. This increases the time to load this partition by more 
> than an order of magnitude.
> In order to address this issue, one simple solution is not to delete snapshot 
> files whose offset is larger than the given offset if only the index files 
> need to be rebuilt. More specifically, we should not need to rebuild the 
> producer state offset file unless the log file itself is corrupted or 
> truncated.
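
The example in the description can be turned into a rough count of how many messages are read from the log under the current behavior and under the proposed one. The following is a toy model, not Kafka code: `Segment`, `messages_read_on_load`, and the `keep_snapshots_on_index_rebuild` flag are hypothetical names, and the model assumes one snapshot per segment, taken at the segment's end offset, and that an intact segment is re-read only when its snapshot is missing.

```python
from dataclasses import dataclass

# Numbers taken from the example in the ticket description.
SEGMENT_COUNT = 20
MESSAGES_PER_SEGMENT = 250
MESSAGE_SIZE_MB = 1


@dataclass
class Segment:
    """Hypothetical stand-in for a log segment; not Kafka's class."""
    base_offset: int
    index_corrupted: bool = False
    log_corrupted: bool = False


def messages_read_on_load(segments, keep_snapshots_on_index_rebuild):
    """Count messages read from the log while loading one partition.

    Assumption: before the crash there is one snapshot per segment,
    taken at that segment's end offset.
    """
    snapshots = {s.base_offset + MESSAGES_PER_SEGMENT for s in segments}
    read = 0
    for seg in segments:
        end_offset = seg.base_offset + MESSAGES_PER_SEGMENT
        if seg.index_corrupted or seg.log_corrupted:
            # Recovering a segment scans its log once.
            read += MESSAGES_PER_SEGMENT
            if seg.log_corrupted or not keep_snapshots_on_index_rebuild:
                # Drop every snapshot whose offset is larger than the
                # base offset of the segment being recovered.
                snapshots = {o for o in snapshots if o <= seg.base_offset}
        elif end_offset not in snapshots:
            # Snapshot is gone, so producer state must be rebuilt by
            # re-reading this otherwise healthy segment.
            read += MESSAGES_PER_SEGMENT
    return read


def example_partition():
    """20 segments starting at offset 0; only the first index is corrupted."""
    segs = [Segment(base_offset=i * MESSAGES_PER_SEGMENT)
            for i in range(SEGMENT_COUNT)]
    segs[0].index_corrupted = True
    return segs


current = messages_read_on_load(example_partition(), False)
proposed = messages_read_on_load(example_partition(), True)
print(f"current:  {current} messages, {current * MESSAGE_SIZE_MB} MB")
print(f"proposed: {proposed} messages, {proposed * MESSAGE_SIZE_MB} MB")
```

Under these assumptions the current behavior reads all 5000 messages (about 5000 MB), while keeping the snapshots when only an index needs rebuilding reads just the 250 messages of the first segment, a 20X difference for this partition. If the first segment's log file itself is corrupted, both variants read the whole partition, which matches the condition stated in the description.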



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
