[
https://issues.apache.org/jira/browse/KAFKA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Manikumar resolved KAFKA-1712.
------------------------------
Resolution: Fixed
Fixed via https://issues.apache.org/jira/browse/KAFKA-2511
> Excessive storage usage on newly added node
> -------------------------------------------
>
> Key: KAFKA-1712
> URL: https://issues.apache.org/jira/browse/KAFKA-1712
> Project: Kafka
> Issue Type: Bug
> Components: log
> Reporter: Oleg Golovin
> Priority: Major
>
> When a new node is added to the cluster, data starts replicating to it. The
> mtime of the newly created segments is set when the last message is written to
> them. Though replication is a prolonged process, let's assume (for simplicity
> of explanation) that their mtime is very close to the time when the new node
> was added.
> After the replication is done, new data will start to flow into this new
> node. Because the mtime of the initially replicated segments is close to `t1`
> (the time the node was added), none of them will be deleted before
> `t1 + log.retention.hours`. So after `log.retention.hours` the amount of data
> on the node will be 2 * daily_amount_of_data_in_kafka_node: the data
> replicated from the other nodes when the node was added, plus the data that
> arrived from `t1` to `t1 + log.retention.hours`. So by that time the node
> will have twice as much data as the other nodes.
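> To illustrate with hypothetical numbers (an assumption, not measurements from
> our cluster): if a node normally retains 1 TB under a 24-hour
> `log.retention.hours`, the new node first receives ~1 TB of replicated
> segments around `t1`, all with mtime ~ `t1`. Over the next 24 hours it
> accumulates another ~1 TB of fresh data while none of the initial 1 TB is
> eligible for deletion yet, so at `t1 + 24h` it briefly holds ~2 TB, twice the
> normal amount.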
> This poses a big problem for us, as our storage is sized to fit the normal
> amount of data (not twice that amount).
> In our particular case it poses another problem. We have an emergency segment
> cleaner which runs when storage is nearly full (>90%). We try to balance the
> amount of data so that it does not have to run and we can rely solely on
> Kafka's internal log deletion, but sometimes the emergency cleaner does run.
> It works this way (a rough Python sketch follows the list):
> - it gets all kafka segments for the volume
> - it filters out last segments of each partition (just to avoid unnecessary
> recreation of last small-size segments)
> - it sorts them by segment mtime
> - it changes the mtime of the first N segments (those with the lowest mtime)
> to 1, so they become really, really old. N is chosen to free a specified
> percentage of the volume (3% in our case). Kafka deletes these segments later
> (as they are very old).
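> Here is a minimal Python sketch of those steps. It is an illustration only,
> not our actual script; the log directory, the 3% target and the assumption
> that segment file names sort by base offset are all assumptions:
>
>     import collections
>     import glob
>     import os
>
>     LOG_DIR = "/var/kafka-logs"   # assumed location of the Kafka log volume
>     TARGET_FREE_FRACTION = 0.03   # free 3% of the volume, as described above
>
>     # all segment files on the volume
>     segments = glob.glob(os.path.join(LOG_DIR, "*", "*.log"))
>
>     # group by partition directory and drop the last (active) segment of each
>     by_partition = collections.defaultdict(list)
>     for path in segments:
>         by_partition[os.path.dirname(path)].append(path)
>     candidates = []
>     for paths in by_partition.values():
>         paths.sort()              # zero-padded names sort by base offset
>         candidates.extend(paths[:-1])
>
>     # sort the remaining segments by mtime, oldest first
>     candidates.sort(key=os.path.getmtime)
>
>     # set the mtime of the oldest N segments to 1 until the target is reached;
>     # Kafka's retention then sees them as ancient and deletes them
>     stat = os.statvfs(LOG_DIR)
>     to_free = stat.f_frsize * stat.f_blocks * TARGET_FREE_FRACTION
>     freed = 0
>     for path in candidates:
>         if freed >= to_free:
>             break
>         freed += os.path.getsize(path)
>         os.utime(path, (1, 1))    # atime and mtime set to 1 second after epoch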
> The emergency cleaner works very well, except for the case when the data has
> been replicated to a newly added node.
> In that case the segment mtime is the time the segment was replicated and does
> not reflect the real creation time of the original data stored in the segment.
> So the emergency cleaner will delete the segments with the lowest mtime, which
> may hold data that is much more recent than the data in other segments.
> This is not a big problem until we delete data which hasn't been fully
> consumed.
> In that case we lose data, and that makes it a big problem.
> Is it possible to retain the segment mtime during the initial replication to a
> new node?
> This would help avoid loading the new node with twice as much data as the
> other nodes have.
> Or maybe there are other ways to sort segments by data creation time (or
> something close to it)? For example, if
> https://issues.apache.org/jira/browse/KAFKA-1403 is implemented, we could take
> the time of the first message from the .index (a sketch of this idea follows).
> In our case this would help the emergency cleaner delete the truly oldest
> data.
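> As a sketch of that alternative, the only change to the cleaner script above
> would be the sort key, assuming some way to obtain a per-segment data creation
> time. The `first_message_time` helper below is hypothetical and would depend
> on what KAFKA-1403 actually exposes:
>
>     def first_message_time(path):
>         # hypothetical: return the timestamp of the first message in the
>         # segment, e.g. taken from the .index as suggested above
>         raise NotImplementedError
>
>     # sort by data creation time instead of file mtime
>     candidates.sort(key=first_message_time)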
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)