[
https://issues.apache.org/jira/browse/KAFKA-631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558971#comment-13558971
]
Jay Kreps commented on KAFKA-631:
---------------------------------
I did some testing on the I/O throttling and verified that it does indeed
maintain the expected I/O rate. There are two gotchas. First, you can't check
this with iostat, because the OS batches up writes and then flushes them out
asynchronously at a rate greater than what we requested. Second, since the
limit applies to reads and writes combined, a limit of 5MB/sec will lead to the
offset map building happening at exactly 5MB/sec, but the cleaning will be
closer to 2.5MB/sec. Cleaning involves first reading in messages and then
writing them back out, so 1MB of cleaning does 2MB of I/O (assuming 100%
retention).
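
To make the arithmetic above concrete, here is a minimal sketch in Python. It
is not the throttler from the patch; the function name effective_rates and the
retention parameter are hypothetical, and it assumes every byte read or
written counts equally against one shared budget.

```python
def effective_rates(io_limit_mb_per_sec, retention=1.0):
    """Return (offset_map_rate, cleaning_rate) in MB of log processed per second.

    Assumption: reads and writes are charged against a single combined limit.
    `retention` is the fraction of bytes read that survive cleaning and are
    written back out (1.0 means nothing is discarded).
    """
    if not 0.0 <= retention <= 1.0:
        raise ValueError("retention must be between 0.0 and 1.0")

    # Building the offset map only reads, so 1 MB of log costs 1 MB of I/O.
    offset_map_rate = io_limit_mb_per_sec / 1.0

    # Cleaning reads every byte and writes back the retained fraction,
    # so 1 MB of log costs (1 + retention) MB of I/O.
    cleaning_rate = io_limit_mb_per_sec / (1.0 + retention)

    return offset_map_rate, cleaning_rate


if __name__ == "__main__":
    # The case from the comment: 5MB/sec limit, 100% retention.
    print(effective_rates(5.0))  # (5.0, 2.5)
```

With lower retention the cleaning rate moves back toward the configured limit,
since fewer bytes are written for each byte read.
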
> Implement log compaction
> ------------------------
>
> Key: KAFKA-631
> URL: https://issues.apache.org/jira/browse/KAFKA-631
> Project: Kafka
> Issue Type: New Feature
> Components: core
> Affects Versions: 0.8.1
> Reporter: Jay Kreps
> Assignee: Jay Kreps
> Attachments: KAFKA-631-v1.patch, KAFKA-631-v2.patch
>
>
> Currently Kafka has only one way to bound the space of the log, namely by
> deleting old segments. The policy that controls which segments are deleted
> can be configured based either on the number of bytes to retain or the age of
> the messages. This makes sense for event or log data that has no notion of a
> primary key. However, lots of data has a primary key and consists of updates
> by primary key. For this data it would be nice to be able to ensure that the
> log contained at least the last version of every key.
> As an example, say that the Kafka topic contains a sequence of User Account
> messages, each capturing the current state of a given user account. Rather
> than simply discarding old segments, since the set of user accounts is
> finite, it might make more sense to delete individual records that have been
> made obsolete by a more recent update for the same key. This would ensure
> that the topic contained at least the current state of each record.
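
The retention rule described in the issue (keep at least the latest record for
each key) can be sketched in a few lines of Python. This is an illustration
only, not the code in the attached patches: the compact function and the
(offset, key, value) tuple layout are assumptions made for the example. It
mirrors the two phases mentioned in the comment, building an offset map and
then cleaning.

```python
def compact(log):
    """Return the records of `log` that are the latest for their key.

    `log` is a list of (offset, key, value) tuples in offset order.
    This tuple layout is hypothetical and chosen for illustration.
    """
    # Phase 1: build a map from each key to the offset of its last record.
    last_offset = {}
    for offset, key, _value in log:
        last_offset[key] = offset

    # Phase 2: copy a record forward only if no later record has the same key.
    return [
        (offset, key, value)
        for offset, key, value in log
        if last_offset[key] == offset
    ]


if __name__ == "__main__":
    user_accounts = [
        (0, "alice", "email=a@old.example"),
        (1, "bob", "email=b@example"),
        (2, "alice", "email=a@new.example"),
    ]
    for record in compact(user_accounts):
        print(record)
```

In the example, the record at offset 0 is dropped because offset 2 carries a
newer state for the same key, while the surviving records keep their original
offsets and order.
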