[
https://issues.apache.org/jira/browse/FLUME-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742947#comment-13742947
]
Hari Shreedharan commented on FLUME-2155:
-----------------------------------------
Interestingly, I think we may be able to get rid of CheckpointRebuilder etc. In
the case of a full replay, we can optimize by not going to the queue to do
removes, thus not requiring a compaction: we simply remove from the putBuffer,
and if we don't find the event there, we buffer it (these are buffered as
pendingTakes in normal replay too). This would perform at least as well as fast
replay, but would require less memory, because we are reading the data in the
order it was written. Surprisingly, it also requires less memory than normal
replay, because the puts are buffered inside the queue anyway and takes not
found in the queue are buffered as pendingTakes. In the normal (non-full-replay)
case it requires more memory than normal replay, since we buffer the takes, but
this is likely to be acceptable.
[~brocknoland] Does that make sense? Remove CheckpointRebuilder and make this
the fast replay. I think this could also be made the default, because the
memory usage is not substantially higher.
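A minimal sketch of the full-replay path described above: takes are matched against buffered puts instead of calling remove() on the queue, and unmatched takes are held as pendingTakes. The class and method names (putBuffer, pendingTakes, replayPut, replayTake) are illustrative, not the actual FileChannel API.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposed full-replay optimization:
// no queue remove() per take, so no O(n) pointer shifting and no
// compaction pass at the end.
public class FullReplaySketch {
    // puts seen so far, keyed by event pointer, kept in write order
    private final Map<Long, String> putBuffer = new LinkedHashMap<>();
    // takes whose matching put has not been read yet
    private final Set<Long> pendingTakes = new HashSet<>();

    void replayPut(long pointer, String event) {
        // if a take for this pointer was already buffered, they cancel out
        if (!pendingTakes.remove(pointer)) {
            putBuffer.put(pointer, event);
        }
    }

    void replayTake(long pointer) {
        // remove directly from the put buffer; if the put is not there,
        // buffer the take, as normal replay does with pendingTakes
        if (putBuffer.remove(pointer) == null) {
            pendingTakes.add(pointer);
        }
    }

    // events still committed to the channel after replay
    int committedEventCount() {
        return putBuffer.size();
    }
}
```

Because the logs are read in write order, the putBuffer only ever holds events that are still in the channel, which is why this bounds memory by the channel's live size rather than the replayed log size.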
> Improve replay time
> -------------------
>
> Key: FLUME-2155
> URL: https://issues.apache.org/jira/browse/FLUME-2155
> Project: Flume
> Issue Type: Bug
> Reporter: Hari Shreedharan
> Assignee: Hari Shreedharan
> Attachments: 100000-110000, 10000-20000, 300000-310000,
> 700000-710000, fc-test.patch, SmartReplay1.1.pdf, SmartReplay.pdf
>
>
> File Channel has scaled so well that people now run channels with sizes in
> the hundreds of millions of events. It turns out replay can be extremely slow
> even between checkpoints at this scale, because the remove() method in
> FlumeEventQueue moves every pointer that follows the one being removed (one
> remove causes 99 million+ moves for a channel of 100 million!). There are
> several ways of improving this. One is to do the moves at the end of replay,
> sort of like a compaction. Another is to use the fact that all removes happen
> from the top of the queue: move the first "k" events out to a HashSet and
> remove from there; we can find k using the write IDs of the last checkpoint
> and the current one.
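The second idea quoted above can be sketched as follows. This is illustrative code, not Flume's FlumeEventQueue: it pulls the first k pointers into a HashSet so each replayed remove is O(1) instead of shifting every following element.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: since replayed removes all target the head of
// the queue, move the first k pointers into a HashSet up front and
// remove from the set instead of the queue. k would be derived from
// the write IDs of the last checkpoint and the current position.
public class HeadRemoveSketch {
    private final Queue<Long> queue = new ArrayDeque<>();
    private final Set<Long> head = new HashSet<>();

    void add(long pointer) {
        queue.add(pointer);
    }

    // pull the first k pointers out of the queue into the hash set
    void prepareForReplay(int k) {
        for (int i = 0; i < k && !queue.isEmpty(); i++) {
            head.add(queue.poll());
        }
    }

    // O(1), versus moving every pointer that follows the removed one
    boolean replayRemove(long pointer) {
        return head.remove(pointer);
    }

    // events still in the channel: surviving head entries plus
    // everything that was never moved out of the queue
    int size() {
        return head.size() + queue.size();
    }
}
```

After replay, the surviving head entries would be written back ahead of the remaining queue to restore ordering; that step is omitted here for brevity.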