[jira] [Commented] (FLUME-2155) Improve replay time

Hari Shreedharan (JIRA) Tue, 13 Aug 2013 20:11:44 -0700

    [ 
https://issues.apache.org/jira/browse/FLUME-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739172#comment-13739172
 ]


Hari Shreedharan commented on FLUME-2155:
-----------------------------------------

Relevant numbers from each run:
10000-20000:
{code}
Time in searches: 7489
Time in moves: 14021
Total # of moves: 99999999
{code}
100000-110000:
{code}
Time in searches: 68949
Time in moves: 134158
Total # of moves: 1000089999
{code}
300000-310000:
{code}
Time in searches: 206698
Time in moves: 433623
Total # of moves: 3000289999
{code}
700000-710000:
{code}
Time in searches: 461028
Time in moves: 425380
Total # of moves: 2950295000
{code}

As you can see, searches and moves are both expensive (moves are really just 
traversals that do some extra work). Searches cost more than moves in the 
700K-710K run because each search traverses more of the queue, while the moves 
are smarter: we just pull events up from the bottom. The files I attached show 
how much time each search and move takes.

The new algorithm reduces the traversals to 2 (one to mark the events and one 
for compaction) and caps the number of moves at the total number of events in 
the channel (in practice fewer: the bound is the highest index of a removed 
element).
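
To make the mark-then-compact idea concrete, here is a rough sketch. It is not the code in the attached patch: the class name MarkAndCompactSketch, the compact() method, the EMPTY sentinel and the plain long[] standing in for the queue's backing store are all assumptions made for illustration. It marks removed pointers in one pass, then slides the survivors toward the tail in a second pass, so nothing above the highest removed index ever moves.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the FlumeEventQueue implementation.
public class MarkAndCompactSketch {
    // Assumed sentinel for an unused slot.
    static final long EMPTY = 0L;

    /**
     * Removes every pointer in 'removed' from 'queue' using two traversals.
     * Returns {newHead, moves}: the index of the first live slot after
     * compaction and the number of slots that were actually rewritten.
     */
    static int[] compact(long[] queue, Set<Long> removed) {
        int n = queue.length;

        // Traversal 1: mark the slots whose pointer was removed.
        boolean[] dead = new boolean[n];
        for (int i = 0; i < n; i++) {
            dead[i] = removed.contains(queue[i]);
        }

        // Traversal 2: walk from the tail toward the head and slide each
        // surviving pointer into the highest free slot. Survivors above the
        // highest removed index are already in place and are not rewritten.
        int write = n - 1;
        int moves = 0;
        for (int read = n - 1; read >= 0; read--) {
            if (dead[read]) {
                continue;
            }
            if (write != read) {
                queue[write] = queue[read];
                moves++;
            }
            write--;
        }

        // Slots below the new head are now free.
        Arrays.fill(queue, 0, write + 1, EMPTY);
        return new int[] {write + 1, moves};
    }

    public static void main(String[] args) {
        long[] queue = {11, 12, 13, 14, 15, 16};
        Set<Long> removed = new HashSet<>(Arrays.asList(12L, 14L));
        int[] result = compact(queue, removed);
        System.out.println(Arrays.toString(queue)
                + " head=" + result[0] + " moves=" + result[1]);
    }
}
```

In this toy run the highest removed index is 3 and only 2 slots are rewritten, regardless of how many removes were requested. Removing the same pointers one at a time, shifting after each remove, would traverse the queue once per remove.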
                
> Improve replay time
> -------------------
>
>                 Key: FLUME-2155
>                 URL: https://issues.apache.org/jira/browse/FLUME-2155
>             Project: Flume
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>         Attachments: 100000-110000, 10000-20000, 300000-310000, 
> 700000-710000, fc-test.patch, SmartReplay.pdf, SmartReplay.pdf
>
>
> File Channel has scaled so well that people now run channels with sizes in 
> 100's of millions of events. Turns out, replay can be crazy slow even between 
> checkpoints at this scale - because of the remove() method in FlumeEventQueue 
> moving every pointer that follows the one being removed (1 remove causes 99 
> million+ moves for a channel of 100 million!). There are several ways of 
> improving - one being move at the end of replay - sort of like a compaction. 
> Another is to use the fact that all removes happen from the top of the queue, 
> so move the first "k" events out to hashset and remove from there - we can 
> find k using the write id of the last checkpoint and the current one. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
