[ 
https://issues.apache.org/jira/browse/FLUME-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841493#comment-13841493
 ] 

Hari Shreedharan commented on FLUME-2155:
-----------------------------------------

Nice work! That makes sense - and nice find too. I was indeed looking at an 
off-heap data structure as a possible solution, but didn't know of one or have 
any experience using one. I don't currently have a channel where I can see the 
copy being terribly slow, but I think we can take this one step at a time. If 
we can fix the issue you found - where searching for takes without matching 
puts is taking too long - then we will actually know whether the copy is a 
problem (once this patch is in, any remaining slowness in replay would point 
to the copy). 

I am also wondering if MapDB can be used to make the searches themselves 
faster. The patch you provided fixes the specific issue of missing puts for 
takes (possibly because the old files got deleted), but if we use MapDB as an 
index, we can find the position of the FEP in the queue and remove it from the 
queue too (and then copy or compact at the end - either way is fine). It looks 
like MapDB can give us a map instead of a set, which is what this case needs 
to make searches faster.
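A minimal sketch of that index idea, using a plain java.util.HashMap as a 
stand-in for a MapDB-backed (possibly off-heap) map - the pointer IDs, class, 
and method names here are hypothetical and not Flume's actual FlumeEventQueue 
API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an auxiliary map (which a MapDB map could back
// off-heap) finds a pointer's position in the replay queue in O(1) instead
// of scanning; removed slots are tombstoned and compacted once at the end.
public class IndexedReplayQueue {
    private final List<Long> queue = new ArrayList<>();       // event pointers, in order
    private final Map<Long, Integer> index = new HashMap<>(); // pointer -> position

    public void put(long pointer) {
        index.put(pointer, queue.size());
        queue.add(pointer);
    }

    // O(1) lookup; tombstone the slot instead of shifting every later pointer.
    public boolean take(long pointer) {
        Integer pos = index.remove(pointer);
        if (pos == null) {
            return false; // a take with no matching put seen so far
        }
        queue.set(pos, null);
        return true;
    }

    // Single compaction pass once replay finishes.
    public List<Long> compact() {
        List<Long> live = new ArrayList<>();
        for (Long p : queue) {
            if (p != null) {
                live.add(p);
            }
        }
        return live;
    }
}
```

The point of the sketch is only that each take becomes a constant-time map 
lookup plus a tombstone, and the expensive move happens once, at the end.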

> Improve replay time
> -------------------
>
>                 Key: FLUME-2155
>                 URL: https://issues.apache.org/jira/browse/FLUME-2155
>             Project: Flume
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>         Attachments: 10000-20000, 100000-110000, 300000-310000, 
> 700000-710000, FLUME-2155-initial.patch, FLUME-2155.patch, 
> FLUME-FC-SLOW-REPLAY-1.patch, FLUME-FC-SLOW-REPLAY-FIX-1.patch, 
> SmartReplay.pdf, SmartReplay1.1.pdf, fc-test.patch
>
>
> File Channel has scaled so well that people now run channels with sizes in 
> the hundreds of millions of events. It turns out that replay can be extremely 
> slow even between checkpoints at this scale, because the remove() method in 
> FlumeEventQueue moves every pointer that follows the one being removed (one 
> remove causes 99 million+ moves for a channel of 100 million!). There are 
> several ways of improving this. One is to do the moves at the end of replay - 
> a sort of compaction. Another is to use the fact that all removes happen from 
> the top of the queue: move the first "k" events out to a HashSet and remove 
> from there - we can find k using the write id of the last checkpoint and the 
> current one. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)
