[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497139#comment-13497139
 ] 

Sijie Guo commented on BOOKKEEPER-249:
--------------------------------------

{quote}
Please go through the corner case in the ZKLedgerManager impl and causing data 
loss like, successfully updated the failed bookie's ledgers in 
'/ledgers/deleted/Bi' and while changing ensemble for auto-rereplication zk got 
disconnected. Now Bi will go ahead with ledger deletion if it rejoins
{quote}

First, to clarify: a zombie entry is an entry that was added to a bookie but 
can never be GC'ed. In the proposed algorithm, a zombie entry is an entry that 
is not referenced by the ledger metadata, since the proposed algorithm is a 
metadata-based GC algorithm.

Second, '/ledgers/deleted/Bi' is a kind of GC index: bookie -> ledgers.

Third, an ensemble change happens in the following two cases:

1) Changing the ensemble when a write fails for a writer, which updates the 
ledger metadata first and writes entries later. (In this case, the failed 
bookie would introduce zombie entries. E.g. BK1, BK2, BK3 changes to BK1, BK4, 
BK3; BK2 is the bookie that introduces zombie entries.)
2) Changing the ensemble during auto-replication, which writes entries first 
and updates the ledger metadata later. (In this case, both the failed bookie 
and the replacement bookie could introduce zombie entries. E.g. BK2 fails in 
the ensemble BK1, BK2, BK3 and we re-replicate the entries belonging to BK2 to 
BK4; both BK2 and BK4 could introduce zombie entries.)

For the proposed GC algorithm, we need to track the bookies that could 
introduce zombie entries. The idea is to add those bookies to the deleted 
index '/ledgers/deleted/Bi' when updating the ensemble.
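
To make the two cases concrete, here is a minimal sketch of which bookies get 
indexed on an ensemble change. The class GcIndexSketch, the Cause enum and the 
in-memory map are hypothetical stand-ins for '/ledgers/deleted/Bi', not 
existing BookKeeper code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the '/ledgers/deleted/Bi' index
// (bookie -> ledgers). Not part of BookKeeper.
class GcIndexSketch {
    enum Cause { WRITER_FAILURE, AUTO_REPLICATION }

    final Map<String, Set<Long>> index = new HashMap<>();

    // Bookies that may hold zombie entries once oldEnsemble becomes newEnsemble.
    static Set<String> zombieCandidates(List<String> oldEnsemble,
                                        List<String> newEnsemble,
                                        Cause cause) {
        // The replaced (failed) bookie is a candidate in both cases.
        Set<String> candidates = new TreeSet<>(oldEnsemble);
        candidates.removeAll(newEnsemble);
        if (cause == Cause.AUTO_REPLICATION) {
            // Entries are written before the metadata update, so the
            // replacement bookie is a candidate too.
            Set<String> added = new TreeSet<>(newEnsemble);
            added.removeAll(oldEnsemble);
            candidates.addAll(added);
        }
        return candidates;
    }

    // Records the ledger under every candidate bookie.
    void recordEnsembleChange(long ledgerId, List<String> oldEnsemble,
                              List<String> newEnsemble, Cause cause) {
        for (String bookie : zombieCandidates(oldEnsemble, newEnsemble, cause)) {
            index.computeIfAbsent(bookie, b -> new TreeSet<>()).add(ledgerId);
        }
    }
}
```

With the ensembles from the examples above, a writer failure indexes only BK2, 
while auto-replication indexes both BK2 and BK4.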

The GC thread goes through the deleted index and checks each ledger's 
metadata. It GCs a ledger only if the ledger is DELETED or doesn't exist, so I 
don't think it would cause data loss.
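
The check the GC thread makes can be sketched as below. FastGcSketch, 
LedgerState and the metadata map are assumptions for illustration only; in the 
real system the state would come from the ledger manager.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the fast GC check; not BookKeeper code.
class FastGcSketch {
    enum LedgerState { OPEN, CLOSED, DELETED }

    // Returns the ledgers in this bookie's index entry that are safe to GC.
    // A ledger missing from the metadata map models "metadata doesn't exist".
    static List<Long> collectible(Set<Long> indexedLedgers,
                                  Map<Long, LedgerState> metadata) {
        List<Long> result = new ArrayList<>();
        for (Long ledgerId : indexedLedgers) {
            LedgerState state = metadata.get(ledgerId);
            if (state == null || state == LedgerState.DELETED) {
                result.add(ledgerId);
            }
            // Any other state: leave the ledger alone, so a stale index
            // entry alone never removes live data.
        }
        return result;
    }
}
```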

But there is still a corner case in auto-rereplication that we need to take 
care of:

1) autorerep of L replaces A with B. Both A and B are added to the gc index.
2) The client deletes L while autorerep is still replicating entries from A to 
B.
3) B runs gc and GCs ledger L.
4) B would still have zombie entries, since autorerep keeps writing entries to 
B even after B has finished GCing ledger L.

One possible solution: when autorerep finishes replicating entries from A to B 
and updates the ensemble, it would see BadVersion or NoSuchLedgerExists, and it 
should then add B to the gc index again. But this makes the algorithm messy.
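
A rough sketch of that re-add step, assuming a versioned compare-and-set on the 
ledger metadata. RereplicationSketch, UpdateResult and both maps are 
hypothetical and only model the behaviour described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the re-add step; not BookKeeper code.
class RereplicationSketch {
    enum UpdateResult { OK, BAD_VERSION, NO_SUCH_LEDGER }

    // ledgerId -> metadata version; an absent key means the ledger was deleted.
    final Map<Long, Integer> metadataVersion = new HashMap<>();
    // bookie -> ledgers, standing in for '/ledgers/deleted/Bi'.
    final Map<String, Set<Long>> gcIndex = new HashMap<>();

    // Versioned compare-and-set on the ledger metadata.
    UpdateResult updateEnsemble(long ledgerId, int expectedVersion) {
        Integer current = metadataVersion.get(ledgerId);
        if (current == null) {
            return UpdateResult.NO_SUCH_LEDGER;
        }
        if (current != expectedVersion) {
            return UpdateResult.BAD_VERSION;
        }
        metadataVersion.put(ledgerId, current + 1);
        return UpdateResult.OK;
    }

    // Called once all entries have been copied to targetBookie.
    UpdateResult finishReplication(long ledgerId, int expectedVersion,
                                   String targetBookie) {
        UpdateResult result = updateEnsemble(ledgerId, expectedVersion);
        if (result != UpdateResult.OK) {
            // The copied entries may now be zombies: index the target again
            // so that its next gc run re-checks this ledger.
            gcIndex.computeIfAbsent(targetBookie, b -> new TreeSet<>())
                   .add(ledgerId);
        }
        return result;
    }
}
```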

The simple way to handle zombie entries, I believe, is to run different gc 
algorithms at different granularities. In most cases, we could run the 
proposed algorithm to do fast gc. When a bookie's disk space runs out (say, 
usage reaches a specific threshold), the bookie runs a FULL GC. The FULL GC is 
the POLLING-based gc algorithm, which takes the ledgers owned by the bookie, 
checks whether each ledger has been deleted, and GCs those that have. Zombie 
entries should be quite rare, and we have compaction to take care of disk 
space. Only when a bookie's disk runs out do we need a FULL GC to check whether 
zombie entries caused it.
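
The two-granularity idea could look roughly like this. TieredGcSketch, the 0.90 
threshold and the set-based ledger bookkeeping are all assumptions made for 
the sketch, not values or classes from BookKeeper.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of fast gc plus threshold-triggered FULL GC.
class TieredGcSketch {
    // Assumed disk-usage ratio at which a FULL GC is triggered.
    static final double FULL_GC_THRESHOLD = 0.90;

    // Fast gc looks only at indexed ledgers; FULL GC polls every local ledger.
    static Set<Long> ledgersToCheck(double diskUsage, Set<Long> indexed,
                                    Set<Long> local) {
        return diskUsage >= FULL_GC_THRESHOLD ? local : indexed;
    }

    // Removes from 'local' every checked ledger that is no longer live in
    // the metadata, and returns the removed ledger ids.
    static Set<Long> runGc(double diskUsage, Set<Long> indexed,
                           Set<Long> local, Set<Long> liveInMetadata) {
        Set<Long> removed = new TreeSet<>();
        for (Long ledgerId : ledgersToCheck(diskUsage, indexed, local)) {
            if (local.contains(ledgerId) && !liveInMetadata.contains(ledgerId)) {
                removed.add(ledgerId);
            }
        }
        local.removeAll(removed);
        return removed;
    }
}
```

In this model a ledger that was deleted but never indexed survives fast gc and 
is only reclaimed once disk usage crosses the threshold.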
 
> Revisit garbage collection algorithm in Bookie server
> -----------------------------------------------------
>
>                 Key: BOOKKEEPER-249
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-249
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>            Reporter: Sijie Guo
>             Fix For: 4.2.0
>
>         Attachments: gc_revisit.pdf
>
>
> Per discussion in BOOKKEEPER-181, it would be better to revisit garbage 
> collection algorithm in bookie server. so create a subtask to focus on it.
