[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249732#comment-13249732
 ] 

Sijie Guo commented on BOOKKEEPER-212:
--------------------------------------

did some investigation on this issue, I found that it is caused due to ledger 
cache deletes index file first then closes it.

{code}
        FileInfo fi = null;
        try {
            fi = getFileInfo(ledgerId, null);
            fi.delete();
            fi.close(false);
        } finally {
            // should release use count
            // otherwise the file channel would not be closed.
            if (null != fi) {
                fi.release();
            }   
        }
{code}

in common case, the above code is OK. since we use *useCount*, we can ensure 
file existed if the useCount is not zero. but if the file info is evicted from 
ledger cache before, when ledger cache wants to delete such file info, it tries 
to get file info again but doesn't open the file channel, so fi#delete() will 
delete index file immediately. when we tried to close it again, it would fail 
due to FileNotFound exception.

fail to removed deleted ledger file info will exhaust openFileLimit, so 
following creating ledgers would fail because we don't have buffer to open new 
ledger file info.

the fix is quite simply, just need to change the order of fi#delete and 
fi#close. I would attach a test case and a patch later.
                
> Bookie stops responding when creating and deleting many ledgers
> ---------------------------------------------------------------
>
>                 Key: BOOKKEEPER-212
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-212
>             Project: Bookkeeper
>          Issue Type: Bug
>            Reporter: Flavio Junqueira
>            Priority: Blocker
>             Fix For: 4.1.0
>
>         Attachments: bookkeeper-server.log
>
>
> I have written down a short app to try to reproduce one problematic case 
> reported on the user list. The app does the following:
> # It creates initially a number of ledgers, say 2000;
> # Once it reaches 2000, for each new ledger it creates, it deletes the one at 
> the head of the list;
> # Before closing the ledger, it adds 5 entries of 1k, just to generate some 
> traffic for any given ledger.
> What I tried to achieve is to have thousands of active ledgers and delete new 
> ledgers as I create new ones. I'll post a link to my test code later. 
> At some point, one bookie stops responding. The bookie seems to be up, but it 
> is not responsive. Looking at the logs, this is what I see:
> {noformat}
> 2012-04-06 12:22:05,765 - INFO  [SyncThread:LedgerCacheImpl@682] - Ledger 
> 1726 is evicted from file info cache.
> 2012-04-06 12:22:05,769 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1727 is evicted from file info cache.
> 2012-04-06 12:22:05,772 - INFO  [SyncThread:LedgerCacheImpl@682] - Ledger 
> 1728 is evicted from file info cache.
> 2012-04-06 12:22:05,780 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1729 is evicted from file info cache.
> 2012-04-06 12:22:05,787 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1730 is evicted from file info cache.
> 2012-04-06 12:22:05,794 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1731 is evicted from file info cache.
> 2012-04-06 12:22:05,801 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1732 is evicted from file info cache.
> 2012-04-06 12:22:05,807 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1733 is evicted from file info cache.
> 2012-04-06 12:22:05,822 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1734 is evicted from file info cache.
> 2012-04-06 12:22:05,828 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1735 is evicted from file info cache.
> 2012-04-06 12:22:05,842 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1736 is evicted from file info cache.
> 2012-04-06 12:22:05,851 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1737 is evicted from file info cache.
> 2012-04-06 12:22:05,856 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1738 is evicted from file info cache.
> 2012-04-06 12:22:05,864 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1739 is evicted from file info cache.
> 2012-04-06 12:22:05,874 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1740 is evicted from file info cache.
> 2012-04-06 12:22:05,885 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1741 is evicted from file info cache.
> 2012-04-06 12:22:05,894 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1742 is evicted from file info cache.
> 2012-04-06 12:22:05,902 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - 
> Ledger 1743 is evicted from file info cache.
> 2012-04-06 12:22:05,987 - INFO  [GarbageCollectorThread:LedgerCacheImpl@682] 
> - Ledger 1744 is evicted from file info cache.
> 2012-04-06 12:22:05,987 - ERROR 
> [GarbageCollectorThread:GarbageCollectorThread$1@244] - Exception when 
> deleting the ledger index file on the Bookie: 
> java.io.IOException: /home/fpj/bk/current/1/b/10b.idx not found
>         at org.apache.bookkeeper.bookie.FileInfo.checkOpen(FileInfo.java:118)
>         at org.apache.bookkeeper.bookie.FileInfo.close(FileInfo.java:194) 
>         at 
> org.apache.bookkeeper.bookie.LedgerCacheImpl.deleteLedger(LedgerCacheImpl.java:619)
>  
>         at 
> org.apache.bookkeeper.bookie.GarbageCollectorThread$1.gc(GarbageCollectorThread.java:242)
>         at 
> org.apache.bookkeeper.meta.AbstractZkLedgerManager.doGc(AbstractZkLedgerManager.java:274)
>         at 
> org.apache.bookkeeper.meta.FlatLedgerManager.garbageCollectLedgers(FlatLedgerManager.java:168)
>         at 
> org.apache.bookkeeper.bookie.GarbageCollectorThread.doGcLedgers(GarbageCollectorThread.java:237)
>         at 
> org.apache.bookkeeper.bookie.GarbageCollectorThread.run(GarbageCollectorThread.java:206)
>  
> 2012-04-06 12:22:05,987 - INFO  [GarbageCollectorThread:LedgerCacheImpl@682] 
> - Ledger 1745 is evicted from file info cache.
> {noformat}
> There are lots of exceptions like that, but otherwise I don't see anything 
> that could make the bookie unresponsive. I'll upload the logs as well.   

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to