Hello all,

Currently Hedwig uses *ledgers* to store messages for a topic. This requires lots of metadata operations when a hub server takes ownership of a topic. These metadata operations are:
1. Read the topic persistence info for the topic. (1 metadata read operation)
2. Close the last opened ledger. (1 metadata read operation, 2 metadata write operations)
3. Create a new ledger to write to. (1 metadata write operation)
4. Update the topic persistence info for the topic to track the new ledger. (1 metadata write operation)

So there are at least 2 metadata read operations and 4 metadata write operations when acquiring a topic. If a hub server that owns lots of topics restarts, it would introduce a spike of accesses to the metadata storage (e.g. ZooKeeper).

Hedwig's current design originates from the ledger's *"write once, read many"* semantics:

1. The ledger id is generated by BookKeeper. Hedwig needs to record the ledger id in extra places, which introduces extra metadata accesses.
2. No more entries can be written to a ledger after it is closed => so Hedwig has to create a new ledger to write new entries after the ownership of a topic changes (e.g. hub server failure, topic release).
3. A ledger's entries can be *deleted* only by deleting the whole ledger => so Hedwig has to keep switching ledgers, so that consumed entries can be reclaimed by *deleting* a ledger once all subscribers have consumed its entries.

I propose two new APIs in BookKeeper, accompanied by "re-open, append" semantics, for high-performance metadata access and easier metadata management for applications.

    public void openLedger(String ledgerName, DigestType digestType, byte[] passwd, Mode mode);

*Mode* indicates the access mode of a ledger, which is one of *O_CREATE*, *O_APPEND*, *O_RDONLY*.

- O_CREATE: create a new ledger with the given ledger name. If a ledger with that name already exists, the creation fails. Similar to createLedger now.
- O_APPEND: open the ledger with the given ledger name and continue writing entries.
- O_RDONLY: open the ledger without changing any state, just reading the entries already persisted. Similar to openLedgerNoRecovery now.

*ledgerName* indicates the name of a ledger.
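
The three access modes above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not BookKeeper code: `NamedLedgerStore` and `Handle` are hypothetical names, the digest type and password are left out, the sketch returns a handle for convenience, and it assumes that *O_APPEND* and *O_RDONLY* fail when no ledger with the given name exists (the proposal does not say what happens in that case).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Access modes named in the proposal.
enum Mode { O_CREATE, O_APPEND, O_RDONLY }

// Hypothetical in-memory stand-in for a store of named ledgers.
// None of these types are real BookKeeper classes.
final class NamedLedgerStore {
    // ledger name -> entries persisted so far
    private final Map<String, List<String>> ledgers = new HashMap<>();

    // Handle to an opened ledger; writable unless opened with O_RDONLY.
    final class Handle {
        private final List<String> entries;
        private final boolean writable;

        Handle(List<String> entries, boolean writable) {
            this.entries = entries;
            this.writable = writable;
        }

        // Appends an entry and returns its entry id.
        long addEntry(String data) {
            if (!writable) {
                throw new IllegalStateException("ledger opened read-only");
            }
            entries.add(data);
            return entries.size() - 1;
        }

        String readEntry(long entryId) {
            return entries.get((int) entryId);
        }
    }

    Handle openLedger(String ledgerName, Mode mode) {
        List<String> existing = ledgers.get(ledgerName);
        if (mode == Mode.O_CREATE) {
            // Creation fails if the name is already taken.
            if (existing != null) {
                throw new IllegalStateException("ledger already exists: " + ledgerName);
            }
            List<String> created = new ArrayList<>();
            ledgers.put(ledgerName, created);
            return new Handle(created, true);
        }
        // Assumption: O_APPEND and O_RDONLY require an existing ledger.
        if (existing == null) {
            throw new IllegalStateException("no such ledger: " + ledgerName);
        }
        // O_APPEND continues writing after the last entry; O_RDONLY only reads.
        return new Handle(existing, mode == Mode.O_APPEND);
    }
}
```

In this model a hub that takes over a topic would call `openLedger(name, Mode.O_APPEND)` on the existing ledger and keep appending, instead of closing the old ledger and creating a new one.
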
Users can pick any name they like, so they can manage their ledgers in their own way (e.g. by introducing a namespace over them), instead of having BookKeeper generate ledger ids for them. (In most cases, the application needs to find another place to store the generated ledger id, which is really bad practice.)

    public void shrink(long endEntryId, boolean force) throws BKException;

*Shrink* means cutting the entries from *startEntryId* to *endEntryId* (endEntryId is non-inclusive). *startEntryId* is implicit in the ledger metadata: it is 0 for a ledger that has never been shrunk, otherwise it is the *endEntryId* of the previous valid shrink.

The 'force' flag indicates whether to issue a garbage collection request after moving *startEntryId* to *endEntryId*. If the flag is true, we issue a garbage collection request to notify the bookie servers to do garbage collection; otherwise, we just move *startEntryId* to *endEntryId*.

This feature might be useful for some applications. Take Hedwig for example: we could leverage this feature to avoid storing subscriber state for topics that have only one subscriber each. Each time a certain number of messages has been consumed, we move the start point forward with *shrink(entryId, false)*. After several more messages have been consumed, we garbage collect them with *shrink(entryId, true)*.

Using *shrink*, an application can reclaim the disk space occupied by a ledger without creating a new ledger and deleting the old one.

These two operations are based on two mechanisms: one is 'session fencing', and the other is 'improved garbage collection' (BOOKKEEPER-464).

Details are in the gist https://gist.github.com/4520260 .

I will start working on some drafts based on this idea to demonstrate its correctness.

Comments and discussion are welcome.

-Sijie
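
As an illustration of the *shrink* semantics described above, here is a minimal in-memory sketch. It rests on assumptions that are not part of the proposal: `ShrinkableLedger` is a hypothetical class, a "valid" shrink is taken to mean that *endEntryId* lies between the current *startEntryId* and one past the last written entry, and garbage collection is modelled as immediate removal rather than a request sent to bookie servers.

```java
import java.util.TreeMap;

// Hypothetical model of a ledger that supports shrink(endEntryId, force).
// Not a BookKeeper class.
final class ShrinkableLedger {
    // entry id -> data still held on "disk"
    private final TreeMap<Long, String> stored = new TreeMap<>();
    private long startEntryId = 0; // 0 for a ledger that was never shrunk
    private long nextEntryId = 0;

    long addEntry(String data) {
        stored.put(nextEntryId, data);
        return nextEntryId++;
    }

    // Cuts entries in [startEntryId, endEntryId); endEntryId is non-inclusive.
    void shrink(long endEntryId, boolean force) {
        // Assumption: a shrink may only move the start point forward,
        // and not past the end of the ledger.
        if (endEntryId < startEntryId || endEntryId > nextEntryId) {
            throw new IllegalArgumentException("invalid endEntryId: " + endEntryId);
        }
        startEntryId = endEntryId;
        if (force) {
            // Stands in for the garbage collection request:
            // drop every entry below the new start point.
            stored.headMap(startEntryId, false).clear();
        }
    }

    String readEntry(long entryId) {
        if (entryId < startEntryId || entryId >= nextEntryId) {
            throw new IllegalArgumentException("entry not readable: " + entryId);
        }
        return stored.get(entryId);
    }

    long startEntryId() {
        return startEntryId;
    }

    // Number of entries still occupying space.
    int storedEntryCount() {
        return stored.size();
    }
}
```

With this model, `shrink(entryId, false)` only advances the start point, while a later `shrink(entryId, true)` also frees the space of everything cut so far.
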
