Re: [Proposal] Support "re-open, append" semantic in BookKeeper.

Sijie Guo Mon, 14 Jan 2013 21:06:20 -0800

> I need to review it because I'm not sure why we currently need 3
> accesses to the metadata store.


Let me clarify the 3 accesses for a close operation.

1) Read the ledger metadata. (1 metadata read)
2) If the ledger is in the OPEN state, write the ledger metadata with
state IN_RECOVERY. (1 metadata write)
3) Run the ledger recovery procedure.
4) Update the last entry id and change the state to CLOSED. (1 metadata write)
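To make the count concrete, here is a minimal Java sketch of steps 1)-4) against a hypothetical in-memory metadata store that counts its accesses. `CountingMetadataStore`, `RecoveryClose` and the `recoveredLastEntryId` parameter are illustrative names, not BookKeeper classes; step 3) is stubbed out because it talks to the bookies rather than to the metadata store.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical metadata store that only counts reads/writes; not a BookKeeper API.
class CountingMetadataStore {
    enum State { OPEN, IN_RECOVERY, CLOSED }

    static final class LedgerMetadata {
        State state;
        long lastEntryId;
        LedgerMetadata(State state, long lastEntryId) {
            this.state = state;
            this.lastEntryId = lastEntryId;
        }
    }

    private final Map<Long, LedgerMetadata> ledgers = new HashMap<>();
    int reads = 0;
    int writes = 0;

    void put(long ledgerId, LedgerMetadata md) { ledgers.put(ledgerId, md); }

    LedgerMetadata read(long ledgerId) {
        reads++;
        LedgerMetadata md = ledgers.get(ledgerId);
        // Return a copy so callers must write changes back explicitly.
        return new LedgerMetadata(md.state, md.lastEntryId);
    }

    void write(long ledgerId, LedgerMetadata md) {
        writes++;
        ledgers.put(ledgerId, new LedgerMetadata(md.state, md.lastEntryId));
    }

    State stateOf(long ledgerId) { return ledgers.get(ledgerId).state; }
}

class RecoveryClose {
    // recoveredLastEntryId stands in for the outcome of step 3 (reading from bookies).
    static void closeForRecovery(CountingMetadataStore store, long ledgerId, long recoveredLastEntryId) {
        CountingMetadataStore.LedgerMetadata md = store.read(ledgerId);   // step 1: one read
        if (md.state == CountingMetadataStore.State.CLOSED) {
            return; // already closed, nothing to recover
        }
        if (md.state == CountingMetadataStore.State.OPEN) {
            md.state = CountingMetadataStore.State.IN_RECOVERY;
            store.write(ledgerId, md);                                    // step 2: one write
        }
        // step 3: ledger recovery would run here against the bookies
        md.lastEntryId = recoveredLastEntryId;
        md.state = CountingMetadataStore.State.CLOSED;
        store.write(ledgerId, md);                                        // step 4: one write
    }
}
```

For a ledger that is still OPEN this yields 1 read and 2 writes; dropping step 2) would leave 1 read and 1 write.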

I am not sure whether we could get rid of step 2). But at least, we
still need 1 read and 1 write for a close operation. (NOTE: "close" here
does not mean creating a ledger and then closing it; it is the *close*
performed when a ledger is opened for recovery.)

> According to your analysis, the size of the spike is linear on the number
> of topics the faulty hub owns. Your proposal just changes the constant, but
> it is still linear, yes?

Mathematically, it is still linear, but in practice it would be much
better. And the number of topics is a factor we cannot change for an
application that requires so many topics.

> In fact, isn't it the case that you can do at least some of what you're
> proposing with managed ledgers?

Actually, managed ledgers do not cover the whole thing. A managed ledger is
essentially the equivalent of Hedwig's BookKeeper persistence manager
combined with its subscription manager (cursors). Managed ledgers still have
the same problems I raised in the proposal.

The big problem with the current BookKeeper is that applications (either
Hedwig or managed ledgers) need to find extra places to record the ledger
ids generated by BookKeeper. This is not efficient and also duplicates the
metadata storage; for a system facing a large number of topics or ledgers,
it is the Achilles' heel.

> It adds a number of new concurrent scenarios, increasing the complexity
> of what we expose, and it violates one principle that we used when first
> designing this system

Could you point out which concurrent scenarios and complexity the proposal
adds? I am not sure what they are.

I took care to reuse existing concepts with minimal changes, extending the
API to provide more flexibility and efficiency for applications. I don't
think it violates the principle of keeping the system a bare minimum. For
example, I don't add a cursor concept to the ledger, and a ledger can still
be read many times. How to shrink a ledger is up to the application: Hedwig
or managed ledgers would record their own cursors (subscriber states) and
shrink a ledger when they decide to.
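As an illustration of the application choosing the shrink point itself, the sketch below computes the `endEntryId` to pass to the proposed `shrink(endEntryId, force)` from the subscribers' cursors. `ShrinkPointCalculator` is a hypothetical helper, and it assumes each cursor is the id of the last entry that subscriber consumed (-1 if none).

```java
import java.util.Collection;

// Hypothetical application-side helper; not part of BookKeeper or Hedwig.
class ShrinkPointCalculator {
    /**
     * Returns the exclusive endEntryId for the proposed shrink(endEntryId, force).
     * Every entry below the returned id has been consumed by all subscribers.
     */
    static long shrinkPoint(Collection<Long> lastConsumedBySubscriber) {
        if (lastConsumedBySubscriber.isEmpty()) {
            // With no subscribers, retention is an application policy decision.
            throw new IllegalArgumentException("no subscribers");
        }
        long minConsumed = Long.MAX_VALUE;
        for (long consumed : lastConsumedBySubscriber) {
            minConsumed = Math.min(minConsumed, consumed); // slowest subscriber wins
        }
        return minConsumed + 1; // endEntryId is non-inclusive
    }
}
```

The slowest subscriber bounds the shrink, which is exactly the state Hedwig or a managed ledger already tracks; the ledger itself never needs a cursor concept.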


On Mon, Jan 14, 2013 at 1:06 AM, Flavio Junqueira <[email protected]> wrote:

> Thanks for your proposal, Sijie. The cost of owning a topic seems indeed
> high, and your motivation to raise that point seems to be that upon hub
> crashes there will be a spike of accesses to the metadata store. According
> to your analysis, the size of the spike is linear on the number of topics
> the faulty hub owns. Your proposal just changes the constant, but it is
> still linear, yes?
>
> Although I think I understand the motivation for your proposal, I'm not
> really in favor of extending the BK API like this. It adds a number of new
> concurrent scenarios, increasing the complexity of what we expose, and it
> violates one principle that we used when first designing this system, which
> is keeping functionality to a bare minimum so that we can implement it
> efficiently. Other functionality needed can be implemented on top. In fact,
> isn't it the case that you can do at least some of what you're proposing
> with managed ledgers?
>
> One point you raised that concerns me a bit is the cost of a close
> operation. I need to review it because I'm not sure why we currently need 3
> accesses to the metadata store. It should really be just one in the regular
> case (no concurrent attempts to close a ledger through ledger recovery). I
> agree though that we need to think about how to deal efficiently with the
> Hedwig issue you're raising.
>
> -Flavio
>
> On Jan 14, 2013, at 4:53 AM, Sijie Guo <[email protected]> wrote:
>
> > Hello all,
> >
> > Currently Hedwig uses *ledgers* to store messages for a topic. It requires
> > lots of metadata operations when a hub server takes ownership of a topic.
> > These metadata operations are:
> >
> >   1. Read the topic persistence info for the topic. (1 metadata read operation)
> >   2. Close the last opened ledger. (1 metadata read operation, 2 metadata
> >   write operations)
> >   3. Create a new ledger to write to. (1 metadata write operation)
> >   4. Update the topic persistence info for the topic to track the new ledger.
> >   (1 metadata write operation)
> >
> > So there are at least 2 metadata read operations and 4 metadata write
> > operations when acquiring a topic. If a hub server that owns lots of topics
> > restarts, it introduces a spike of accesses to the metadata storage
> > (e.g. ZooKeeper).
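The size of the spike is the per-topic cost above multiplied by the number of topics the hub owned. A small Java tally of the listed operations (the class name and the topic count in the example are hypothetical):

```java
// Tallies the metadata accesses listed above for acquiring one topic.
class TopicAcquisitionCost {
    static final int READS_PER_TOPIC =
            1 /* read topic persistence info */
          + 1 /* close last ledger: read ledger metadata */;
    static final int WRITES_PER_TOPIC =
            2 /* close last ledger: IN_RECOVERY + CLOSED */
          + 1 /* create new ledger */
          + 1 /* update topic persistence info */;

    // Lower bound on metadata accesses when all of a hub's topics are re-acquired.
    static long totalAccesses(long topics) {
        return topics * (READS_PER_TOPIC + WRITES_PER_TOPIC);
    }
}
```

For example, `totalAccesses(10_000)` returns 60,000: a hub owning 10,000 topics would cause at least 60,000 metadata operations when its topics are re-acquired.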
> >
> > Hedwig's current design follows from the ledger's *"write once, read
> > many"* semantic:
> >
> >   1. Ledger ids are generated by BookKeeper. Hedwig needs to record ledger
> >   ids in extra places, which introduces extra metadata accesses.
> >   2. A ledger cannot accept any more entries after it is closed => so
> >   Hedwig has to create a new ledger to write new entries after the ownership
> >   of a topic changes (e.g. hub server failure, topic release).
> >   3. A ledger's entries can only be *deleted* by deleting the whole ledger
> >   => so Hedwig has to roll over to new ledgers, so that consumed entries can
> >   be reclaimed by *deleting* a ledger once all subscribers have consumed it.
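Point 3 can be sketched as follows: since deleting a whole ledger is the only way to free space, the application must track which closed ledgers lie entirely behind the slowest subscriber. The `TopicLedgers` model, with a per-ledger `lastSeqId` and a `minConsumedSeqId` input, is a hypothetical simplification of Hedwig's topic persistence info, not its actual data format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of per-topic persistence info: an ordered list of closed
// ledgers, each holding messages up to (and including) lastSeqId.
class TopicLedgers {
    static final class LedgerRange {
        final long ledgerId;
        final long lastSeqId;
        LedgerRange(long ledgerId, long lastSeqId) {
            this.ledgerId = ledgerId;
            this.lastSeqId = lastSeqId;
        }
    }

    /** Ids of closed ledgers whose messages every subscriber has consumed. */
    static List<Long> deletableLedgers(List<LedgerRange> closedRanges, long minConsumedSeqId) {
        List<Long> result = new ArrayList<>();
        for (LedgerRange r : closedRanges) {
            if (r.lastSeqId <= minConsumedSeqId) {
                result.add(r.ledgerId);
            } else {
                break; // ranges are ordered, so later ledgers are not fully consumed either
            }
        }
        return result;
    }
}
```

Every ledger id in this list had to be stored somewhere outside BookKeeper (point 1), which is where the extra metadata accesses come from.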
> >
> > I propose two new APIs, accompanied by a "re-open, append" semantic in
> > BookKeeper, for high-performance metadata access and easy metadata
> > management for applications.
> >
> > public void openLedger(String ledgerName, DigestType digestType,
> > byte[] passwd, Mode mode);
> >
> > *Mode* indicates the access mode of a ledger, which is one of *O_CREATE*,
> > *O_APPEND* or *O_RDONLY*.
> >
> >   - O_CREATE: create a new ledger with the given ledger name. If a ledger
> >   with that name already exists, the creation fails. Similar to createLedger
> >   now.
> >   - O_APPEND: re-open the ledger with the given ledger name and continue
> >   writing entries to it.
> >   - O_RDONLY: open the ledger with the given ledger name without changing
> >   any state, only reading entries already persisted. Similar to
> >   openLedgerNoRecovery now.
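The rules of the three modes can be modeled in memory as below. This is a hypothetical sketch: `NamedLedgerModel` and `Handle` are not proposed class names, and it ignores digests, passwords, bookies and the session fencing that O_APPEND would need in order to stop a previous writer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory model of the proposed open modes only; not BookKeeper code.
class NamedLedgerModel {
    enum Mode { O_CREATE, O_APPEND, O_RDONLY }

    private final Map<String, List<byte[]>> ledgers = new HashMap<>();

    /** Opens a ledger by name, or throws if the mode's precondition fails. */
    Handle open(String ledgerName, Mode mode) {
        boolean exists = ledgers.containsKey(ledgerName);
        switch (mode) {
            case O_CREATE: // like createLedger, but the caller picks the name
                if (exists) throw new IllegalStateException("ledger exists: " + ledgerName);
                ledgers.put(ledgerName, new ArrayList<byte[]>());
                return new Handle(ledgerName, true);
            case O_APPEND: // re-open and keep writing after existing entries
                if (!exists) throw new IllegalStateException("no such ledger: " + ledgerName);
                return new Handle(ledgerName, true);
            case O_RDONLY: // like openLedgerNoRecovery: read without changing state
                if (!exists) throw new IllegalStateException("no such ledger: " + ledgerName);
                return new Handle(ledgerName, false);
            default:
                throw new AssertionError(mode);
        }
    }

    final class Handle {
        private final String name;
        private final boolean writable;

        Handle(String name, boolean writable) {
            this.name = name;
            this.writable = writable;
        }

        /** Appends an entry and returns its id (ids start at 0). */
        long append(byte[] data) {
            if (!writable) throw new IllegalStateException("opened O_RDONLY");
            List<byte[]> entries = ledgers.get(name);
            entries.add(data);
            return entries.size() - 1;
        }

        long lastEntryId() { return ledgers.get(name).size() - 1; }
    }
}
```

A name such as `hedwig/topicA` (a hypothetical naming scheme) shows how an application could build its own namespace instead of storing generated ledger ids.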
> >
> > *ledgerName* indicates the name of a ledger. Users can pick any name they
> > like, so they can manage their ledgers in their own way, e.g. by introducing
> > namespaces over them, instead of BookKeeper generating ledger ids for them.
> > (In most cases, the application needs to find another place to store the
> > generated ledger id; this practice is really bad.)
> >
> > public void shrink(long endEntryId, boolean force) throws BKException;
> >
> > *Shrink* means cutting the entries from *startEntryId* to *endEntryId*
> > (*endEntryId* is non-inclusive). *startEntryId* is implicit in the ledger
> > metadata: it is 0 for a ledger that has never been shrunk, and otherwise
> > it is the *endEntryId* of the previous valid shrink.
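A minimal model of how *startEntryId* moves under this definition, assuming entries are numbered from 0 and that a shrink is valid only when `endEntryId` lies between the current *startEntryId* and one past the last written entry. That validity rule is an assumption here (the proposal only says "valid shrink"), and `ShrinkableLedgerMetadata` is a hypothetical name.

```java
// Hypothetical model of the metadata bookkeeping behind the proposed shrink().
class ShrinkableLedgerMetadata {
    private long startEntryId = 0; // 0 for a ledger that has never been shrunk
    private long lastEntryId = -1; // -1 means no entries written yet

    void recordAppend() { lastEntryId++; }

    long startEntryId() { return startEntryId; }

    /** Cuts entries [startEntryId, endEntryId); endEntryId is non-inclusive. */
    void shrink(long endEntryId) {
        // Assumed validity rule: never move backwards, never past the written entries.
        if (endEntryId < startEntryId || endEntryId > lastEntryId + 1) {
            throw new IllegalArgumentException("invalid endEntryId " + endEntryId);
        }
        startEntryId = endEntryId; // the next shrink starts from here
    }

    /** Entries remain readable in [startEntryId, lastEntryId]. */
    boolean isReadable(long entryId) {
        return entryId >= startEntryId && entryId <= lastEntryId;
    }
}
```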
> >
> > The *force* flag indicates whether to issue a garbage collection request
> > after moving *startEntryId* to *endEntryId*. If the flag is true, we issue a
> > garbage collection request to notify the bookie servers to do garbage
> > collection; otherwise, we just move *startEntryId* to *endEntryId*. This
> > feature might be useful for some applications. Take Hedwig for example: we
> > could leverage this feature to avoid storing subscriber state for topics
> > that have only one subscriber each. Each time a specific number of messages
> > has been consumed, we move the entry point with *shrink(entryId, false)*.
> > After several such moves, we garbage-collect the consumed messages with
> > *shrink(entryId, true)*.
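The single-subscriber Hedwig pattern can be sketched as a client-side model of the *force* flag: non-forced shrinks only move *startEntryId*, while a forced shrink also issues one garbage collection request covering everything cut since the previous one. `ForceShrinkModel` and the recorded `gcRequests` list are hypothetical; in the proposal the request would go to the bookies via the improved garbage collection mechanism.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of shrink(endEntryId, force); GC requests are only recorded here.
class ForceShrinkModel {
    long startEntryId = 0;  // moved by every shrink
    long collectedUpTo = 0; // a GC request has been issued for all entries below this
    final List<long[]> gcRequests = new ArrayList<>(); // each element is a [from, to) range

    void shrink(long endEntryId, boolean force) {
        if (endEntryId < startEntryId) {
            throw new IllegalArgumentException("cannot move startEntryId backwards");
        }
        startEntryId = endEntryId; // cheap: only the start point moves
        if (force && collectedUpTo < startEntryId) {
            // One request covers everything cut since the previous forced shrink.
            gcRequests.add(new long[] {collectedUpTo, startEntryId});
            collectedUpTo = startEntryId;
        }
    }
}
```

With `shrink(100, false)`, `shrink(200, false)` and then `shrink(300, true)`, only the last call produces a garbage collection request, for the range [0, 300).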
> >
> > Using *shrink*, an application can reclaim the disk space occupied by a
> > ledger without creating a new ledger and deleting the old one.
> >
> > These two operations are based on two mechanisms: one is 'session fencing',
> > and the other is 'improved garbage collection (BOOKKEEPER-464)'.
> > Details are in the gist https://gist.github.com/4520260 . I will try to
> > start working on some drafts based on the idea to demonstrate its
> > correctness.
> >
> > Comments and discussions are welcome.
> > -Sijie
>
>
