> If this observation is correct, can we acquire those topics lazily instead of eagerly?
Could you add entries to a ledger without opening the ledger?

> Another approach would be to group topics into buckets and have hubs acquiring buckets of topics, all topics of a bucket writing to the same ledgers. There is an obvious drawback to this approach when reading messages for a given topic.

For topic ownership, we could proceed in groups as I proposed in BOOKKEEPER-508. For topic persistence info, we should not do that; otherwise it violates the contract that Hedwig made, and it adds complexity into Hedwig itself. Grouping could be considered on the application side, but if the application cannot do that, or finds it difficult, we still have to face this many topics to make Hedwig a scalable platform.

I don't like the framing 'the number of accesses to the metadata is still linear on the number of topics'. The metadata access problem is an 'entity locating' problem. Two categories of solutions are used to locate an entity: one is mapping, the other is computation.

Mapping means you need to find a place to store this entity, or to store the location of this entity, like the HDFS NameNode storing the locations of blocks. If an entity exists in this way, you need at least one access to find it, no matter whether it is stored in ZooKeeper, in memory, embedded in bookie servers, or grouped somewhere. You cannot change the 'linear' relationship between the number of accesses and the number of entities. To make it clearer: in HDFS, an 'entity' is a data block; if you have the entity's location, reading data is fast, but crossing a block boundary requires a metadata lookup. Similarly, in Hedwig, an 'entity' is a topic; once the topic is owned, reading messages from this topic is fast. But in either case, the number of accesses is still linear in the number of entities.

Computation means I can compute the location of an entity from its name, without looking it up anywhere, e.g. consistent hashing (see the illustrative sketch at the end of this mail).

Currently, BookKeeper only gives back a ledger id, so Hedwig has to find a place to record this ledger id. This is a 'mapping'. So there are two levels of mapping: one from topic name to Hedwig's metadata, the other from Hedwig metadata to ledger metadata. You cannot get rid of the 'linear' limitation without breaking the mapping between Hedwig metadata and ledger metadata.

What I proposed is still 'linear', but it consolidates these two pieces of metadata into one and breaks the mapping between Hedwig metadata and ledger metadata. Although Hedwig still requires one lookup from topic name to metadata, it gives us the opportunity to get rid of this lookup in the future, since we could now use a 'ledger name' to access its entries. An idea like the 'placement group' in Ceph could be adopted in BookKeeper in the future, grouping ledgers to share the same ensemble configuration; it would reduce the number of mappings from ledger name to ledger metadata, but that should be a different topic to discuss. All I am saying here is meant to demonstrate that the proposal I made could move this improvement in the right direction.

On Wed, Jan 16, 2013 at 1:18 PM, Flavio Junqueira <[email protected]> wrote:

> Thanks for the great discussion so far. I have a couple of comments below:
>
> On Jan 16, 2013, at 8:00 PM, Ivan Kelly <[email protected]> wrote:
>
> > On Tue, Jan 15, 2013 at 09:33:35PM -0800, Sijie Guo wrote:
> >>> Originally, it was meant to have a number of long lived subscriptions, over which a lot of data travelled.
Now the > >>> load has flipped to a large number of short lived subscriptions, over > >>> which relatively little data travels. > >> > >> The topic discussed here doesn't relate to hedwig subscriptions, it just > >> about how hedwig use ledgers to store its messages. Even there are no > >> subscriptions, the problem is still there. The restart of a hub server > >> carrying large number of topics would hit the metadata storage with many > >> accesses. The hit is a hub server acquiring a topic, no matter the > >> subscription is long lived or short lived. after topic is acquired, > >> following accesses are in memory, which doesn't cause any > >> performance issue. > > I was using topics and subscriptions to mean the same thing here due > > to the usecase we have in Yahoo where they're effectively the same > > thing. But yes, I should have said topic. But my point still > > stands. Hedwig was designed to deal with fewer topics, which had a lot > > of data passing through them, rather than more topics, with very > > little data passing though them. This is why zk was consider suffient > > at that point, as tens of thousands of topics being recovered really > > isn't an issue for zk. The point I was driving at is that, the usecase > > has changed in a big way, so it may require a big change to handle it. > > > > About the restart of a hub, if I understand the problem, many topics will > be acquired concurrently, inducing the load spike you're mentioning. If > this observation is correct, can we acquire those topics lazily instead of > eagerly? > > >> But we should separate the capacity problem from the software problem. A > >> high performance and scalable metadata storage would help for resolving > >> capacity problem. but either implementing a new one or leveraging a high > >> performance one doesn't change the fact that it still need so many > metadata > >> accesses to acquire topic. A bad implementation causing such many > metadata > >> accesses is a software problem. If we had chance to improve it, why > >> not? > > I don't think the implementation is bad, but rather the assumptions, > > as I said earlier. The data:metadata ratio has completely changed > > completely. hedwig/bk were designed with a data:metadata ratio of > > something like 100000:1. What we're talking about now is more like 1:1 > > and therefore we need to be able to handle an order of magnitude more > > of metadata than previously. Bringing down the number of writes by an > > order of 2 or 3, while a nice optimisation, is just putting duct tape > > on the problem. > > > I love the duct tape analogy. The number of accesses to the metadata is > still linear on the number of topics, and reducing the number of accesses > per acquisition does not change the complexity. Perhaps distributing the > accesses over time, optimized or not, might be a good way to proceed > overall. Another approach would be to group topics into buckets and have > hubs acquiring buckets of topics, all topics of a bucket writing to the > same ledgers. There is an obvious drawback to this approach when reading > messages for a given topic. > > > > >> > >>> The ledger can still be read many times, but you have removed the > >> guarantee that what is read each time will be the same thing. > >> > >> How we guarantee a reader's behavior when a ledger is removed at the > same > >> time? We don't guarantee it right now, right? It is similar thing for a > >> 'shrink' operation which remove part of entries, while 'delete' > operation > >> removes whole entries? 
> >> > >> And if I remembered correctly, readers only see the same thing when a > >> ledger is closed. What I proposed doesn't volatile this contract. If a > >> ledger is closed (state is in CLOSED), an application can't re-open it. > If > >> a ledger isn't closed yet, an application can recover previous state and > >> continue writing entries using this ledger. for applications, they could > >> still use 'create-close-create' style to use ledgers, or evolve to new > api > >> for efficiency smoothly, w/o breaking any backward compatibility. > > Ah, yes, I misread your proposal originally, I thought the reopen was > > working with an already closed ledger. > > > > On a side note, the reason we have an initial write for fencing, is > > that when the reading client(RC) fences, the servers in the ensemble > > start returning an error to the writing client (WC). At the moment we > > don't distinguish between a fencing error and a i/o error for > > example. So WC will try to rebuild a the ensemble by replacing the > > erroring servers. Before writing to the new ensemble, it has to update > > the metadata, and at this point it will see that it has been > > fenced. With a specific FENCED error, we could avoid this write. This > > makes me uncomfortable though. What happens if the fenced server fails > > between being fenced and WC trying to write? It will get a normal i/o > > error. And will try to replace the server. Since the metadata has not > > been changed, nothing will stop it, and it may be able to continue > > writing. I think this is also the case for the session fencing solution. > > > > -Ivan > >
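
To make the 'computation' category above concrete, here is a small, self-contained sketch of a consistent-hash ring that locates the hub for a topic purely by computing on the topic name, with no per-topic ownership record to look up. It is only an illustration under assumed names: the TopicRing class, the hub addresses, and the number of virtual nodes are made up and are not part of the Hedwig or BookKeeper code, and a real deployment would also have to handle hubs joining and leaving.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative only: locates the hub responsible for a topic by hashing the
// topic name onto a ring of hubs, instead of storing a topic-to-hub mapping.
// The same idea is what would let a 'ledger name' derived from the topic name
// replace a stored topic-to-ledger-id mapping.
public class TopicRing {

    private static final int VIRTUAL_NODES = 100;
    private final SortedMap<Long, String> ring = new TreeMap<Long, String>();

    // Place a hub on the ring at several virtual positions to spread load.
    public void addHub(String hubAddress) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(hubAddress + "#" + i), hubAddress);
        }
    }

    // Locate the hub for a topic purely by computation: hash the topic name
    // and walk clockwise to the next hub position on the ring.
    public String hubFor(String topic) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no hubs registered");
        }
        long h = hash(topic);
        SortedMap<Long, String> tail = ring.tailMap(h);
        Long key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(key);
    }

    // Derive a 64-bit ring position from the first 8 bytes of an MD5 digest.
    private static long hash(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xffL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        TopicRing ring = new TopicRing();
        ring.addHub("hub-1.example.com:4080");   // hypothetical hub addresses
        ring.addHub("hub-2.example.com:4080");
        System.out.println(ring.hubFor("topic-42"));
    }
}

No access to a metadata store is needed to answer hubFor(); the trade-off, as with any computation-based placement, is that topics get reassigned when the set of hubs changes.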
