Let me re-clarify the points in the proposal to make them clear:

1) Change ledger id to ledger name. Giving a name is a more natural way for human beings; id generation just eases the work of naming a ledger. I don't see any optimization or efficiency we have gained in BookKeeper so far that is based on a 'long' ledger id.

2) Session fencing rather than fencing. Same working mechanism as fencing, same correctness guarantee.

3) Remove a range of entries rather than delete a ledger. Same working mechanism as the improved gc proposed in BOOKKEEPER-464.

A sketch of what such an API could look like is below.
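This is purely a hypothetical illustration of the three points (name-based ledgers, session fencing, removing an entry range); none of these interfaces or method names exist in BookKeeper today.

// Hypothetical sketch only -- not an existing BookKeeper interface.
interface NamedLedgerManager {

    // 1) Ledgers are addressed by an application-chosen name instead of a
    //    generated long id, so applications need no extra name -> id mapping.
    NamedLedgerHandle createLedger(String ledgerName, byte[] password) throws Exception;

    // 2) Session fencing: a new writer invalidates the previous writer's session
    //    (same mechanism and correctness guarantee as fencing today), recovers the
    //    previous state, and can keep appending to the same ledger.
    NamedLedgerHandle reopenForWrite(String ledgerName, byte[] password) throws Exception;
}

interface NamedLedgerHandle extends AutoCloseable {

    long addEntry(byte[] data) throws Exception;

    // 3) Remove a range of entries (e.g. already-consumed messages) rather than
    //    deleting the ledger, in the spirit of the improved gc in BOOKKEEPER-464.
    void removeEntryRange(long firstEntryId, long lastEntryId) throws Exception;
}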
On Wed, Jan 16, 2013 at 10:58 PM, Sijie Guo <[email protected]> wrote:

> > If this observation is correct, can we acquire those topics lazily instead of eagerly?
>
> Could you add entries to a ledger without opening the ledger?
>
> > Another approach would be to group topics into buckets and have hubs acquiring buckets of topics, all topics of a bucket writing to the same ledgers. There is an obvious drawback to this approach when reading messages for a given topic.
>
> For topic ownership, we could proceed in groups as I proposed in BOOKKEEPER-508. For topic persistence info, we should not do that; otherwise it violates the contract that Hedwig makes and adds complexity into Hedwig itself. Grouping could be considered on the application side, but if the application cannot do that, or finds it difficult, we still have to face that many topics to make Hedwig a scalable platform.
>
> I don't like what you said, that 'the number of accesses to the metadata is still linear on the number of topics'. The metadata access problem is an 'entity locating' problem. Two categories of solutions are used to locate an entity: one is mapping, the other is computation.
>
> Mapping means you need to find a place to store the entity, or to store the location of the entity, like the HDFS NameNode storing the locations of blocks. If an entity exists in this way, you need at least one access to find it, no matter whether it is stored in ZooKeeper, in memory, embedded in bookie servers, or grouped somewhere. You cannot change the 'linear' relationship between the number of accesses and the number of entities. To make it clearer: in HDFS an 'entity' is a data block; if you have the block's location, reading data is fast, but crossing a block boundary requires a metadata lookup. Similarly, in Hedwig an 'entity' is a topic; if the topic is owned, reading messages from that topic is fast. But in either case, the number of accesses is still linear in the number of entities.
>
> Computation means I can compute the location of an entity based on its name, without looking anything up, e.g. consistent hashing.
>
> Currently, BookKeeper only gives out a ledger id, so Hedwig has to find a place to record this ledger id. This is a 'mapping'. So there are two levels of mapping: one from topic name to Hedwig's metadata, the other from Hedwig metadata to the ledger's metadata. You cannot get rid of the 'linear' limitation w/o breaking the mapping between Hedwig metadata and ledger metadata.
>
> What I proposed is still 'linear'. But it consolidates these two pieces of metadata into one, and breaks the mapping between Hedwig metadata and ledger metadata. Although Hedwig still requires one lookup from topic name to metadata, it gives us the opportunity to get rid of this lookup in the future, since we can now use the 'ledger name' to access its entries.
>
> An idea like 'placement groups' in Ceph could be deployed in BookKeeper in the future, grouping ledgers to share the same ensemble configuration. It would reduce the number of mappings from ledger name to ledger metadata, but that should be a different topic to discuss. All I said here is just to demonstrate that the proposal I made could move the improvement in the right direction.
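To illustrate the 'consolidate the two mappings' argument above: with name-based ledgers, Hedwig could derive the ledger name deterministically from the topic name and open it directly, instead of first reading its own metadata to learn a generated ledger id. This is only a sketch on top of the hypothetical NamedLedgerManager above; the naming scheme ("hedwig/topics/" + topic) is an assumption for illustration, not an existing Hedwig convention.

// Sketch: the topic -> ledger mapping becomes a pure computation, so acquiring a
// topic needs no separate metadata read to discover a ledger id.
class TopicPersistenceSketch {

    private final NamedLedgerManager ledgerManager;  // hypothetical, from the sketch above

    TopicPersistenceSketch(NamedLedgerManager ledgerManager) {
        this.ledgerManager = ledgerManager;
    }

    // Derive the ledger name from the topic name instead of storing a generated
    // ledger id in Hedwig's own metadata.
    static String ledgerNameForTopic(String topic) {
        return "hedwig/topics/" + topic;
    }

    NamedLedgerHandle acquireTopic(String topic, byte[] password) throws Exception {
        // A single reopen recovers the previous writer's state and session-fences it;
        // no topic -> ledger-id lookup is needed beforehand.
        return ledgerManager.reopenForWrite(ledgerNameForTopic(topic), password);
    }
}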
> On Wed, Jan 16, 2013 at 1:18 PM, Flavio Junqueira <[email protected]> wrote:
>
>> Thanks for the great discussion so far. I have a couple of comments below:
>>
>> On Jan 16, 2013, at 8:00 PM, Ivan Kelly <[email protected]> wrote:
>>
>> > On Tue, Jan 15, 2013 at 09:33:35PM -0800, Sijie Guo wrote:
>> >>> Originally, it was meant to have a number of long lived subscriptions, over which a lot of data travelled. Now the load has flipped to a large number of short lived subscriptions, over which relatively little data travels.
>> >>
>> >> The topic discussed here doesn't relate to hedwig subscriptions; it is just about how hedwig uses ledgers to store its messages. Even if there are no subscriptions, the problem is still there. The restart of a hub server carrying a large number of topics would hit the metadata storage with many accesses. The hit is a hub server acquiring a topic, no matter whether the subscription is long lived or short lived. After a topic is acquired, the following accesses are in memory, which doesn't cause any performance issue.
>> >
>> > I was using topics and subscriptions to mean the same thing here due to the usecase we have in Yahoo where they're effectively the same thing. But yes, I should have said topic. But my point still stands. Hedwig was designed to deal with fewer topics which had a lot of data passing through them, rather than more topics with very little data passing through them. This is why zk was considered sufficient at that point, as tens of thousands of topics being recovered really isn't an issue for zk. The point I was driving at is that the usecase has changed in a big way, so it may require a big change to handle it.
>>
>> About the restart of a hub, if I understand the problem, many topics will be acquired concurrently, inducing the load spike you're mentioning. If this observation is correct, can we acquire those topics lazily instead of eagerly?
>>
>> >> But we should separate the capacity problem from the software problem. A high performance and scalable metadata storage would help resolve the capacity problem, but either implementing a new one or leveraging a high performance one doesn't change the fact that it still needs so many metadata accesses to acquire a topic. A bad implementation causing that many metadata accesses is a software problem. If we have a chance to improve it, why not?
>> >
>> > I don't think the implementation is bad, but rather the assumptions, as I said earlier. The data:metadata ratio has completely changed. hedwig/bk were designed with a data:metadata ratio of something like 100000:1. What we're talking about now is more like 1:1, and therefore we need to be able to handle an order of magnitude more metadata than previously. Bringing down the number of writes by a factor of 2 or 3, while a nice optimisation, is just putting duct tape on the problem.
>>
>> I love the duct tape analogy. The number of accesses to the metadata is still linear on the number of topics, and reducing the number of accesses per acquisition does not change the complexity. Perhaps distributing the accesses over time, optimized or not, might be a good way to proceed overall. Another approach would be to group topics into buckets and have hubs acquiring buckets of topics, all topics of a bucket writing to the same ledgers. There is an obvious drawback to this approach when reading messages for a given topic.
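As a concrete reading of the 'buckets of topics' idea above, and of the grouped ownership proposed in BOOKKEEPER-508: topics could be hashed into a fixed number of buckets and a hub would acquire a whole bucket at once, so ownership metadata accesses scale with the number of buckets rather than the number of topics. The class below is purely illustrative, not an existing Hedwig API.

// Illustrative sketch only: assign topics to a fixed set of buckets so a hub
// acquires ownership per bucket rather than per topic.
class TopicBucketsSketch {

    private final int numBuckets;

    TopicBucketsSketch(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    // Pure computation: no metadata lookup is needed to find a topic's bucket.
    int bucketOf(String topic) {
        return Math.floorMod(topic.hashCode(), numBuckets);
    }
}

With, say, 1024 buckets, a hub restart touches ownership metadata on the order of the bucket count rather than once per topic, although, as noted above, all topics in a bucket would then share ledgers, which complicates per-topic reads.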
>>
>> >>> The ledger can still be read many times, but you have removed the guarantee that what is read each time will be the same thing.
>> >>
>> >> How do we guarantee a reader's behavior when a ledger is removed at the same time? We don't guarantee it right now, right? It is a similar thing for a 'shrink' operation, which removes part of the entries, while a 'delete' operation removes all the entries.
>> >>
>> >> And if I remember correctly, readers only see the same thing once a ledger is closed. What I proposed doesn't violate this contract. If a ledger is closed (its state is CLOSED), an application can't re-open it. If a ledger isn't closed yet, an application can recover the previous state and continue writing entries using this ledger. Applications could still use ledgers in the 'create-close-create' style, or evolve to the new api for efficiency smoothly, w/o breaking any backward compatibility.
>> >
>> > Ah, yes, I misread your proposal originally; I thought the reopen was working with an already closed ledger.
>> >
>> > On a side note, the reason we have an initial write for fencing is that when the reading client (RC) fences, the servers in the ensemble start returning an error to the writing client (WC). At the moment we don't distinguish between a fencing error and an i/o error, for example. So WC will try to rebuild the ensemble by replacing the erroring servers. Before writing to the new ensemble, it has to update the metadata, and at this point it will see that it has been fenced. With a specific FENCED error, we could avoid this write. This makes me uncomfortable though. What happens if the fenced server fails between being fenced and WC trying to write? It will get a normal i/o error, and will try to replace the server. Since the metadata has not been changed, nothing will stop it, and it may be able to continue writing. I think this is also the case for the session fencing solution.
>> >
>> > -Ivan
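For reference, here is a rough sketch of the writer-side flow Ivan describes, using made-up minimal types rather than the real BookKeeper client classes. The point of interest is that the metadata update during an ensemble change is both the place where a fenced writer is stopped today and the write that a dedicated FENCED error would skip, which is exactly where the race he worries about lives.

// Hedged sketch with made-up types (not the real BookKeeper client classes),
// illustrating the flow described above.
class WriterSideFencingSketch {

    enum AddResult { OK, IO_ERROR, FENCED }           // FENCED is the proposed explicit error
    enum LedgerState { OPEN, IN_RECOVERY, CLOSED }

    interface Bookie {
        AddResult addEntry(long entryId, byte[] data);
    }

    interface MetadataStore {
        LedgerState currentState(String ledgerName);
        boolean casReplaceBookie(String ledgerName, Bookie failed, Bookie replacement);
    }

    void writeEntry(String ledgerName, Bookie bookie, Bookie spare,
                    MetadataStore metadata, long entryId, byte[] data) {
        AddResult result = bookie.addEntry(entryId, data);
        if (result == AddResult.OK) {
            return;
        }
        if (result == AddResult.FENCED) {
            // With an explicit FENCED error the writer stops here, skipping the
            // metadata write below.
            throw new IllegalStateException("ledger " + ledgerName + " was fenced");
        }
        // Plain i/o error: try to rebuild the ensemble. Today this metadata update is
        // where a fenced writer finds out it has been fenced. In the race Ivan
        // describes, the fenced bookie has crashed, the writer only saw IO_ERROR,
        // and if the fencing is not yet reflected in the metadata, both checks below
        // pass and nothing stops the writer from continuing.
        if (metadata.currentState(ledgerName) != LedgerState.OPEN
                || !metadata.casReplaceBookie(ledgerName, bookie, spare)) {
            throw new IllegalStateException("fenced, or lost a metadata race, on " + ledgerName);
        }
        spare.addEntry(entryId, data);
    }
}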
