Thanks Flavio. Please see comments inline.

On Fri, Jan 18, 2013 at 12:31 AM, Flavio Junqueira <[email protected]>wrote:

> Hi Sijie,
>
> See a couple of comments below:
>
> On Jan 17, 2013, at 7:58 AM, Sijie Guo <[email protected]> wrote:
>
> >> If this observation is correct, can we acquire those topics lazily
> >> instead of eagerly?
> >
> > could you add entries to a ledger without opening the ledger?
> >
>
> By acquiring topics lazily, I meant to say that we delay topic acquisition
> until there is an operation for such a topic. This is a pretty standard
> technique to avoid load spikes.
>
> I'm sorry if you know the concept already, but to make sure we understand
> each other, this is the technique I'm referring to:
>
>         http://en.wikipedia.org/wiki/Lazy_loading
>
> Jiannan pointed out one issue with this approach, though. Acquiring topics
> lazily implies that some operation triggers the acquisition of the topic.
> Such a trigger could be, say, a publish operation. In the case of
> subscriptions, however, there is no operation that could trigger the
> acquisition.
>

You need to first check out the operations in Hedwig before talking about
lazy loading. In Hedwig, most of the operations are publishes and
subscribes. A publish request requires the ledger to exist before entries
can be added, so you cannot load the persistence info (ledger metadata)
lazily. A subscribe request means someone needs to read entries for
delivery, which requires the ledger to be opened, so again you cannot load
the persistence info lazily. So for persistence info, there is no chance
to do lazy loading. That is why I asked the question.
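
To make this concrete, here is a minimal sketch against the plain
BookKeeper client API (the ZooKeeper address and password are just
placeholders). Both the publish path and the subscribe path go through a
LedgerHandle, so the persistence info must be loaded either way:

    import java.util.Enumeration;
    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerEntry;
    import org.apache.bookkeeper.client.LedgerHandle;

    public class EagerPersistenceInfo {
        public static void main(String[] args) throws Exception {
            BookKeeper bk = new BookKeeper("zk1:2181"); // placeholder ZK address

            // Publish path: entries can only be added through a live
            // LedgerHandle, so the ledger (and its metadata) must exist
            // before the first publish can be served.
            LedgerHandle writer = bk.createLedger(DigestType.CRC32, "pw".getBytes());
            writer.addEntry("message-1".getBytes());
            writer.close();

            // Subscribe path: delivery means reading entries, which again
            // requires opening the ledger, i.e. loading its metadata first.
            LedgerHandle reader =
                bk.openLedger(writer.getId(), DigestType.CRC32, "pw".getBytes());
            Enumeration<LedgerEntry> entries =
                reader.readEntries(0, reader.getLastAddConfirmed());
            while (entries.hasMoreElements()) {
                System.out.println(new String(entries.nextElement().getEntry()));
            }
            reader.close();
            bk.close();
        }
    }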

The issue Jiannan created to work on this idea is actually what I
suggested he do. Since you already brought up lazy loading, let me clarify
where it can apply:

1) For a publish request triggering topic acquisition, we don't need to
load the subscriptions of this topic. That is lazy loading of subscription
metadata.
2) For a subscribe request triggering topic acquisition, we cannot load
the persistence info for the topic lazily, but we can defer the ledger
creation and persistence-info update until the next publish request. That
is lazy creation of a new ledger (see the sketch below).

These are the only two points, based on the current implementation, where
I can see lazy loading applying. If I missed something, please point it
out.
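
As a rough illustration of point 2, a hypothetical hub-side sketch (the
class and method names are mine, not Hedwig's): topic acquisition
allocates no ledger, and the first publish pays for the creation.

    import java.util.concurrent.atomic.AtomicReference;
    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerHandle;

    // Hypothetical holder for a topic's persistence info. On topic
    // acquisition (e.g. triggered by a subscribe), no ledger is created;
    // creation is deferred until the first publish arrives.
    class TopicPersistenceInfo {
        private final BookKeeper bk;
        private final AtomicReference<LedgerHandle> current =
                new AtomicReference<LedgerHandle>();

        TopicPersistenceInfo(BookKeeper bk) {
            this.bk = bk;
        }

        // Called on the first publish for the topic: only now create the
        // ledger and update the topic's persistence metadata.
        LedgerHandle ledgerForPublish() throws Exception {
            LedgerHandle lh = current.get();
            if (lh == null) {
                LedgerHandle created = bk.createLedger(DigestType.CRC32, "pw".getBytes());
                if (current.compareAndSet(null, created)) {
                    // persist the new ledger id into the topic metadata here
                    lh = created;
                } else {
                    created.close(); // lost the race; discard the extra ledger
                    lh = current.get();
                }
            }
            return lh;
        }
    }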



>
>
> >> Another approach would be to group topics into buckets and have hubs
> >> acquiring buckets of topics, all topics of a bucket writing to the same
> >> ledgers. There is an obvious drawback to this approach when reading
> >> messages for a given topic.
> >
> > For topic ownership, we could proceed in groups as I proposed in
> > BOOKKEEPER-508. For topic persistence info, we should not do that;
> > otherwise it violates the contract that Hedwig made, and you add
> > complexity into Hedwig itself. Grouping could be considered on the
> > application side, but if an application cannot do that, or finds it
> > difficult, we still have to face that many topics to make Hedwig a
> > scalable platform.
>
> I didn't mean to re-propose groups, and I'm happy you did it already. My
> point was simply that the complexity is still linear, and I think that to
> have lower complexity we would need some technique like grouping.
>
> >
> > I don't like what you said, 'the number of accesses to the metadata is
> > still linear on the number of topics'. The metadata-access problem is an
> > 'entity locating' problem. Two categories of solutions are used to
> > locate an entity: one is mapping, the other is computation.
> >
>
>
> We may not like it, but it is still linear. :-)
>
>
> > Mapping means you need to find a place to store the entity or to store
> > the location of the entity, like HDFS storing the locations of blocks in
> > the Namenode. If an entity exists in this scheme, you need at least one
> > access to find it, no matter whether it is stored in ZooKeeper, in
> > memory, embedded in bookie servers, or grouped somewhere. You cannot
> > change the 'linear' relationship between the number of accesses and the
> > number of entities. To make it clearer: in HDFS the 'entity' is a data
> > block; if you have the entity's location, reading data is fast, but
> > crossing a block boundary requires a metadata lookup. Similarly, in
> > Hedwig the 'entity' is a topic; if the topic is owned, reading messages
> > from this topic is fast. But in either case, the number of accesses is
> > still linear in the number of entities.
>
>
> Agreed, you need one access per entity. The two techniques I know of,
> and that are often used to deal efficiently with the large number of
> accesses to such entities, are lazy loading, which spreads the accesses
> out over time and allows the system to make progress without waiting for
> all of the accesses to occur, and grouping, which makes the granularity
> of entities coarser, as you say.
>

Agreed on the lazy loading and grouping ideas.

But as I also clarified in previous emails, reducing the complexity can
proceed along multiple dimensions. The proposal is one dimension; it is
not perfect, but it is a starting point, and we can improve it through
iterations.


>
> >
> > Computation means I can compute the location of an entity based on its
> > name, without looking it up anywhere, e.g. consistent hashing.
> >
> > Currently, BookKeeper only gives out a ledger id, so Hedwig has to find
> > a place to record this ledger id. This is a 'mapping', and in fact a
> > two-level mapping: one from topic name to Hedwig's metadata, the other
> > from Hedwig's metadata to the ledger's metadata. You cannot get rid of
> > the 'linear' limitation without breaking the mapping between Hedwig
> > metadata and ledger metadata.
>
> This is one kind of task that we have designed ZK for, storing metadata on
> behalf of the application, although you seem to have reached the limits of
> ZK ;-). I don't think it is bad to expect the application to store the
> sequence of ledgers it has written elsewhere. But again, this is one of the
> things that managed ledgers could do for you. It won't solve the metadata
> issue, but it does solve the abstraction issue.
>

> I don't think it is bad to expect the application to store the sequence
> of ledgers it has written elsewhere.

It depends. Some applications are happy to do that, but that style is a
bad fit for others. To clarify again: I am not saying we should get rid of
the ledger-id style. What I said is that a ledger id could come from a
ledger-id generator plus a ledger-name scheme. The proposal just provides
a lower-level API for applications to optimize their performance.

Secondly, as I clarified before and will repeat here: storing ledger ids
and rolling to a new ledger for writing not only introduces a
metadata-access spike during startup but also introduces a latency spike
in the publish path. Applications cannot get rid of this problem if they
stick to that style. The reason is the same as above: 'could you add
entries before opening a ledger?'

One more thing you might ask is 'why do we need to change ledgers over
the life of a topic?' Let me clarify before you ask: the ledger-changing
feature was added so that a topic has a chance to consume ('delete') its
entries. You could configure a higher threshold to defer the ledger
change, but that doesn't change the fact that a latency spike occurs when
the ledger change happens.
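
To illustrate the spike, here is a sketch of the shape of today's publish
path when a ledger change is due (the names and the roll condition are
illustrative, not Hedwig's actual code). The unlucky publish pays for
several metadata operations before its addEntry can run:

    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerHandle;

    class RollingPublisher {
        private final BookKeeper bk;
        private final byte[] pw = "pw".getBytes();
        private final long maxEntriesPerLedger;
        private LedgerHandle current;
        private long entriesInCurrentLedger = 0;

        RollingPublisher(BookKeeper bk, long maxEntriesPerLedger) throws Exception {
            this.bk = bk;
            this.maxEntriesPerLedger = maxEntriesPerLedger;
            this.current = bk.createLedger(DigestType.CRC32, pw);
        }

        long publish(byte[] msg) throws Exception {
            if (entriesInCurrentLedger >= maxEntriesPerLedger) {
                current.close();                                 // metadata write
                current = bk.createLedger(DigestType.CRC32, pw); // metadata write
                // ...plus persisting the new ledger id into the topic metadata
                entriesInCurrentLedger = 0;
            }
            entriesInCurrentLedger++;
            return current.addEntry(msg); // the only bookie-side operation
        }
    }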

What I tried to propose is to remove any potential metadata accesses from
the critical path, such as publish. The proposal is friendly to consuming
(deleting) entries without affecting publish latency on the critical path.

> But again, this is one of the things that managed ledgers could do for
> you.

Managed ledgers bring in too many concepts, like cursors. The interface
could look like managed ledgers, I agree. If the interface is a concern
for you, I am OK with making a change.


>
> >
> > What I proposed is still 'linear'. But it consolidates these two pieces
> > of metadata into one, and breaks the mapping between Hedwig metadata and
> > ledger metadata. Although Hedwig still requires one lookup from topic
> > name to metadata, it opens up the opportunity to get rid of this lookup
> > in the future, since we could now use the 'ledger name' to access a
> > ledger's entries.
> >
>
> Closing a ledger has one fundamental property, which is having consensus
> upon the content of a ledger. I was actually wondering how much this
> property matters to Hedwig; it certainly does for HDFS.
>

Closing only matters for applications that have multiple readers; they
need agreement on the set of entries in a ledger. But so far, I don't see
any applications using multiple readers with BookKeeper. Hedwig doesn't,
and the Namenode doesn't either (if I understand correctly; please point
it out if I am wrong). For most applications, the reader and the writer
are actually the same one, so why do they need to close a ledger? Why
can't they just re-open it and append entries? That is a more natural way
to do it than 'close one, create a new one, and record the ledger ids to
track them'.
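
For concreteness, here is a sketch of that style. Note that reopenLedger
below is the proposed (hypothetical) API, not something BookKeeper exposes
today:

    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerHandle;

    interface NamedLedgerClient {
        // Proposed (hypothetical) call: recover the unclosed ledger named
        // after the topic and keep appending to it, instead of closing it
        // and creating a fresh ledger whose id must be recorded elsewhere.
        LedgerHandle reopenLedger(String ledgerName, DigestType type, byte[] passwd)
                throws Exception;
    }

    class SingleWriterTopic {
        private final NamedLedgerClient client;

        SingleWriterTopic(NamedLedgerClient client) {
            this.client = client;
        }

        void onTopicAcquired(String topic) throws Exception {
            // The same process is both writer and reader, so no close is
            // needed for agreement on the entry set: recover and continue.
            LedgerHandle lh = client.reopenLedger(topic, DigestType.CRC32, "pw".getBytes());
            lh.addEntry("next-message".getBytes());
        }
    }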

> 'it certainly does for HDFS'

I doubt your certainty about 'close' for HDFS. To my knowledge, the
Namenode uses BookKeeper for failover, so the entries are not read until a
failover happens. So actually, it doesn't need to close the ledger, per my
explanation above.


>
>
> > An idea like 'placement group' in Ceph could be deployed in BookKeeper
> > in the future, grouping ledgers to share the same ensemble
> > configuration. It would reduce the number of mappings from ledger name
> > to ledger metadata, but it should be a different topic to discuss. All I
> > said here is just to demonstrate that the proposal I made could move the
> > improvement in the right direction.
> >
>
> That's a good idea, although I wonder how much each group would actually
> share. For example, if I delete one ledger in the group, would I delete the
> ledger only when all other ledgers in the group are also deleted?
>

I don't have a complete view of this idea yet, but the initial idea is:

For a 'placement group', there is only one ensemble, a kind of super
ensemble, and there is no ensemble change within the placement group. So:

1) How do we handle bookie failures without an ensemble change? We already
have the write quorum and ack quorum, which help tolerate bookie failures.
This is why I call it a super ensemble: the number of bookies may need to
be large enough to keep a sufficient ack quorum for some desired level of
availability.
2) Permanent bookie failure and re-replication. A placement group is also
friendly to re-replication: re-replication happens only among the bookies
in the placement group, so it is self-organized and purely distributed.
When a bookie fails permanently, re-replication can pick another node to
replace it. To rebuild the failed bookie, we just contact the other
bookies in the placement group and ask for the entries they hold for this
placement group; we don't need to touch any metadata.
3) Garbage collection. Since there is no ensemble change, there are no
zombie entries, so an improved gc algorithm would work very well here.
4) Actually, I am thinking we might not need to record ledger metadata at
all; we could use some kind of hashing or prefixing mechanism to identify
which placement group a ledger is in (see the sketch after this list). The
one point to consider is how to tell whether a ledger does not exist or
simply has no entries; if that doesn't matter, it might work.
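
A tiny sketch of point 4, locating a placement group by computation (the
group count and the use of CRC32 here are illustrative assumptions only):

    import java.util.zip.CRC32;

    class PlacementGroupLocator {
        private final int numGroups;

        PlacementGroupLocator(int numGroups) {
            this.numGroups = numGroups;
        }

        // Compute the placement group from the ledger name alone;
        // no metadata access is needed.
        int groupOf(String ledgerName) {
            CRC32 crc = new CRC32();
            crc.update(ledgerName.getBytes());
            return (int) (crc.getValue() % numGroups);
        }
    }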

These are just some initial ideas. If you are interested, we could open
another thread for discussion.



>
> -Flavio
>
> >
> >
> > On Wed, Jan 16, 2013 at 1:18 PM, Flavio Junqueira <[email protected]>
> > wrote:
> >
> >> Thanks for the great discussion so far. I have a couple of comments
> >> below:
> >>
> >> On Jan 16, 2013, at 8:00 PM, Ivan Kelly <[email protected]> wrote:
> >>
> >>> On Tue, Jan 15, 2013 at 09:33:35PM -0800, Sijie Guo wrote:
> >>>>> Originally, it was meant to have a number of
> >>>>> long lived subscriptions, over which a lot of data travelled. Now the
> >>>>> load has flipped to a large number of short lived subscriptions, over
> >>>>> which relatively little data travels.
> >>>>
> >>>> The topic discussed here doesn't relate to Hedwig subscriptions; it
> >>>> is just about how Hedwig uses ledgers to store its messages. Even if
> >>>> there are no subscriptions, the problem is still there. The restart
> >>>> of a hub server carrying a large number of topics would hit the
> >>>> metadata storage with many accesses. The hit occurs when a hub server
> >>>> acquires a topic, no matter whether the subscription is long lived or
> >>>> short lived. After a topic is acquired, the following accesses are in
> >>>> memory, which doesn't cause any performance issue.
> >>> I was using topics and subscriptions to mean the same thing here due
> >>> to the usecase we have in Yahoo where they're effectively the same
> >>> thing. But yes, I should have said topic. But my point still
> >>> stands. Hedwig was designed to deal with fewer topics, which had a lot
> >>> of data passing through them, rather than more topics, with very
> >>> little data passing through them. This is why zk was considered
> >>> sufficient at that point, as tens of thousands of topics being
> >>> recovered really isn't an issue for zk. The point I was driving at is
> >>> that the usecase has changed in a big way, so it may require a big
> >>> change to handle it.
> >>>
> >>
> >> About the restart of a hub, if I understand the problem, many topics
> >> will be acquired concurrently, inducing the load spike you're
> >> mentioning. If this observation is correct, can we acquire those topics
> >> lazily instead of eagerly?
> >>
> >>>> But we should separate the capacity problem from the software
> >>>> problem. A high-performance, scalable metadata storage would help
> >>>> resolve the capacity problem, but neither implementing a new one nor
> >>>> leveraging a high-performance one changes the fact that it still
> >>>> needs so many metadata accesses to acquire a topic. A bad
> >>>> implementation causing so many metadata accesses is a software
> >>>> problem. If we had a chance to improve it, why not?
> >>> I don't think the implementation is bad, but rather the assumptions,
> >>> as I said earlier. The data:metadata ratio has changed completely.
> >>> hedwig/bk were designed with a data:metadata ratio of something like
> >>> 100000:1. What we're talking about now is more like 1:1 and therefore
> >>> we need to be able to handle an order of magnitude more metadata than
> >>> previously. Bringing down the number of writes by an order of 2 or 3,
> >>> while a nice optimisation, is just putting duct tape on the problem.
> >>
> >>
> >> I love the duct tape analogy. The number of accesses to the metadata is
> >> still linear on the number of topics, and reducing the number of
> >> accesses per acquisition does not change the complexity. Perhaps
> >> distributing the accesses over time, optimized or not, might be a good
> >> way to proceed overall. Another approach would be to group topics into
> >> buckets and have hubs acquiring buckets of topics, all topics of a
> >> bucket writing to the same ledgers. There is an obvious drawback to
> >> this approach when reading messages for a given topic.
> >>
> >>>
> >>>>
> >>>>> The ledger can still be read many times, but you have removed the
> >>>>> guarantee that what is read each time will be the same thing.
> >>>>
> >>>> How do we guarantee a reader's behavior when a ledger is removed at
> >>>> the same time? We don't guarantee it right now, right? Is it a
> >>>> similar thing for a 'shrink' operation, which removes part of the
> >>>> entries, while a 'delete' operation removes all the entries?
> >>>>
> >>>> And if I remember correctly, readers only see the same thing when a
> >>>> ledger is closed. What I proposed doesn't violate this contract. If
> >>>> a ledger is closed (its state is CLOSED), an application can't
> >>>> re-open it. If a ledger isn't closed yet, an application can recover
> >>>> the previous state and continue writing entries to this ledger.
> >>>> Applications could still use the 'create-close-create' style of
> >>>> using ledgers, or smoothly evolve to the new api for efficiency, w/o
> >>>> breaking any backward compatibility.
> >>> Ah, yes, I misread your proposal originally; I thought the reopen was
> >>> working with an already closed ledger.
> >>>
> >>> On a side note, the reason we have an initial write for fencing is
> >>> that when the reading client (RC) fences, the servers in the ensemble
> >>> start returning an error to the writing client (WC). At the moment we
> >>> don't distinguish between a fencing error and an i/o error, for
> >>> example. So WC will try to rebuild the ensemble by replacing the
> >>> erroring servers. Before writing to the new ensemble, it has to update
> >>> the metadata, and at this point it will see that it has been
> >>> fenced. With a specific FENCED error, we could avoid this write. This
> >>> makes me uncomfortable though. What happens if the fenced server fails
> >>> between being fenced and WC trying to write? It will get a normal i/o
> >>> error and will try to replace the server. Since the metadata has not
> >>> been changed, nothing will stop it, and it may be able to continue
> >>> writing. I think this is also the case for the session fencing
> >>> solution.
> >>>
> >>> -Ivan
> >>
> >>
>
>
