Check comments inline...
On Fri, Jan 18, 2013 at 6:08 AM, Flavio Junqueira <[email protected]> wrote:

> One problem for me with this thread is that there are parallel discussions it refers to, and the information seems to be spread across different jiras, other threads, and private communication. I personally find it hard to be constructive and converge like that, which I think should be our goal. Can we find a better way of making progress, perhaps discussing issues separately and prioritizing them? One possibility is that we have open calls (e.g., Skype, Google Hangout) to discuss concrete issues. We should of course post our findings to the list or wiki, whichever is more convenient.

I tried my best not to let this thread get out of control. But I am not sure whether my gist is written so badly that you don't understand it, or whether you haven't taken a look at it and its references. I found that I just end up clarifying, again and again, points that are already made in the document. I was tired of that (it meant lots of context switches outside working time). I don't think any open calls would be efficient before we are on the same page, so I would step back and discuss the ledger id and its working style in another thread first.

> For the moment, I'll add some more comments trying to focus on clarifications.
>
> >> On Jan 17, 2013, at 7:58 AM, Sijie Guo <[email protected]> wrote:
> >>
> >>>> If this observation is correct, can we acquire those topics lazily instead of eagerly?
> >>>
> >>> could you add entries to a ledger without opening the ledger?
> >>
> >> By acquiring topics lazily, I meant to say that we delay topic acquisition until there is an operation for such a topic. This is a pretty standard technique to avoid load spikes.
> >>
> >> I'm sorry if you know the concept already, but to make sure we understand each other, this is the technique I'm referring to:
> >>
> >> http://en.wikipedia.org/wiki/Lazy_loading
> >>
> >> Jiannan pointed out one issue with this approach, though. Acquiring topics lazily implies that some operation triggers the acquisition of the topic. Such a trigger could be, say, a publish operation. In the case of subscriptions, however, there is no operation that could trigger the acquisition.
> >
> > You need to first check out the operations in Hedwig before talking about lazy loading. For Hedwig, most of the operations are publish and subscribe. A publish request requires a ledger to be created to add entries, so you cannot load the persistence info (ledger metadata) lazily; a subscribe request means someone needs to read entries to deliver, which requires a ledger to be opened to read entries, so you still cannot load the persistence info (ledger metadata) lazily. So for persistence info, you have no chance to do lazy loading. That is why I asked the question.
> >
> > The issue Jiannan created to work on this idea is actually something I suggested he do.
>
> I don't know which issue you're referring to. Is this a jira issue?

You said "Jiannan pointed out one issue with this approach, though.", so I assumed you already knew about the issue. But it seems that you didn't... The issue is https://issues.apache.org/jira/browse/BOOKKEEPER-518, and it is also a reference in my gist.
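To make the lazy-loading technique under discussion concrete, here is a minimal sketch in Java. LazyTopicCache, TopicMetadata, and loadFromMetadataStore are invented names for illustration; this is not Hedwig's actual code.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Lazy topic acquisition: a topic's metadata is fetched from the metadata
    // store only on the first operation that touches the topic, instead of
    // eagerly for every owned topic at hub startup.
    class LazyTopicCache {
        private final ConcurrentMap<String, TopicMetadata> topics =
            new ConcurrentHashMap<String, TopicMetadata>();

        TopicMetadata acquire(String topic) {
            TopicMetadata md = topics.get(topic);
            if (md == null) {
                // One metadata access per topic, paid only when the topic is
                // first used; this spreads the accesses over time.
                TopicMetadata loaded = loadFromMetadataStore(topic);
                TopicMetadata raced = topics.putIfAbsent(topic, loaded);
                md = (raced == null) ? loaded : raced;
            }
            return md;
        }

        private TopicMetadata loadFromMetadataStore(String topic) {
            // Placeholder for a read against ZooKeeper or another store.
            return new TopicMetadata(topic);
        }
    }

    class TopicMetadata {
        final String name;
        TopicMetadata(String name) { this.name = name; }
    }

As the reply above notes, in Hedwig the persistence info itself cannot be loaded this way, since both publish and subscribe need the ledger; only subscription metadata and ledger creation are candidates for deferral.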
> > Since you already talked about lazy loading, I would clarify the idea of lazy loading:
> >
> > 1) for a publish request triggering topic acquisition, we don't need to load the subscriptions of this topic. That is lazy loading for subscription metadata.
> > 2) for a subscribe request triggering topic acquisition, we cannot avoid loading the persistence info for the topic, but we could defer the ledger creation and the persistence info update until the next publish request. That is lazy creation of new ledgers.
> >
> > These are the only two points I could see, based on the current implementation, where we could do lazy loading. If I missed something, you could point it out.
>
> Thanks for the clarification. I don't have anything to add to the list.
>
> >>>> Another approach would be to group topics into buckets and have hubs acquiring buckets of topics, all topics of a bucket writing to the same ledgers. There is an obvious drawback to this approach when reading messages for a given topic.
> >>>
> >>> For topic ownership, we could proceed in groups, as I proposed in BOOKKEEPER-508. For topic persistence info, we should not do that; otherwise it violates the contract that Hedwig made, and you add complexity to Hedwig itself. Grouping could be considered on the application side, but if an application cannot do that, or finds it difficult, we still have to face that many topics to make Hedwig a scalable platform.
> >>
> >> I didn't mean to re-propose groups, and I'm happy you did it already. My point was simply that the complexity is still linear, and I think that to have lower complexity we would need some technique like grouping.
> >>
> >>> I don't like what you said, 'the number of accesses to the metadata is still linear on the number of topics'. The metadata access problem is an 'entity locating' problem. Two categories of solutions are used to locate an entity: one is mapping, the other is computation.
> >>
> >> We may not like it, but it is still linear. :-)
> >>
> >>> Mapping means you need to find a place to store this entity, or to store the location of this entity, like HDFS storing the locations of blocks in the NameNode. If an entity exists in this way, you need at least one access to find it, no matter whether it is stored in ZooKeeper, in memory, embedded in bookie servers, or grouped somewhere. You cannot change the 'linear' relationship between the number of accesses and the number of entities. To make it clearer: in HDFS, 'entity' means a data block; if you have an entity's location, reading data is fast, but crossing a block boundary requires a metadata lookup. Similarly, in Hedwig, 'entity' means a topic; if the topic is owned, reading messages from this topic is fast. But in either case, the number of accesses is still linear in the number of entities.
> >>
> >> Agreed, you need one access per entity. The two techniques I know of that are often used to deal efficiently with the large number of accesses to such entities are lazy loading, which spreads the accesses more over time and allows the system to make progress without waiting for all of the accesses to occur, and grouping, which makes the granularity of entities coarser, as you say.
> >
> > Agreed on the lazy loading and grouping ideas.
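To illustrate the mapping-versus-computation distinction discussed above: the 'mapping' style is an explicit table (for example, a znode per topic) and costs one metadata access per entity, while the 'computation' style derives a location from the name alone. Below is a minimal consistent-hashing sketch in Java; ConsistentHashLocator and its methods are invented names, not part of BookKeeper or Hedwig.

    import java.nio.charset.StandardCharsets;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.zip.CRC32;

    // 'Computation' style entity location: a tiny consistent-hash ring. The
    // location of an entity is computed from its name, with no lookup in any
    // metadata store. Assumes at least one server has been added.
    class ConsistentHashLocator {
        private final SortedMap<Long, String> ring = new TreeMap<Long, String>();

        void addServer(String server) {
            ring.put(hash(server), server);
        }

        // Locate an entity: hash its name, then walk clockwise on the ring
        // to the next server, wrapping around at the end.
        String locate(String entity) {
            long h = hash(entity);
            SortedMap<Long, String> tail = ring.tailMap(h);
            Long key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
            return ring.get(key);
        }

        private long hash(String s) {
            CRC32 crc = new CRC32();
            crc.update(s.getBytes(StandardCharsets.UTF_8));
            return crc.getValue();
        }
    }

A real ring would usually place several virtual nodes per server to smooth the distribution; that detail is elided here.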
> > But as I also clarified somewhere in previous emails, reducing the complexity can happen along multiple dimensions. The proposal is one dimension; although not great, it is a starting point, and we have to improve it by iterations.
>
> >>> Computation means I could compute the location of an entity based on its name, without looking it up anywhere, e.g., consistent hashing.
> >>>
> >>> Currently, BookKeeper only gives out a ledger id, so Hedwig has to find a place to record this ledger id. This is a 'mapping'. So there are two levels of mapping: one is topic name to Hedwig's metadata, the other is Hedwig metadata to ledger metadata. You cannot get rid of the 'linear' limitation without breaking the mapping between Hedwig metadata and ledger metadata.
> >>
> >> This is one kind of task that we have designed ZK for, storing metadata on behalf of the application, although you seem to have reached the limits of ZK ;-). I don't think it is bad to expect the application to store the sequence of ledgers it has written elsewhere. But again, this is one of the things that managed ledgers could do for you. It won't solve the metadata issue, but it does solve the abstraction issue.
>
> >> I don't think it is bad to expect the application to store the sequence of ledgers it has written elsewhere.
> >
> > It depends. Some applications are happy to do that, but for some applications such a style might work badly. To clarify again, I am not saying we should get rid of the ledger-id style. But as I clarified, the ledger id could be based on a ledger id generation plus ledger name solution. The proposal just provides a lower-level API for applications to optimize their performance.
> >
> > Secondly, I clarified before and will again here: storing a ledger id and rolling to a new ledger for writing not only introduces a metadata access spike during startup but also introduces a latency spike in the publish path. Applications cannot get rid of this problem if they stick to that style. The reason is the same as above: 'could you add entries before opening a ledger?'
>
> Rolling a ledger doesn't have to be in the critical path of a publish request. When we roll a ledger, is it synchronous with respect to the publish request that triggered it?
>
> > One more thing you might ask is 'why do we need to change ledgers during the life of a topic'. I would clarify here before you ask: the ledger-changing feature was added to give a topic the chance to consume ('delete') its entries. You could configure a higher number to defer the ledger change, but that doesn't change the fact that there is a latency spike when the ledger change happens.
>
> Don't you think that we can get rid of this latency spike by rolling ledgers asynchronously? I'm not too familiar with that part of the code, so you possibly have a better feeling for whether it can be done.
>
> > What I tried to propose removes any potential metadata accesses from the critical path, like publish. The proposal is friendly to consuming (deleting) entries without affecting publish latency on the critical path.
>
> Overall, trying to remove or reduce access to metadata from the critical path sounds good.
>
> >> But again, this is one of the things that managed ledgers could do for you.
> >
> > Managed ledgers bring in too many concepts, like cursors. The interface could look like managed ledgers, I agree. If the interface is one concern for you, I am OK with making a change.
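To make the asynchronous-rolling question concrete, here is a minimal sketch of creating the next ledger off the publish path. asyncCreateLedger is BookKeeper's real client call, but AsyncLedgerRoller and the swap strategy are invented for illustration; persisting the new ledger id and closing the old ledger are elided.

    import java.util.concurrent.atomic.AtomicReference;

    import org.apache.bookkeeper.client.AsyncCallback.CreateCallback;
    import org.apache.bookkeeper.client.BKException;
    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerHandle;

    // Sketch: create the next ledger asynchronously while publishes keep
    // writing to the current ledger; nothing on the publish path waits for
    // the roll to finish.
    class AsyncLedgerRoller {
        private final BookKeeper bk;
        private final AtomicReference<LedgerHandle> current =
            new AtomicReference<LedgerHandle>();

        AsyncLedgerRoller(BookKeeper bk) {
            this.bk = bk;
        }

        void rollLedger(byte[] passwd) {
            // asyncCreateLedger returns immediately; publishes continue
            // against current.get() until the callback swaps in the new
            // handle.
            bk.asyncCreateLedger(3, 2, DigestType.CRC32, passwd,
                new CreateCallback() {
                    @Override
                    public void createComplete(int rc, LedgerHandle lh, Object ctx) {
                        if (rc == BKException.Code.OK) {
                            LedgerHandle old = current.getAndSet(lh);
                            // Persist (topic -> new ledger id) off the publish
                            // path, then close `old` once its outstanding adds
                            // have completed (elided here).
                        }
                    }
                }, null);
        }
    }

Under this sort of scheme, only the swap itself is visible to publishers, which speaks to Flavio's question about whether the roll must be synchronous with the publish that triggered it.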
> I'm not suggesting we change the interface to look like managed ledgers; I'm saying that this functionality can be built on top, while keeping the abstraction BookKeeper exposes simple.
>
> >>> What I proposed is still 'linear'. But it consolidates these two pieces of metadata into one, and breaks the mapping between Hedwig metadata and ledger metadata. Although Hedwig still requires one lookup from topic name to metadata, it opens the opportunity to get rid of this lookup in the future, since we could now use the 'ledger name' to access its entries.
> >>
> >> Closing a ledger has one fundamental property, which is having consensus upon the content of a ledger. I was actually wondering how much this property matters to Hedwig; it certainly does for HDFS.
> >
> > Being closed only matters for applications that have multiple readers, which need agreement on the set of entries in a ledger. But so far, I don't see any applications using multiple readers with BookKeeper. Hedwig doesn't, and the NameNode doesn't either (if I understand correctly; please point it out if I am wrong). For most applications, the reader and the writer are actually the same one, so why do they need to close a ledger? Why can't they just re-open it and append entries? That is a more natural way to work than 'close one, create a new one, and record ledger ids to track them'.
> >
> >> it certainly does for HDFS
> >
> > I doubt your certainty about 'close' for HDFS. To my knowledge, the NameNode uses BookKeeper for failover, so the entries aren't read until a failover happens. Actually, it doesn't need to close the ledger, per my explanation above.
>
> HDFS, as far as I can tell, allows multiple standbys. I believe the documentation says that typically there is one standby, but it doesn't limit it to one, and the configuration allows for more. For Hedwig, it matters if you have different hubs reading the same ledger over time. Say that:
>
> - hub 1 creates a ledger for a topic
> - hub 2 becomes the owner of the topic, reads the (open) ledger, delivers to some subscriber, and crashes
> - hub 3 becomes the new owner of the topic, reads the (open) ledger, and also delivers to a different subscriber.
>
> If we don't have agreement, hubs 2 and 3 might disagree on the messages they see. I suppose a similar argument holds for Cloud Messaging at Yahoo! and Hubspot, but I don't know enough about Hubspot to say for sure.
>
> About re-opening being more natural: systems like ZooKeeper and HDFS have used this pattern of rolling logs; even personal computers use this pattern. At this point, it might be a matter of preference, I'm not sure, but I really don't feel it is more natural to re-open.
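For reference, the agreement that Flavio describes is what BookKeeper's recovery open provides today: a hub taking over a topic can fence the ledger so the entry set is fixed before anything is delivered. A minimal sketch against the real client API (the TakeoverSketch class and its parameters are placeholders):

    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerHandle;

    // When a new hub takes over a topic, opening the ledger in recovery mode
    // fences it: the old writer can no longer add entries, and the entry set
    // is agreed upon before anything is delivered to subscribers.
    class TakeoverSketch {
        long takeOver(BookKeeper bk, long ledgerId, byte[] passwd) throws Exception {
            // openLedger (unlike openLedgerNoRecovery) runs recovery and
            // fences the ledger against the previous writer.
            LedgerHandle lh = bk.openLedger(ledgerId, DigestType.CRC32, passwd);
            return lh.getLastAddConfirmed(); // the agreed-upon last entry id
        }
    }

Reading with openLedgerNoRecovery, by contrast, skips fencing, which is exactly where hubs 2 and 3 in the scenario above could disagree.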
> >>> An idea like 'placement groups' in Ceph could be deployed in BookKeeper in the future, grouping ledgers so that they share the same ensemble configuration. It would reduce the number of mappings between ledger name and ledger metadata, but it should be a different topic to discuss. All I said here is just to demonstrate that the proposal I made could move the improvement in the right direction.
> >>
> >> That's a good idea, although I wonder how much each group would actually share. For example, if I delete one ledger in the group, would I delete the ledger only when all other ledgers in the group are also deleted?
> >
> > I don't have a whole view of this idea yet. But the initial idea is:
> >
> > For a 'placement group', there is only one ensemble. It is a kind of super ensemble; there is no ensemble change within a placement group. So:
> >
> > 1) How to handle bookie failure without an ensemble change: we already have the write quorum and ack quorum, which help in the face of bookie failure. This is why I call it a super ensemble; the number of bookies might need to be large enough to tolerate failures while still forming an ack quorum, for some desired level of availability.
> > 2) Permanent bookie failure and re-replication: a placement group is also friendly to re-replication. Re-replication happens only among the bookies of the placement group, so it is self-organized and purely distributed. Facing a permanent bookie failure, re-replication can pick another node to replace it. To rebuild the failed bookie, we just need to ask the other bookies in the placement group for the entries they hold for this placement group; we don't need to touch any metadata.
> > 3) Garbage collection: as there is no ensemble change, there are no zombie entries, so an improved gc algorithm would work very well here.
> > 4) Actually, I am thinking we might not need to record ledger metadata at all. We could use some kind of hashing or prefixing mechanism to identify which placement group a ledger is in. The one point to consider is how to tell whether a ledger does not exist or just has no entries; if that doesn't matter, then it might work.
> >
> > These are just some initial ideas. If you are interested, we could open another thread for discussion.
>
> It would be good to discuss this, but we definitely need to separate and prioritize the discussions, otherwise it becomes unmanageable.
>
> -Flavio
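Point 4 above, identifying a placement group by computation rather than by recorded metadata, might look like the following minimal sketch. The class, the scheme, and numGroups are all invented for illustration, and the sketch inherits the open question of telling a missing ledger apart from an empty one.

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    // Sketch of deriving a ledger's placement group purely from its name, so
    // no per-ledger metadata record is needed. Note that this scheme cannot
    // distinguish "ledger does not exist" from "ledger exists but has no
    // entries", which is exactly the open question raised above.
    class PlacementGroups {
        private final int numGroups;

        PlacementGroups(int numGroups) {
            this.numGroups = numGroups;
        }

        int groupOf(String ledgerName) {
            CRC32 crc = new CRC32();
            crc.update(ledgerName.getBytes(StandardCharsets.UTF_8));
            return (int) (crc.getValue() % numGroups);
        }
    }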
> >>> On Wed, Jan 16, 2013 at 1:18 PM, Flavio Junqueira <[email protected]> wrote:
> >>>
> >>>> Thanks for the great discussion so far. I have a couple of comments below:
> >>>>
> >>>> On Jan 16, 2013, at 8:00 PM, Ivan Kelly <[email protected]> wrote:
> >>>>
> >>>>> On Tue, Jan 15, 2013 at 09:33:35PM -0800, Sijie Guo wrote:
> >>>>>
> >>>>>>> Originally, it was meant to have a number of long-lived subscriptions, over which a lot of data travelled. Now the load has flipped to a large number of short-lived subscriptions, over which relatively little data travels.
> >>>>>>
> >>>>>> The topic discussed here doesn't relate to Hedwig subscriptions; it is just about how Hedwig uses ledgers to store its messages. Even if there are no subscriptions, the problem is still there. The restart of a hub server carrying a large number of topics would hit the metadata storage with many accesses. The hit comes from a hub server acquiring a topic, no matter whether the subscription is long-lived or short-lived. After a topic is acquired, the following accesses are in memory, which doesn't cause any performance issue.
> >>>>>
> >>>>> I was using topics and subscriptions to mean the same thing here, due to the use case we have at Yahoo where they're effectively the same thing. But yes, I should have said topic. My point still stands, though: Hedwig was designed to deal with fewer topics that had a lot of data passing through them, rather than more topics with very little data passing through them. This is why zk was considered sufficient at that point, as tens of thousands of topics being recovered really isn't an issue for zk. The point I was driving at is that the use case has changed in a big way, so it may require a big change to handle it.
> >>>>
> >>>> About the restart of a hub: if I understand the problem, many topics will be acquired concurrently, inducing the load spike you're mentioning. If this observation is correct, can we acquire those topics lazily instead of eagerly?
> >>>>
> >>>>>> But we should separate the capacity problem from the software problem. A high-performance and scalable metadata storage would help resolve the capacity problem, but neither implementing a new one nor leveraging a high-performance one changes the fact that it still needs that many metadata accesses to acquire a topic. A bad implementation causing that many metadata accesses is a software problem. If we have a chance to improve it, why not?
> >>>>>
> >>>>> I don't think the implementation is bad, but rather the assumptions, as I said earlier. The data:metadata ratio has changed completely. hedwig/bk were designed with a data:metadata ratio of something like 100000:1. What we're talking about now is more like 1:1, and therefore we need to be able to handle an order of magnitude more metadata than previously. Bringing down the number of writes by a factor of 2 or 3, while a nice optimisation, is just putting duct tape on the problem.
> >>>>
> >>>> I love the duct tape analogy. The number of accesses to the metadata is still linear in the number of topics, and reducing the number of accesses per acquisition does not change the complexity. Perhaps distributing the accesses over time, optimized or not, might be a good way to proceed overall. Another approach would be to group topics into buckets and have hubs acquiring buckets of topics, all topics of a bucket writing to the same ledgers. There is an obvious drawback to this approach when reading messages for a given topic.
> >>>>
> >>>>>>> The ledger can still be read many times, but you have removed the guarantee that what is read each time will be the same thing.
> >>>>>>
> >>>>>> How do we guarantee a reader's behavior when a ledger is removed at the same time? We don't guarantee it right now, right? Is it a similar thing for a 'shrink' operation, which removes part of the entries, while a 'delete' operation removes all of the entries?
> >>>>>>
> >>>>>> And if I remember correctly, readers only see the same thing when a ledger is closed. What I proposed doesn't violate this contract. If a ledger is closed (its state is CLOSED), an application can't re-open it. If a ledger isn't closed yet, an application can recover the previous state and continue writing entries using this ledger. Applications could still use the 'create-close-create' style with ledgers, or smoothly evolve to the new API for efficiency, without breaking any backward compatibility.
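As a sketch of the proposed semantics only: nothing like reopenLedger exists in BookKeeper's client API today; the interface below merely restates the contract described above in code form.

    import org.apache.bookkeeper.client.LedgerHandle;

    // Hypothetical interface illustrating the proposed re-open contract.
    // reopenLedger is NOT a real BookKeeper call; it exists only to restate
    // the discussion: re-opening recovers a non-closed ledger for further
    // appends, and must fail once the ledger state is CLOSED, so existing
    // 'create-close-create' applications keep their guarantees.
    interface ReopenableLedgers {
        LedgerHandle reopenLedger(String ledgerName, byte[] passwd) throws Exception;
    }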
> >>>>> Ah, yes, I misread your proposal originally; I thought the re-open was working with an already closed ledger.
> >>>>>
> >>>>> On a side note, the reason we have an initial write for fencing is that when the reading client (RC) fences, the servers in the ensemble start returning an error to the writing client (WC). At the moment we don't distinguish between a fencing error and an i/o error, for example. So WC will try to rebuild the ensemble by replacing the erroring servers. Before writing to the new ensemble, it has to update the metadata, and at this point it will see that it has been fenced. With a specific FENCED error, we could avoid this write. This makes me uncomfortable, though. What happens if the fenced server fails between being fenced and WC trying to write? WC will get a normal i/o error and will try to replace the server. Since the metadata has not been changed, nothing will stop it, and it may be able to continue writing. I think this is also the case for the session-fencing solution.
> >>>>>
> >>>>> -Ivan
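To spell out the ambiguity Ivan raises, here is a small sketch of the writer-side decision; the AddResult codes and handler methods are invented for illustration, not BookKeeper's internal error handling.

    // Sketch of the hazard described above. With a dedicated FENCED code the
    // writer can stop without the metadata write; but if the fenced bookie
    // crashes first, the writer only sees a plain i/o error and proceeds.
    enum AddResult { OK, FENCED, IO_ERROR }

    class WriterSketch {
        void onAddError(AddResult rc) {
            switch (rc) {
                case FENCED:
                    // The writer stops immediately, skipping the metadata
                    // update that today is what reveals the fencing.
                    stopWriting();
                    break;
                case IO_ERROR:
                    // Danger case: a fenced-then-crashed bookie looks like an
                    // ordinary failure, so the writer replaces it in the
                    // ensemble and, since the metadata never changed, may be
                    // able to continue writing.
                    replaceBookieAndRetry();
                    break;
                default:
                    break;
            }
        }

        void stopWriting() { /* elided */ }
        void replaceBookieAndRetry() { /* elided */ }
    }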
