Check comments inline...
On Fri, Jan 18, 2013 at 6:08 AM, Flavio Junqueira <[email protected]> wrote:

> One problem for me with this thread is that there are parallel discussions it refers to, and the information seems to be spread across different jiras, other threads, and private communication. I personally find it hard to be constructive and converge like that, which I think should be our goal. Can we find a better way of making progress, perhaps discussing issues separately and prioritizing them? One possibility is that we have open calls (e.g., Skype, Google Hangout) to discuss concrete issues. We should of course post our findings to the list or wiki, whichever is more convenient.

I tried my best not to let this thread get out of control. But I am not sure whether my gist is written so badly that you don't understand it, or whether you haven't taken a look at it and its references. I found that I just end up clarifying, again and again, points that are already made in the document. I was tired of that (it meant lots of context switches outside working time). I don't think any open calls would be efficient before we are on the same page, so I would step back and discuss the ledger id and its working style in another thread first.

> For the moment, I'll add some more comments trying to focus on clarifications.
>
> >> On Jan 17, 2013, at 7:58 AM, Sijie Guo <[email protected]> wrote:
> >>
> >>>> If this observation is correct, can we acquire those topics lazily instead of eagerly?
> >>>
> >>> could you add entries to a ledger without opening the ledger?
> >>
> >> By acquiring topics lazily, I meant to say that we delay topic acquisition until there is an operation for such a topic. This is a pretty standard technique to avoid load spikes.
> >>
> >> I'm sorry if you know the concept already, but to make sure we understand each other, this is the technique I'm referring to:
> >>
> >> http://en.wikipedia.org/wiki/Lazy_loading
> >>
> >> Jiannan pointed out one issue with this approach, though. Acquiring topics lazily implies that some operation triggers the acquisition of the topic. Such a trigger could be, say, a publish operation. In the case of subscriptions, however, there is no operation that could trigger the acquisition.
> >
> > You need to first check out the operations in Hedwig before talking about lazy loading. For Hedwig, most of the operations are publish and subscribe. A publish request requires a ledger to be created to add entries, so you cannot load the persistence info (ledger metadata) lazily; a subscribe request means someone needs to read entries to deliver, which requires a ledger to be opened to read entries, so you still cannot load the persistence info (ledger metadata) lazily. So for persistence info, you have no chance to do lazy loading. That is why I asked the question.
> >
> > The issue Jiannan created to work on this idea is actually something I suggested he do.
>
> I don't know which issue you're referring to. Is this a jira issue?

You said "Jiannan pointed out one issue with this approach, though.", so I assumed you already knew about the issue. But it seems that you didn't... The issue is https://issues.apache.org/jira/browse/BOOKKEEPER-518, and it is also a reference in my gist.
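To make the lazy-loading technique under discussion concrete, here is a minimal sketch in Java. LazyTopicCache, TopicMetadata, and loadFromMetadataStore are invented names for illustration; this is not Hedwig's actual code.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Lazy topic acquisition: a topic's metadata is fetched from the metadata
    // store only on the first operation that touches the topic, instead of
    // eagerly for every owned topic at hub startup.
    class LazyTopicCache {
        private final ConcurrentMap<String, TopicMetadata> topics =
            new ConcurrentHashMap<String, TopicMetadata>();

        TopicMetadata acquire(String topic) {
            TopicMetadata md = topics.get(topic);
            if (md == null) {
                // One metadata access per topic, paid only when the topic is
                // first used; this spreads the accesses over time.
                TopicMetadata loaded = loadFromMetadataStore(topic);
                TopicMetadata raced = topics.putIfAbsent(topic, loaded);
                md = (raced == null) ? loaded : raced;
            }
            return md;
        }

        private TopicMetadata loadFromMetadataStore(String topic) {
            // Placeholder for a read against ZooKeeper or another store.
            return new TopicMetadata(topic);
        }
    }

    class TopicMetadata {
        final String name;
        TopicMetadata(String name) { this.name = name; }
    }

As the reply above notes, in Hedwig the persistence info itself cannot be loaded this way, since both publish and subscribe need the ledger; only subscription metadata and ledger creation are candidates for deferral.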
> > Since you already talked about lazy loading, I would clarify the idea of lazy loading:
> >
> > 1) for a publish request triggering topic acquisition, we don't need to load the subscriptions of this topic. That is lazy loading for subscription metadata.
> > 2) for a subscribe request triggering topic acquisition, we cannot avoid loading the persistence info for the topic, but we could defer the ledger creation and the persistence info update until the next publish request. That is lazy creation of new ledgers.
> >
> > These are the only two points I could see, based on the current implementation, where we could do lazy loading. If I missed something, you could point it out.
>
> Thanks for the clarification. I don't have anything to add to the list.
>
> >>>> Another approach would be to group topics into buckets and have hubs acquiring buckets of topics, all topics of a bucket writing to the same ledgers. There is an obvious drawback to this approach when reading messages for a given topic.
> >>>
> >>> For topic ownership, we could proceed in groups, as I proposed in BOOKKEEPER-508. For topic persistence info, we should not do that; otherwise it violates the contract that Hedwig made, and you add complexity to Hedwig itself. Grouping could be considered on the application side, but if an application cannot do that, or finds it difficult, we still have to face that many topics to make Hedwig a scalable platform.
> >>
> >> I didn't mean to re-propose groups, and I'm happy you did it already. My point was simply that the complexity is still linear, and I think that to have lower complexity we would need some technique like grouping.
> >>
> >>> I don't like what you said, 'the number of accesses to the metadata is still linear on the number of topics'. The metadata access problem is an 'entity locating' problem. Two categories of solutions are used to locate an entity: one is mapping, the other is computation.
> >>
> >> We may not like it, but it is still linear. :-)
> >>
> >>> Mapping means you need to find a place to store this entity, or to store the location of this entity, like HDFS storing the locations of blocks in the NameNode. If an entity exists in this way, you need at least one access to find it, no matter whether it is stored in ZooKeeper, in memory, embedded in bookie servers, or grouped somewhere. You cannot change the 'linear' relationship between the number of accesses and the number of entities. To make it clearer: in HDFS, 'entity' means a data block; if you have an entity's location, reading data is fast, but crossing a block boundary requires a metadata lookup. Similarly, in Hedwig, 'entity' means a topic; if the topic is owned, reading messages from this topic is fast. But in either case, the number of accesses is still linear in the number of entities.
> >>
> >> Agreed, you need one access per entity. The two techniques I know of that are often used to deal efficiently with the large number of accesses to such entities are lazy loading, which spreads the accesses more over time and allows the system to make progress without waiting for all of the accesses to occur, and grouping, which makes the granularity of entities coarser, as you say.
> >
> > Agreed on the lazy loading and grouping ideas.
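To illustrate the mapping-versus-computation distinction discussed above: the 'mapping' style is an explicit table (for example, a znode per topic) and costs one metadata access per entity, while the 'computation' style derives a location from the name alone. Below is a minimal consistent-hashing sketch in Java; ConsistentHashLocator and its methods are invented names, not part of BookKeeper or Hedwig.

    import java.nio.charset.StandardCharsets;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.zip.CRC32;

    // 'Computation' style entity location: a tiny consistent-hash ring. The
    // location of an entity is computed from its name, with no lookup in any
    // metadata store. Assumes at least one server has been added.
    class ConsistentHashLocator {
        private final SortedMap<Long, String> ring = new TreeMap<Long, String>();

        void addServer(String server) {
            ring.put(hash(server), server);
        }

        // Locate an entity: hash its name, then walk clockwise on the ring
        // to the next server, wrapping around at the end.
        String locate(String entity) {
            long h = hash(entity);
            SortedMap<Long, String> tail = ring.tailMap(h);
            Long key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
            return ring.get(key);
        }

        private long hash(String s) {
            CRC32 crc = new CRC32();
            crc.update(s.getBytes(StandardCharsets.UTF_8));
            return crc.getValue();
        }
    }

A real ring would usually place several virtual nodes per server to smooth the distribution; that detail is elided here.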
> > But as I also clarified somewhere in previous emails, reducing the complexity can happen along multiple dimensions. The proposal is one dimension; although not great, it is a starting point, and we have to improve it by iterations.
>
> >>> Computation means I could compute the location of an entity based on its name, without looking it up anywhere, e.g., consistent hashing.
> >>>
> >>> Currently, BookKeeper only gives out a ledger id, so Hedwig has to find a place to record this ledger id. This is a 'mapping'. So there are two levels of mapping: one is topic name to Hedwig's metadata, the other is Hedwig metadata to ledger metadata. You cannot get rid of the 'linear' limitation without breaking the mapping between Hedwig metadata and ledger metadata.
> >>
> >> This is one kind of task that we have designed ZK for, storing metadata on behalf of the application, although you seem to have reached the limits of ZK ;-). I don't think it is bad to expect the application to store the sequence of ledgers it has written elsewhere. But again, this is one of the things that managed ledgers could do for you. It won't solve the metadata issue, but it does solve the abstraction issue.
>
> >> I don't think it is bad to expect the application to store the sequence of ledgers it has written elsewhere.
> >
> > It depends. Some applications are happy to do that, but for some applications such a style might work badly. To clarify again, I am not saying we should get rid of the ledger-id style. But as I clarified, the ledger id could be based on a ledger id generation plus ledger name solution. The proposal just provides a lower-level API for applications to optimize their performance.
> >
> > Secondly, I clarified before and will again here: storing a ledger id and rolling to a new ledger for writing not only introduces a metadata access spike during startup but also introduces a latency spike in the publish path. Applications cannot get rid of this problem if they stick to that style. The reason is the same as above: 'could you add entries before opening a ledger?'
>
> Rolling a ledger doesn't have to be in the critical path of a publish request. When we roll a ledger, is it synchronous with respect to the publish request that triggered it?
>
> > One more thing you might ask is 'why do we need to change ledgers during the life of a topic'. I would clarify here before you ask: the ledger-changing feature was added to give a topic the chance to consume ('delete') its entries. You could configure a higher number to defer the ledger change, but that doesn't change the fact that there is a latency spike when the ledger change happens.
>
> Don't you think that we can get rid of this latency spike by rolling ledgers asynchronously? I'm not too familiar with that part of the code, so you possibly have a better feeling for whether it can be done.
>
> > What I tried to propose removes any potential metadata accesses from the critical path, like publish. The proposal is friendly to consuming (deleting) entries without affecting publish latency on the critical path.
>
> Overall, trying to remove or reduce access to metadata from the critical path sounds good.
>
> >> But again, this is one of the things that managed ledgers could do for you.
> >
> > Managed ledgers bring in too many concepts, like cursors. The interface could look like managed ledgers, I agree. If the interface is one concern for you, I am OK with making a change.
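To make the asynchronous-rolling question concrete, here is a minimal sketch of creating the next ledger off the publish path. asyncCreateLedger is BookKeeper's real client call, but AsyncLedgerRoller and the swap strategy are invented for illustration; persisting the new ledger id and closing the old ledger are elided.

    import java.util.concurrent.atomic.AtomicReference;

    import org.apache.bookkeeper.client.AsyncCallback.CreateCallback;
    import org.apache.bookkeeper.client.BKException;
    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerHandle;

    // Sketch: create the next ledger asynchronously while publishes keep
    // writing to the current ledger; nothing on the publish path waits for
    // the roll to finish.
    class AsyncLedgerRoller {
        private final BookKeeper bk;
        private final AtomicReference<LedgerHandle> current =
            new AtomicReference<LedgerHandle>();

        AsyncLedgerRoller(BookKeeper bk) {
            this.bk = bk;
        }

        void rollLedger(byte[] passwd) {
            // asyncCreateLedger returns immediately; publishes continue
            // against current.get() until the callback swaps in the new
            // handle.
            bk.asyncCreateLedger(3, 2, DigestType.CRC32, passwd,
                new CreateCallback() {
                    @Override
                    public void createComplete(int rc, LedgerHandle lh, Object ctx) {
                        if (rc == BKException.Code.OK) {
                            LedgerHandle old = current.getAndSet(lh);
                            // Persist (topic -> new ledger id) off the publish
                            // path, then close `old` once its outstanding adds
                            // have completed (elided here).
                        }
                    }
                }, null);
        }
    }

Under this sort of scheme, only the swap itself is visible to publishers, which speaks to Flavio's question about whether the roll must be synchronous with the publish that triggered it.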
> I'm not suggesting we change the interface to look like managed ledgers; I'm saying that this functionality can be built on top, while keeping the abstraction BookKeeper exposes simple.
>
> >>> What I proposed is still 'linear'. But it consolidates these two pieces of metadata into one, and breaks the mapping between Hedwig metadata and ledger metadata. Although Hedwig still requires one lookup from topic name to metadata, it opens the opportunity to get rid of this lookup in the future, since we could now use the 'ledger name' to access its entries.
> >>
> >> Closing a ledger has one fundamental property, which is having consensus upon the content of a ledger. I was actually wondering how much this property matters to Hedwig; it certainly does for HDFS.
> >
> > Being closed only matters for applications that have multiple readers, which need agreement on the set of entries in a ledger. But so far, I don't see any applications using multiple readers with BookKeeper. Hedwig doesn't, and the NameNode doesn't either (if I understand correctly; please point it out if I am wrong). For most applications, the reader and the writer are actually the same one, so why do they need to close a ledger? Why can't they just re-open it and append entries? That is a more natural way to work than 'close one, create a new one, and record ledger ids to track them'.
> >
> >> it certainly does for HDFS
> >
> > I doubt your certainty about 'close' for HDFS. To my knowledge, the NameNode uses BookKeeper for failover, so the entries aren't read until a failover happens. Actually, it doesn't need to close the ledger, per my explanation above.
>
> HDFS, as far as I can tell, allows multiple standbys. I believe the documentation says that typically there is one standby, but it doesn't limit it to one, and the configuration allows for more. For Hedwig, it matters if you have different hubs reading the same ledger over time. Say that:
>
> - hub 1 creates a ledger for a topic
> - hub 2 becomes the owner of the topic, reads the (open) ledger, delivers to some subscriber, and crashes
> - hub 3 becomes the new owner of the topic, reads the (open) ledger, and also delivers to a different subscriber.
>
> If we don't have agreement, hubs 2 and 3 might disagree on the messages they see. I suppose a similar argument holds for Cloud Messaging at Yahoo! and Hubspot, but I don't know enough about Hubspot to say for sure.
>
> About re-opening being more natural: systems like ZooKeeper and HDFS have used this pattern of rolling logs; even personal computers use this pattern. At this point, it might be a matter of preference, I'm not sure, but I really don't feel it is more natural to re-open.
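For reference, the agreement that Flavio describes is what BookKeeper's recovery open provides today: a hub taking over a topic can fence the ledger so the entry set is fixed before anything is delivered. A minimal sketch against the real client API (the TakeoverSketch class and its parameters are placeholders):

    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerHandle;

    // When a new hub takes over a topic, opening the ledger in recovery mode
    // fences it: the old writer can no longer add entries, and the entry set
    // is agreed upon before anything is delivered to subscribers.
    class TakeoverSketch {
        long takeOver(BookKeeper bk, long ledgerId, byte[] passwd) throws Exception {
            // openLedger (unlike openLedgerNoRecovery) runs recovery and
            // fences the ledger against the previous writer.
            LedgerHandle lh = bk.openLedger(ledgerId, DigestType.CRC32, passwd);
            return lh.getLastAddConfirmed(); // the agreed-upon last entry id
        }
    }

Reading with openLedgerNoRecovery, by contrast, skips fencing, which is exactly where hubs 2 and 3 in the scenario above could disagree.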
> >>> An idea like 'placement groups' in Ceph could be deployed in BookKeeper in the future, grouping ledgers so that they share the same ensemble configuration. It would reduce the number of mappings between ledger name and ledger metadata, but it should be a different topic to discuss. All I said here is just to demonstrate that the proposal I made could move the improvement in the right direction.
> >>
> >> That's a good idea, although I wonder how much each group would actually share. For example, if I delete one ledger in the group, would I delete the ledger only when all other ledgers in the group are also deleted?
> >
> > I don't have a whole view of this idea yet. But the initial idea is:
> >
> > For a 'placement group', there is only one ensemble. It is a kind of super ensemble; there is no ensemble change within a placement group. So:
> >
> > 1) How to handle bookie failure without an ensemble change: we already have the write quorum and ack quorum, which help in the face of bookie failure. This is why I call it a super ensemble; the number of bookies might need to be large enough to tolerate failures while still forming an ack quorum, for some desired level of availability.
> > 2) Permanent bookie failure and re-replication: a placement group is also friendly to re-replication. Re-replication happens only among the bookies of the placement group, so it is self-organized and purely distributed. Facing a permanent bookie failure, re-replication can pick another node to replace it. To rebuild the failed bookie, we just need to ask the other bookies in the placement group for the entries they hold for this placement group; we don't need to touch any metadata.
> > 3) Garbage collection: as there is no ensemble change, there are no zombie entries, so an improved gc algorithm would work very well here.
> > 4) Actually, I am thinking we might not need to record ledger metadata at all. We could use some kind of hashing or prefixing mechanism to identify which placement group a ledger is in. The one point to consider is how to tell whether a ledger does not exist or just has no entries; if that doesn't matter, then it might work.
> >
> > These are just some initial ideas. If you are interested, we could open another thread for discussion.
>
> It would be good to discuss this, but we definitely need to separate and prioritize the discussions, otherwise it becomes unmanageable.
>
> -Flavio
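Point 4 above, identifying a placement group by computation rather than by recorded metadata, might look like the following minimal sketch. The class, the scheme, and numGroups are all invented for illustration, and the sketch inherits the open question of telling a missing ledger apart from an empty one.

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    // Sketch of deriving a ledger's placement group purely from its name, so
    // no per-ledger metadata record is needed. Note that this scheme cannot
    // distinguish "ledger does not exist" from "ledger exists but has no
    // entries", which is exactly the open question raised above.
    class PlacementGroups {
        private final int numGroups;

        PlacementGroups(int numGroups) {
            this.numGroups = numGroups;
        }

        int groupOf(String ledgerName) {
            CRC32 crc = new CRC32();
            crc.update(ledgerName.getBytes(StandardCharsets.UTF_8));
            return (int) (crc.getValue() % numGroups);
        }
    }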
> >>> On Wed, Jan 16, 2013 at 1:18 PM, Flavio Junqueira <[email protected]> wrote:
> >>>
> >>>> Thanks for the great discussion so far. I have a couple of comments below:
> >>>>
> >>>> On Jan 16, 2013, at 8:00 PM, Ivan Kelly <[email protected]> wrote:
> >>>>
> >>>>> On Tue, Jan 15, 2013 at 09:33:35PM -0800, Sijie Guo wrote:
> >>>>>
> >>>>>>> Originally, it was meant to have a number of long-lived subscriptions, over which a lot of data travelled. Now the load has flipped to a large number of short-lived subscriptions, over which relatively little data travels.
> >>>>>>
> >>>>>> The topic discussed here doesn't relate to Hedwig subscriptions; it is just about how Hedwig uses ledgers to store its messages. Even if there are no subscriptions, the problem is still there. The restart of a hub server carrying a large number of topics would hit the metadata storage with many accesses. The hit comes from a hub server acquiring a topic, no matter whether the subscription is long-lived or short-lived. After a topic is acquired, the following accesses are in memory, which doesn't cause any performance issue.
> >>>>>
> >>>>> I was using topics and subscriptions to mean the same thing here, due to the use case we have at Yahoo where they're effectively the same thing. But yes, I should have said topic. My point still stands, though: Hedwig was designed to deal with fewer topics that had a lot of data passing through them, rather than more topics with very little data passing through them. This is why zk was considered sufficient at that point, as tens of thousands of topics being recovered really isn't an issue for zk. The point I was driving at is that the use case has changed in a big way, so it may require a big change to handle it.
> >>>>
> >>>> About the restart of a hub: if I understand the problem, many topics will be acquired concurrently, inducing the load spike you're mentioning. If this observation is correct, can we acquire those topics lazily instead of eagerly?
> >>>>
> >>>>>> But we should separate the capacity problem from the software problem. A high-performance and scalable metadata storage would help resolve the capacity problem, but neither implementing a new one nor leveraging a high-performance one changes the fact that it still needs that many metadata accesses to acquire a topic. A bad implementation causing that many metadata accesses is a software problem. If we have a chance to improve it, why not?
> >>>>>
> >>>>> I don't think the implementation is bad, but rather the assumptions, as I said earlier. The data:metadata ratio has changed completely. hedwig/bk were designed with a data:metadata ratio of something like 100000:1. What we're talking about now is more like 1:1, and therefore we need to be able to handle an order of magnitude more metadata than previously. Bringing down the number of writes by a factor of 2 or 3, while a nice optimisation, is just putting duct tape on the problem.
> >>>>
> >>>> I love the duct tape analogy. The number of accesses to the metadata is still linear in the number of topics, and reducing the number of accesses per acquisition does not change the complexity. Perhaps distributing the accesses over time, optimized or not, might be a good way to proceed overall. Another approach would be to group topics into buckets and have hubs acquiring buckets of topics, all topics of a bucket writing to the same ledgers. There is an obvious drawback to this approach when reading messages for a given topic.
> >>>>
> >>>>>>> The ledger can still be read many times, but you have removed the guarantee that what is read each time will be the same thing.
> >>>>>>
> >>>>>> How do we guarantee a reader's behavior when a ledger is removed at the same time? We don't guarantee it right now, right? Is it a similar thing for a 'shrink' operation, which removes part of the entries, while a 'delete' operation removes all of the entries?
> >>>>>>
> >>>>>> And if I remember correctly, readers only see the same thing when a ledger is closed. What I proposed doesn't violate this contract. If a ledger is closed (its state is CLOSED), an application can't re-open it. If a ledger isn't closed yet, an application can recover the previous state and continue writing entries using this ledger. Applications could still use the 'create-close-create' style with ledgers, or smoothly evolve to the new API for efficiency, without breaking any backward compatibility.
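As a sketch of the proposed semantics only: nothing like reopenLedger exists in BookKeeper's client API today; the interface below merely restates the contract described above in code form.

    import org.apache.bookkeeper.client.LedgerHandle;

    // Hypothetical interface illustrating the proposed re-open contract.
    // reopenLedger is NOT a real BookKeeper call; it exists only to restate
    // the discussion: re-opening recovers a non-closed ledger for further
    // appends, and must fail once the ledger state is CLOSED, so existing
    // 'create-close-create' applications keep their guarantees.
    interface ReopenableLedgers {
        LedgerHandle reopenLedger(String ledgerName, byte[] passwd) throws Exception;
    }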
> >>>>> Ah, yes, I misread your proposal originally; I thought the re-open was working with an already closed ledger.
> >>>>>
> >>>>> On a side note, the reason we have an initial write for fencing is that when the reading client (RC) fences, the servers in the ensemble start returning an error to the writing client (WC). At the moment we don't distinguish between a fencing error and an i/o error, for example. So WC will try to rebuild the ensemble by replacing the erroring servers. Before writing to the new ensemble, it has to update the metadata, and at this point it will see that it has been fenced. With a specific FENCED error, we could avoid this write. This makes me uncomfortable, though. What happens if the fenced server fails between being fenced and WC trying to write? WC will get a normal i/o error and will try to replace the server. Since the metadata has not been changed, nothing will stop it, and it may be able to continue writing. I think this is also the case for the session-fencing solution.
> >>>>>
> >>>>> -Ivan
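To spell out the ambiguity Ivan raises, here is a small sketch of the writer-side decision; the AddResult codes and handler methods are invented for illustration, not BookKeeper's internal error handling.

    // Sketch of the hazard described above. With a dedicated FENCED code the
    // writer can stop without the metadata write; but if the fenced bookie
    // crashes first, the writer only sees a plain i/o error and proceeds.
    enum AddResult { OK, FENCED, IO_ERROR }

    class WriterSketch {
        void onAddError(AddResult rc) {
            switch (rc) {
                case FENCED:
                    // The writer stops immediately, skipping the metadata
                    // update that today is what reveals the fencing.
                    stopWriting();
                    break;
                case IO_ERROR:
                    // Danger case: a fenced-then-crashed bookie looks like an
                    // ordinary failure, so the writer replaces it in the
                    // ensemble and, since the metadata never changed, may be
                    // able to continue writing.
                    replaceBookieAndRetry();
                    break;
                default:
                    break;
            }
        }

        void stopWriting() { /* elided */ }
        void replaceBookieAndRetry() { /* elided */ }
    }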
