Let me re-clarify the points in the proposal to make them clear:

1) Change ledger id to ledger name. Giving a name is a more natural way for human beings; id generation just eases the work of naming a ledger. I don't see any optimization or efficiency we have gained in BookKeeper so far that is based on a 'long' ledger id.

2) Session fencing rather than fencing. Same working mechanism as fencing, same correctness guarantee.

3) Remove a range of entries rather than delete a ledger. Same working mechanism as the improved gc proposed in BOOKKEEPER-464.

A sketch of what such an API could look like is below.
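This is purely a hypothetical illustration of the three points (name-based ledgers, session fencing, removing an entry range); none of these interfaces or method names exist in BookKeeper today.

// Hypothetical sketch only -- not an existing BookKeeper interface.
interface NamedLedgerManager {

    // 1) Ledgers are addressed by an application-chosen name instead of a
    //    generated long id, so applications need no extra name -> id mapping.
    NamedLedgerHandle createLedger(String ledgerName, byte[] password) throws Exception;

    // 2) Session fencing: a new writer invalidates the previous writer's session
    //    (same mechanism and correctness guarantee as fencing today), recovers the
    //    previous state, and can keep appending to the same ledger.
    NamedLedgerHandle reopenForWrite(String ledgerName, byte[] password) throws Exception;
}

interface NamedLedgerHandle extends AutoCloseable {

    long addEntry(byte[] data) throws Exception;

    // 3) Remove a range of entries (e.g. already-consumed messages) rather than
    //    deleting the ledger, in the spirit of the improved gc in BOOKKEEPER-464.
    void removeEntryRange(long firstEntryId, long lastEntryId) throws Exception;
}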
On Wed, Jan 16, 2013 at 10:58 PM, Sijie Guo <[email protected]> wrote:

> > If this observation is correct, can we acquire those topics lazily instead of eagerly?
>
> Could you add entries to a ledger without opening the ledger?
>
> > Another approach would be to group topics into buckets and have hubs acquiring buckets of topics, all topics of a bucket writing to the same ledgers. There is an obvious drawback to this approach when reading messages for a given topic.
>
> For topic ownership, we could proceed in groups as I proposed in BOOKKEEPER-508. For topic persistence info, we should not do that; otherwise it violates the contract that Hedwig makes and adds complexity into Hedwig itself. Grouping could be considered on the application side, but if the application cannot do that, or finds it difficult, we still have to face that many topics to make Hedwig a scalable platform.
>
> I don't like what you said, that 'the number of accesses to the metadata is still linear on the number of topics'. The metadata access problem is an 'entity locating' problem. Two categories of solutions are used to locate an entity: one is mapping, the other is computation.
>
> Mapping means you need to find a place to store the entity, or to store the location of the entity, like the HDFS NameNode storing the locations of blocks. If an entity exists in this way, you need at least one access to find it, no matter whether it is stored in ZooKeeper, in memory, embedded in bookie servers, or grouped somewhere. You cannot change the 'linear' relationship between the number of accesses and the number of entities. To make it clearer: in HDFS an 'entity' is a data block; if you have the block's location, reading data is fast, but crossing a block boundary requires a metadata lookup. Similarly, in Hedwig an 'entity' is a topic; if the topic is owned, reading messages from that topic is fast. But in either case, the number of accesses is still linear in the number of entities.
>
> Computation means I can compute the location of an entity based on its name, without looking anything up, e.g. consistent hashing.
>
> Currently, BookKeeper only gives out a ledger id, so Hedwig has to find a place to record this ledger id. This is a 'mapping'. So there are two levels of mapping: one from topic name to Hedwig's metadata, the other from Hedwig metadata to the ledger's metadata. You cannot get rid of the 'linear' limitation w/o breaking the mapping between Hedwig metadata and ledger metadata.
>
> What I proposed is still 'linear'. But it consolidates these two pieces of metadata into one, and breaks the mapping between Hedwig metadata and ledger metadata. Although Hedwig still requires one lookup from topic name to metadata, it gives us the opportunity to get rid of this lookup in the future, since we can now use the 'ledger name' to access its entries.
>
> An idea like 'placement groups' in Ceph could be deployed in BookKeeper in the future, grouping ledgers to share the same ensemble configuration. It would reduce the number of mappings from ledger name to ledger metadata, but that should be a different topic to discuss. All I said here is just to demonstrate that the proposal I made could move the improvement in the right direction.
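To illustrate the 'consolidate the two mappings' argument above: with name-based ledgers, Hedwig could derive the ledger name deterministically from the topic name and open it directly, instead of first reading its own metadata to learn a generated ledger id. This is only a sketch on top of the hypothetical NamedLedgerManager above; the naming scheme ("hedwig/topics/" + topic) is an assumption for illustration, not an existing Hedwig convention.

// Sketch: the topic -> ledger mapping becomes a pure computation, so acquiring a
// topic needs no separate metadata read to discover a ledger id.
class TopicPersistenceSketch {

    private final NamedLedgerManager ledgerManager;  // hypothetical, from the sketch above

    TopicPersistenceSketch(NamedLedgerManager ledgerManager) {
        this.ledgerManager = ledgerManager;
    }

    // Derive the ledger name from the topic name instead of storing a generated
    // ledger id in Hedwig's own metadata.
    static String ledgerNameForTopic(String topic) {
        return "hedwig/topics/" + topic;
    }

    NamedLedgerHandle acquireTopic(String topic, byte[] password) throws Exception {
        // A single reopen recovers the previous writer's state and session-fences it;
        // no topic -> ledger-id lookup is needed beforehand.
        return ledgerManager.reopenForWrite(ledgerNameForTopic(topic), password);
    }
}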
> On Wed, Jan 16, 2013 at 1:18 PM, Flavio Junqueira <[email protected]> wrote:
>
>> Thanks for the great discussion so far. I have a couple of comments below:
>>
>> On Jan 16, 2013, at 8:00 PM, Ivan Kelly <[email protected]> wrote:
>>
>> > On Tue, Jan 15, 2013 at 09:33:35PM -0800, Sijie Guo wrote:
>> >>> Originally, it was meant to have a number of long lived subscriptions, over which a lot of data travelled. Now the load has flipped to a large number of short lived subscriptions, over which relatively little data travels.
>> >>
>> >> The topic discussed here doesn't relate to hedwig subscriptions; it is just about how hedwig uses ledgers to store its messages. Even if there are no subscriptions, the problem is still there. The restart of a hub server carrying a large number of topics would hit the metadata storage with many accesses. The hit is a hub server acquiring a topic, no matter whether the subscription is long lived or short lived. After a topic is acquired, the following accesses are in memory, which doesn't cause any performance issue.
>> >
>> > I was using topics and subscriptions to mean the same thing here due to the usecase we have in Yahoo where they're effectively the same thing. But yes, I should have said topic. But my point still stands. Hedwig was designed to deal with fewer topics which had a lot of data passing through them, rather than more topics with very little data passing through them. This is why zk was considered sufficient at that point, as tens of thousands of topics being recovered really isn't an issue for zk. The point I was driving at is that the usecase has changed in a big way, so it may require a big change to handle it.
>>
>> About the restart of a hub, if I understand the problem, many topics will be acquired concurrently, inducing the load spike you're mentioning. If this observation is correct, can we acquire those topics lazily instead of eagerly?
>>
>> >> But we should separate the capacity problem from the software problem. A high performance and scalable metadata storage would help resolve the capacity problem, but either implementing a new one or leveraging a high performance one doesn't change the fact that it still needs so many metadata accesses to acquire a topic. A bad implementation causing that many metadata accesses is a software problem. If we have a chance to improve it, why not?
>> >
>> > I don't think the implementation is bad, but rather the assumptions, as I said earlier. The data:metadata ratio has completely changed. hedwig/bk were designed with a data:metadata ratio of something like 100000:1. What we're talking about now is more like 1:1, and therefore we need to be able to handle an order of magnitude more metadata than previously. Bringing down the number of writes by a factor of 2 or 3, while a nice optimisation, is just putting duct tape on the problem.
>>
>> I love the duct tape analogy. The number of accesses to the metadata is still linear on the number of topics, and reducing the number of accesses per acquisition does not change the complexity. Perhaps distributing the accesses over time, optimized or not, might be a good way to proceed overall. Another approach would be to group topics into buckets and have hubs acquiring buckets of topics, all topics of a bucket writing to the same ledgers. There is an obvious drawback to this approach when reading messages for a given topic.
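As a concrete reading of the 'buckets of topics' idea above, and of the grouped ownership proposed in BOOKKEEPER-508: topics could be hashed into a fixed number of buckets and a hub would acquire a whole bucket at once, so ownership metadata accesses scale with the number of buckets rather than the number of topics. The class below is purely illustrative, not an existing Hedwig API.

// Illustrative sketch only: assign topics to a fixed set of buckets so a hub
// acquires ownership per bucket rather than per topic.
class TopicBucketsSketch {

    private final int numBuckets;

    TopicBucketsSketch(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    // Pure computation: no metadata lookup is needed to find a topic's bucket.
    int bucketOf(String topic) {
        return Math.floorMod(topic.hashCode(), numBuckets);
    }
}

With, say, 1024 buckets, a hub restart touches ownership metadata on the order of the bucket count rather than once per topic, although, as noted above, all topics in a bucket would then share ledgers, which complicates per-topic reads.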
>>
>> >>> The ledger can still be read many times, but you have removed the guarantee that what is read each time will be the same thing.
>> >>
>> >> How do we guarantee a reader's behavior when a ledger is removed at the same time? We don't guarantee it right now, right? It is a similar thing for a 'shrink' operation, which removes part of the entries, while a 'delete' operation removes all the entries.
>> >>
>> >> And if I remember correctly, readers only see the same thing once a ledger is closed. What I proposed doesn't violate this contract. If a ledger is closed (its state is CLOSED), an application can't re-open it. If a ledger isn't closed yet, an application can recover the previous state and continue writing entries using this ledger. Applications could still use ledgers in the 'create-close-create' style, or evolve to the new api for efficiency smoothly, w/o breaking any backward compatibility.
>> >
>> > Ah, yes, I misread your proposal originally; I thought the reopen was working with an already closed ledger.
>> >
>> > On a side note, the reason we have an initial write for fencing is that when the reading client (RC) fences, the servers in the ensemble start returning an error to the writing client (WC). At the moment we don't distinguish between a fencing error and an i/o error, for example. So WC will try to rebuild the ensemble by replacing the erroring servers. Before writing to the new ensemble, it has to update the metadata, and at this point it will see that it has been fenced. With a specific FENCED error, we could avoid this write. This makes me uncomfortable though. What happens if the fenced server fails between being fenced and WC trying to write? It will get a normal i/o error, and will try to replace the server. Since the metadata has not been changed, nothing will stop it, and it may be able to continue writing. I think this is also the case for the session fencing solution.
>> >
>> > -Ivan
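For reference, here is a rough sketch of the writer-side flow Ivan describes, using made-up minimal types rather than the real BookKeeper client classes. The point of interest is that the metadata update during an ensemble change is both the place where a fenced writer is stopped today and the write that a dedicated FENCED error would skip, which is exactly where the race he worries about lives.

// Hedged sketch with made-up types (not the real BookKeeper client classes),
// illustrating the flow described above.
class WriterSideFencingSketch {

    enum AddResult { OK, IO_ERROR, FENCED }           // FENCED is the proposed explicit error
    enum LedgerState { OPEN, IN_RECOVERY, CLOSED }

    interface Bookie {
        AddResult addEntry(long entryId, byte[] data);
    }

    interface MetadataStore {
        LedgerState currentState(String ledgerName);
        boolean casReplaceBookie(String ledgerName, Bookie failed, Bookie replacement);
    }

    void writeEntry(String ledgerName, Bookie bookie, Bookie spare,
                    MetadataStore metadata, long entryId, byte[] data) {
        AddResult result = bookie.addEntry(entryId, data);
        if (result == AddResult.OK) {
            return;
        }
        if (result == AddResult.FENCED) {
            // With an explicit FENCED error the writer stops here, skipping the
            // metadata write below.
            throw new IllegalStateException("ledger " + ledgerName + " was fenced");
        }
        // Plain i/o error: try to rebuild the ensemble. Today this metadata update is
        // where a fenced writer finds out it has been fenced. In the race Ivan
        // describes, the fenced bookie has crashed, the writer only saw IO_ERROR,
        // and if the fencing is not yet reflected in the metadata, both checks below
        // pass and nothing stops the writer from continuing.
        if (metadata.currentState(ledgerName) != LedgerState.OPEN
                || !metadata.casReplaceBookie(ledgerName, bookie, spare)) {
            throw new IllegalStateException("fenced, or lost a metadata race, on " + ledgerName);
        }
        spare.addEntry(entryId, data);
    }
}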
