First of all, I am not tying this proposal to any special case. It would improve performance (both when restarting hub servers and when changing ledgers) and solve some problems caused by the current way of working, no matter how many topics a hub server owns. It is a generic improvement that benefits every BookKeeper user. I added some references in that gist; to be clear, I am adding them here again. In particular, the latency spike was already found in BOOKKEEPER-448 during previous load testing. The proposal here removes metadata accesses from the publish request path, which is critical to most applications.
BOOKKEEPER-448: reduce publish latency when changing ledgers (https://issues.apache.org/jira/browse/BOOKKEEPER-448)
BOOKKEEPER-449: Zombie ledgers existed in Hedwig (https://issues.apache.org/jira/browse/BOOKKEEPER-449)

Secondly, improving metadata handling is not easy work; it can be approached along several dimensions and improved iteratively. This proposal is just a first step in optimizing it, and I am not saying it resolves everything. So far I don't see any side effects that would affect other solutions made in the future.

Lastly, about your comment on fencing: I am not very clear about your point here.

> What happens if the fenced server fails
> between being fenced and WC trying to write? It will get a normal i/o
> error.

WC gets a normal I/O error and enters the ensemble change logic. But it cannot complete the ensemble change, because the ledger state has already been changed from OPEN to IN_RECOVERY, so it fails. If it doesn't stop and fail, that is a bug in fencing, right? Please keep in mind that fencing is guaranteed by both the metadata CAS and the fence state in the bookie servers. We change the metadata before proceeding with any action, which guarantees that one client succeeds and the other fails. Session fencing uses a similar mechanism: one client succeeds in incrementing the session id and gains permission to proceed. If fencing works correctly, session fencing works correctly too.

-Sijie

On Wed, Jan 16, 2013 at 11:00 AM, Ivan Kelly <[email protected]> wrote:

> On Tue, Jan 15, 2013 at 09:33:35PM -0800, Sijie Guo wrote:
> > > Originally, it was meant to have a number of
> > > long lived subscriptions, over which a lot of data travelled. Now the
> > > load has flipped to a large number of short lived subscriptions, over
> > > which relatively little data travels.
> >
> > The topic discussed here doesn't relate to hedwig subscriptions, it is just
> > about how hedwig uses ledgers to store its messages.
> > Even if there are no
> > subscriptions, the problem is still there. The restart of a hub server
> > carrying a large number of topics would hit the metadata storage with many
> > accesses. The hit comes from a hub server acquiring a topic, no matter whether the
> > subscription is long lived or short lived. After a topic is acquired,
> > the following accesses are in memory, which doesn't cause any
> > performance issue.
> I was using topics and subscriptions to mean the same thing here due
> to the usecase we have in Yahoo where they're effectively the same
> thing. But yes, I should have said topic. But my point still
> stands. Hedwig was designed to deal with fewer topics, which had a lot
> of data passing through them, rather than more topics, with very
> little data passing through them. This is why zk was considered sufficient
> at that point, as tens of thousands of topics being recovered really
> isn't an issue for zk. The point I was driving at is that the usecase
> has changed in a big way, so it may require a big change to handle it.
>
> > But we should separate the capacity problem from the software problem. A
> > high performance and scalable metadata storage would help in resolving the
> > capacity problem. But either implementing a new one or leveraging a high
> > performance one doesn't change the fact that it still needs so many metadata
> > accesses to acquire a topic. A bad implementation causing so many metadata
> > accesses is a software problem. If we had a chance to improve it, why
> > not?
> I don't think the implementation is bad, but rather the assumptions,
> as I said earlier. The data:metadata ratio has changed
> completely. hedwig/bk were designed with a data:metadata ratio of
> something like 100000:1. What we're talking about now is more like 1:1
> and therefore we need to be able to handle an order of magnitude more
> of metadata than previously.
> Bringing down the number of writes by an
> order of 2 or 3, while a nice optimisation, is just putting duct tape
> on the problem.
>
> > > The ledger can still be read many times, but you have removed the
> > > guarantee that what is read each time will be the same thing.
> >
> > How do we guarantee a reader's behavior when a ledger is removed at the same
> > time? We don't guarantee it right now, right? It is a similar thing for a
> > 'shrink' operation, which removes part of the entries, while a 'delete' operation
> > removes all entries.
> >
> > And if I remember correctly, readers only see the same thing when a
> > ledger is closed. What I proposed doesn't violate this contract. If a
> > ledger is closed (state is CLOSED), an application can't re-open it. If
> > a ledger isn't closed yet, an application can recover the previous state and
> > continue writing entries using this ledger. Applications could
> > still use the 'create-close-create' style to use ledgers, or evolve to the new API
> > for efficiency smoothly, w/o breaking any backward compatibility.
> Ah, yes, I misread your proposal originally, I thought the reopen was
> working with an already closed ledger.
>
> On a side note, the reason we have an initial write for fencing, is
> that when the reading client (RC) fences, the servers in the ensemble
> start returning an error to the writing client (WC). At the moment we
> don't distinguish between a fencing error and an i/o error for
> example. So WC will try to rebuild the ensemble by replacing the
> erroring servers. Before writing to the new ensemble, it has to update
> the metadata, and at this point it will see that it has been
> fenced. With a specific FENCED error, we could avoid this write. This
> makes me uncomfortable though. What happens if the fenced server fails
> between being fenced and WC trying to write? It will get a normal i/o
> error. And will try to replace the server.
> Since the metadata has not
> been changed, nothing will stop it, and it may be able to continue
> writing. I think this is also the case for the session fencing solution.
>
> -Ivan
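
To make the fencing argument in the reply at the top of this message concrete, here is a minimal sketch of the "metadata CAS first" ordering. It is an illustration under stated assumptions, not BookKeeper code: the names LedgerMetadataStore, compare_and_set, start_recovery and try_ensemble_change are hypothetical, the store is an in-memory stand-in for the real metadata storage, and only the metadata side is modelled (the fence state kept in the bookie servers is left out).

```python
class BadVersion(Exception):
    """Raised when a compare-and-set finds a version other than the expected one."""


class LedgerMetadataStore:
    """Hypothetical in-memory stand-in for a versioned ledger metadata store."""

    def __init__(self, ensemble):
        self.version = 0
        self.state = "OPEN"
        self.ensemble = list(ensemble)

    def read(self):
        # Return a snapshot: (version, state, copy of the ensemble).
        return self.version, self.state, list(self.ensemble)

    def compare_and_set(self, expected_version, state, ensemble):
        # The update is applied only if nobody changed the metadata
        # since the caller last read it.
        if expected_version != self.version:
            raise BadVersion(
                f"expected version {expected_version}, found {self.version}"
            )
        self.version += 1
        self.state = state
        self.ensemble = list(ensemble)
        return self.version


def start_recovery(store):
    """Recovering client: move the ledger to IN_RECOVERY before doing anything else."""
    version, _state, ensemble = store.read()
    return store.compare_and_set(version, "IN_RECOVERY", ensemble)


def try_ensemble_change(store, cached_version, cached_ensemble, failed, replacement):
    """Writing client: replace a failed bookie using the version it last saw.

    Returns True if the writer may keep writing, False if it must stop.
    """
    new_ensemble = [replacement if b == failed else b for b in cached_ensemble]
    try:
        store.compare_and_set(cached_version, "OPEN", new_ensemble)
        return True
    except BadVersion:
        # The metadata changed underneath the writer, so it has been fenced.
        return False


store = LedgerMetadataStore(["bookie1", "bookie2", "bookie3"])

# WC caches the metadata while the ledger is still OPEN.
wc_version, _, wc_ensemble = store.read()

# Another client starts recovery: the metadata CAS happens first.
start_recovery(store)

# bookie2 now fails with a plain I/O error, so WC attempts an ensemble change.
writer_may_continue = try_ensemble_change(
    store, wc_version, wc_ensemble, failed="bookie2", replacement="bookie4"
)
print(writer_may_continue)  # False
print(store.read())         # (1, 'IN_RECOVERY', ['bookie1', 'bookie2', 'bookie3'])
```

In this model the writer is stopped by the failed CAS even though it never saw a FENCED error from a bookie, which is the behaviour the reply relies on. The sketch only holds if the metadata is changed before recovery proceeds; it says nothing about a design in which that metadata write is skipped.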
