Thanks Jiannan for organizing and sharing the discussions.

Currently BookKeeper cannot handle id conflicts perfectly, especially case 3 that Flavio mentioned. So we rely heavily on the id generator implementation to produce unique ledger ids, which means id generation requires a centralized place (e.g. ZooKeeper) to track or coordinate it. It is not perfect, but what Jiannan proposed is better. And since Jiannan has provided the patch to separate id generation from the ledger manager, it will help the community evaluate better and faster id generation algorithms. So let's improve it iteratively.
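To make the batch proposal quoted below concrete, here is a minimal sketch of range-based id allocation. It is only an illustration under stated assumptions: `CentralCounter` is a hypothetical in-memory stand-in for the ZooKeeper znode counter, and `BatchIdGenerator`, `reserve` and `next_id` are made-up names, not anything from the BOOKKEEPER-438 patch.

```python
import threading


class CentralCounter:
    """Hypothetical stand-in for a ZooKeeper znode used as a shared counter."""

    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()
        self.writes = 0  # how many times the central store was updated

    def reserve(self, batch_size):
        """Atomically hand out the half-open id range [start, end)."""
        with self._lock:
            start = self._next
            self._next += batch_size
            self.writes += 1
            return start, self._next


class BatchIdGenerator:
    """Client-side generator: one central write per batch_size ids."""

    def __init__(self, counter, batch_size=10000):
        self._counter = counter
        self._batch_size = batch_size
        self._next = 0
        self._end = 0  # empty range, so the first call reserves a batch
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            if self._next >= self._end:
                # Local range exhausted: contact the central counter once.
                self._next, self._end = self._counter.reserve(self._batch_size)
            ledger_id = self._next
            self._next += 1
            return ledger_id
```

With two clients sharing one counter, ids stay unique across clients and increase within each client, but ids left over in a reserved range are lost if that client goes away, so the id sequence can have gaps.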
> ledger metadata management.

Thinking more about the radix tree Jiannan proposed, I have several comments:

1) How do you delete a ledger? Your implementation needs to take care to distinguish a leaf znode from an inner znode.
2) Take care to iterate the tree in order, since the Scan And Compare gc algorithm relies on the order. If we move to the improved gc algorithm, this would not be a problem. Maybe we need to start pushing the changes for the improved gc algorithm into trunk after 4.2.0 is released.
3) Could ZooKeeper#getChildren carry 8192 children? Flavio, do you have any numbers on it? Or maybe we need some experiments to ensure it works.
4) If this change is to be applied to HierarchicalLedgerManager, it would be better to consider how to make the transition smooth, since the layout is organized differently.

From the above comments, I would suggest that you start this work as a new ledger manager w/o affecting the other ledger managers, and apply the idea back to HierarchicalLedgerManager later if possible.

-Sijie

On Thu, Jan 17, 2013 at 10:17 AM, Jiannan Wang <[email protected]> wrote:

> Hello all,
>
> Currently, ledger id generation is implemented with a zookeeper
> (persistent/ephemeral) sequential node to make a globally unique id.
> In detail, in the code:
>   - FlatLedgerManager requires a write on zookeeper
>   - HierarchicalLedgerManager and MSLedgerManagerFactory use the same
>     approach, which includes a write and a delete operation to zookeeper.
> Obviously, this ledger id generation process is too heavy, since all we
> want is a globally unique id. Also, JIRA
> BOOKKEEPER-421<https://issues.apache.org/jira/browse/BOOKKEEPER-421>
> shows that the current ledger id space is limited to 32 bits by the
> cversion (int type) in the zookeeper node. So we need to enlarge the
> ledger id space to 64 bits.
>
> Then there are two questions:
>   1. How to generate a 64-bit globally unique id?
>   2. How to maintain the metadata for 64-bit ledger ids in zookeeper?
>      (The current 2-4-4 split for ledger ids is certainly not suitable;
>      see HierarchicalLedgerManager.)
>
> --------------I'm a split line for 64 bits ledger id generation-----------------------------
>
> For 64-bit globally unique id generation, Flavio, Ivan, Sijie and I had a
> discussion by mail. Here are the two proposals:
>   1. Let the client generate the id itself (Ivan proposed): leverage the
>      zookeeper session id as the unique part, with the client maintaining
>      a counter in memory, so the id would be {session_id}{counter}.
>   2. Batch id generation (Jiannan proposed): use a zookeeper znode as a
>      counter to track generated ids. The client asks zookeeper for a
>      counter range; after that, id generation proceeds locally w/o
>      contacting zookeeper.
>
> For proposal 1, performance would be very good since generation is
> entirely local. But Sijie has one concern: "in reality, it seems that
> it doesn't work. zookeeper session id is long, while ledger id is long, you
> could not put session id as part of ledger id. otherwise, it would cause id
> conflict..".
> Flavio and Ivan then suggested that perhaps we could simply use a
> procedure similar to the one ZooKeeper uses to generate and increment
> session ids. But Sijie pointed out that this process in zookeeper includes
> the current system timestamp, which may exhaust the 64-bit id space
> quickly. Flavio is also thinking of reusing ledger identifiers, but he
> notes that there are three scenarios if we reuse a ledger identifier:
>   1- The previous ledger still exists and its metadata is stored. In
>      this case, we can detect it when trying to create the metadata for
>      the new ledger;
>   2- The previous ledger has been fully deleted (metadata + ledger
>      fragments);
>   3- Metadata for the previous ledger has been deleted, but the ledger
>      fragments haven't.
> Flavio: "Case 1 can be easily detected, while case 2 causes no problem
> at all.
> Case 3 is the problematic one, but I can't remember whether it can
> happen or not given the way we do garbage collection currently. I need to
> review how we do it, but in the case scenario 3 can happen, we could have
> the ledger writers using different master keys, which would cause the
> bookie to return an error when trying to write to a ledger that already
> exists."
>
> Proposal 2 still requires access to zookeeper, but the write frequency
> could be quite low once we set a large batch size (like 10000).
>
> In summary, proposal 1 aims to generate a UUID/GUID-like id in a 64-bit
> space, but the possibility of conflicts should be taken into account, and
> if the generated ids are not monotonic we must handle case 3 listed
> above. Proposal 2 has no problem generating monotonic ids quickly, but
> the process involves zookeeper.
> By the way, I've submitted a patch in
> BOOKKEEPER-438<https://issues.apache.org/jira/browse/BOOKKEEPER-438>
> to move ledger id generation out of LedgerManager, and I'll add a conf
> setting in another JIRA to let the bookkeeper client customize its own id
> generation scheme. I'd appreciate it if anyone could help review the patch
> (thanks to Sijie first).
>
> --------------I'm a split line for 64 bits ledger id metadata management-----------------------------
>
> HierarchicalLedgerManager uses a 2-4-4 style to split the current 10-char
> ledger id. E.g. ledger 0000000001 is split into 3 parts 00, 0000, 0001 and
> stored in the zookeeper path "(ledgersRootPath)/00/0000/L0001". So each
> znode has at most 10000 ledgers, which avoids errors during garbage
> collection due to lists of children that are too long.
> After we enlarge the ledger id space to 64 bits, managing large ledger
> ids becomes a big problem.
>
> My idea is to split the ledger id into digits of radix 2^13 = 8192 and
> then organize them in a radix tree.
> For example, with ledger ids 2, 5, and 41093 (== 5*8192+133), the tree
> in zookeeper would be:
>
>           (ledger id root)
>            /            \
>       2 (meta)        5 (meta)
>                          \
>                        133 (meta)
>
> So there will be at most 8192 children under each znode, and the depth
> is at most 5 (64/13 rounded up).
> Note that inner znodes will also record metadata, so if ledger ids are
> generated in increasing order, the depth of this radix tree only grows as
> needed. And I guess it can ideally handle all 2^64 ledger ids.
>
> Speaking of metadata, I would like to share a test result from the past
> two days. For HierarchicalLedgerManager, we observed that one ledger's
> metadata consumes 700+ bytes in zookeeper; this may be because
> LedgerMetadata.serialize() uses a pure text format. But the data size in
> the ledger id node is only 300+ bytes, and I guess the extra space is
> taken by the overhead of the inner hierarchical nodes. What's more, the
> memory a topic consumes is 2k with only 1 subscriber and no pub: there is
> no metadata for topic ownership (since we now use consistent hashing for
> topic ownership), and the metadata sizes for subscription and persistence
> are both 8 bytes. I'll investigate more and then start a new thread on it.
>
> Best,
> Jiannan
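
The radix-8192 layout Jiannan describes can be sketched as a mapping from ledger id to znode path. This is only an illustration, not code from any patch: the function names, the `/ledgers` root, and the choice of unpadded decimal znode names are all assumptions made for the example.

```python
RADIX_BITS = 13
RADIX = 1 << RADIX_BITS  # 8192 children at most under each znode


def ledger_id_to_digits(ledger_id):
    """Split a 64-bit ledger id into radix-8192 digits, most significant first."""
    if not 0 <= ledger_id < (1 << 64):
        raise ValueError("ledger id must fit in 64 bits")
    digits = []
    while True:
        digits.append(ledger_id % RADIX)
        ledger_id //= RADIX
        if ledger_id == 0:
            break
    digits.reverse()
    return digits


def ledger_id_to_path(root, ledger_id):
    """Build the znode path; every digit becomes one level of the tree."""
    return root + "/" + "/".join(str(d) for d in ledger_id_to_digits(ledger_id))


# Hypothetical root path, used only for this example.
print(ledger_id_to_path("/ledgers", 2))      # /ledgers/2
print(ledger_id_to_path("/ledgers", 41093))  # /ledgers/5/133
```

Because leading zero digits are dropped, small ids sit near the root and an inner znode such as `5` is itself a ledger, which is why deletion has to tell leaf znodes from inner ones. The largest 64-bit id yields five digits, matching the depth bound above.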
