Hello all,
   Currently, ledger id generation relies on a ZooKeeper sequential node 
(persistent or ephemeral) to produce a globally unique id. In detail:
      - FlatLedgerManager requires one write to ZooKeeper;
      - HierarchicalLedgerManager and MSLedgerManagerFactory use the same 
approach, which requires a write and a delete operation on ZooKeeper.
   This process is too heavyweight, since all we need is a globally unique 
id. Moreover, as JIRA 
BOOKKEEPER-421<https://issues.apache.org/jira/browse/BOOKKEEPER-421> shows, 
the current ledger id space is limited to 32 bits by the cversion (an int) of 
the ZooKeeper parent node. So we need to enlarge the ledger id space to 64 bits.
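   The 32-bit limit can be illustrated with a small simulation. It assumes 
the behaviour documented in the ZooKeeper Programmer's Guide: the sequence 
number appended to a sequential node comes from a signed 4-byte counter kept 
by the parent node and is printed as a zero-padded 10-digit number. 
SequentialCounterSim is a hypothetical name; no ZooKeeper client is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical simulation of how a parent znode hands out sequential suffixes. */
class SequentialCounterSim {
    // The parent's counter is a signed 32-bit int, as described in the ZooKeeper docs.
    private int counter;

    SequentialCounterSim(int start) {
        this.counter = start;
    }

    /** Returns the suffix that would be appended, then advances the counter. */
    String nextSuffix() {
        String suffix = String.format(Locale.ROOT, "%010d", counter);
        counter++; // silently wraps from Integer.MAX_VALUE to Integer.MIN_VALUE
        return suffix;
    }

    public static void main(String[] args) {
        SequentialCounterSim sim = new SequentialCounterSim(Integer.MAX_VALUE - 1);
        List<String> suffixes = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            suffixes.add(sim.nextSuffix());
        }
        System.out.println(suffixes); // [2147483646, 2147483647, -2147483648]
    }
}
```

Once the counter passes Integer.MAX_VALUE the suffix turns negative, so ids 
derived from it are capped at 2^31 - 1 non-negative values.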

   This raises two questions:
      1. How do we generate a 64-bit globally unique id?
      2. How do we maintain the metadata for 64-bit ledger ids in ZooKeeper? 
(Clearly, the current 2-4-4 split of the ledger id used by 
HierarchicalLedgerManager is not suitable.)

--------------I'm a split line for 64 bits ledger id generation-----------------------------

For 64-bit globally unique id generation, Flavio, Ivan, Sijie and I had a 
discussion by mail, which produced two proposals:
   1. Let the client generate the id itself (proposed by Ivan): use the 
ZooKeeper session id as the unique part and have the client maintain a counter 
in memory, so the id would be {session_id}{counter}.
   2. Batch id generation (proposed by Jiannan): use a znode as a counter to 
track generated ids. The client asks ZooKeeper for a range of the counter; 
after that, ids are generated locally without contacting ZooKeeper.

   For proposal 1, performance would be excellent since generation is entirely 
local. But Sijie raised a concern: a ZooKeeper session id is already a 64-bit 
long, and so is a ledger id, so the session id cannot be embedded as part of 
the ledger id; keeping only part of it could cause id conflicts.
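   A sketch of Sijie's concern with proposal 1. The bit layout is an 
assumption made only for illustration (hypothetical class 
SessionCounterIdSketch, 24 counter bits, 40 session-id bits): since both ids 
are 64-bit longs, any {session_id}{counter} packing must drop some session-id 
bits, and two sessions that differ only in the dropped bits produce the same 
ledger ids.

```java
/**
 * Hypothetical {session_id}{counter} layout: a 64-bit ledger id has room for only
 * part of the 64-bit session id, so this sketch keeps its low SESSION_BITS bits.
 */
class SessionCounterIdSketch {
    static final int COUNTER_BITS = 24;                // assumption: 2^24 ids per session
    static final int SESSION_BITS = 64 - COUNTER_BITS; // only 40 of the 64 session-id bits fit
    static final long COUNTER_MASK = (1L << COUNTER_BITS) - 1;
    static final long SESSION_MASK = (1L << SESSION_BITS) - 1;

    private final long sessionPart;
    private long counter; // per-client, in-memory counter

    SessionCounterIdSketch(long zkSessionId) {
        // Truncation: the high 24 bits of the session id are thrown away.
        this.sessionPart = zkSessionId & SESSION_MASK;
    }

    long nextId() {
        if (counter > COUNTER_MASK) {
            throw new IllegalStateException("counter exhausted for this session");
        }
        return (sessionPart << COUNTER_BITS) | counter++;
    }

    public static void main(String[] args) {
        // Two distinct session ids that differ only in their top byte.
        long sessionA = 0x01000000_00001234L;
        long sessionB = 0x02000000_00001234L;
        long idA = new SessionCounterIdSketch(sessionA).nextId();
        long idB = new SessionCounterIdSketch(sessionB).nextId();
        System.out.println(idA == idB); // true: the ids conflict
    }
}
```

Choosing a different split only moves the problem: fewer counter bits keep 
more of the session id but allow fewer ledgers per session.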
   Flavio and Ivan then suggested that we could simply use a procedure similar 
to the one ZooKeeper uses to generate and increment session ids. But Sijie 
pointed out that this procedure incorporates the current system timestamp, 
which may exhaust the 64-bit id space quickly. Flavio also considered reusing 
ledger identifiers, but he noted that there are three scenarios if we reuse a 
ledger identifier:
      1- The previous ledger still exists and its metadata is stored. In this 
case, we can detect it when trying to create the metadata for the new ledger;
      2- The previous ledger has been fully deleted (metadata + ledger 
fragments);
      3- Metadata for the previous ledger has been deleted, but the ledger 
fragments haven't.
   Flavio: "Case 1 can be easily detected, while case 2 causes no problem at 
all. Case 3 is the problematic one, but I can't remember whether it can happen 
or not given the way we do garbage collection currently. I need to review how 
we do it, but in case scenario 3 can happen, we could have the ledger 
writers using different master keys, which would cause the bookie to return an 
error when trying to write to a ledger that already exists."
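   Flavio's case 3 relies on the bookie rejecting a writer whose master key 
differs from the one already associated with the ledger. Below is a minimal 
in-memory sketch of that check, assuming the bookie remembers the first master 
key it sees per ledger; MasterKeyGuardSketch and addEntry are hypothetical 
names, not the bookie's actual API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bookie-side master-key check (not the real bookie API). */
class MasterKeyGuardSketch {
    // ledger id -> master key the bookie saw first for that ledger
    private final Map<Long, byte[]> masterKeys = new HashMap<>();

    /** Accepts the write if the ledger is new or the key matches; rejects it otherwise. */
    synchronized boolean addEntry(long ledgerId, byte[] masterKey) {
        byte[] known = masterKeys.putIfAbsent(ledgerId, masterKey.clone());
        return known == null || Arrays.equals(known, masterKey);
    }

    public static void main(String[] args) {
        MasterKeyGuardSketch bookie = new MasterKeyGuardSketch();
        byte[] oldWriterKey = "old-writer".getBytes(StandardCharsets.UTF_8);
        byte[] newWriterKey = "new-writer".getBytes(StandardCharsets.UTF_8);
        System.out.println(bookie.addEntry(42L, oldWriterKey)); // true: first writer of ledger 42
        // Scenario 3: ledger 42's metadata is gone and its id is reused, but the
        // bookie still holds the old fragments and the old master key.
        System.out.println(bookie.addEntry(42L, newWriterKey)); // false: rejected
    }
}
```

The rejection is what turns an id reuse in scenario 3 into a visible error 
instead of silently mixing entries from two different ledgers.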

   Proposal 2 still requires access to ZooKeeper, but the write frequency 
becomes quite low with a large batch size (e.g. 10000).
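   Proposal 2 can be sketched as follows. CounterZnodeStub stands in for the 
counter znode; a real implementation would presumably read the znode, write 
the advanced value back with a versioned setData and retry on a version 
conflict, which this in-memory stub replaces with a synchronized method. Both 
class names are hypothetical.

```java
/** Hypothetical stand-in for the counter znode. */
class CounterZnodeStub {
    private long next;        // next unreserved id
    private int reserveCalls; // how many times a client had to "contact ZooKeeper"

    /** Atomically reserves [start, start + size) and returns start. */
    synchronized long reserve(long size) {
        reserveCalls++;
        long start = next;
        next += size;
        return start;
    }

    synchronized int reserveCalls() {
        return reserveCalls;
    }
}

/** Hands out ids from a locally reserved range; goes back to the counter only when it runs out. */
class BatchLedgerIdGenerator {
    private final CounterZnodeStub counter;
    private final long batchSize;
    private long nextId;   // next id to hand out locally
    private long rangeEnd; // exclusive end of the currently reserved range

    BatchLedgerIdGenerator(CounterZnodeStub counter, long batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.counter = counter;
        this.batchSize = batchSize;
    }

    synchronized long nextId() {
        if (nextId >= rangeEnd) {
            // Range exhausted (or never reserved): one round trip to the counter.
            nextId = counter.reserve(batchSize);
            rangeEnd = nextId + batchSize;
        }
        return nextId++;
    }

    public static void main(String[] args) {
        CounterZnodeStub zk = new CounterZnodeStub();
        BatchLedgerIdGenerator clientA = new BatchLedgerIdGenerator(zk, 10000);
        BatchLedgerIdGenerator clientB = new BatchLedgerIdGenerator(zk, 10000);
        System.out.println(clientA.nextId()); // 0
        System.out.println(clientB.nextId()); // 10000
        for (int i = 0; i < 25000; i++) {
            clientA.nextId();
        }
        System.out.println(zk.reserveCalls()); // 4
    }
}
```

With a batch size of 10000, ZooKeeper is contacted once per 10000 ids. Ids 
handed out by one client increase monotonically and ranges reserved by 
different clients never overlap, although ids from different clients are not 
ordered by creation time; ids left unused in a range when a client exits are 
simply skipped.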

   In summary, proposal 1 aims to generate a UUID/GUID-like id in the 64-bit 
space, but the possibility of conflicts must be taken into account, and if the 
generated ids are not monotonic we must handle case 3 above. Proposal 2 
generates monotonic ids quickly, but the process involves ZooKeeper.
   By the way, I've submitted a patch in 
BOOKKEEPER-438<https://issues.apache.org/jira/browse/BOOKKEEPER-438> to move 
ledger id generation out of LedgerManager, and I'll add a configuration setting 
in another JIRA so that BookKeeper clients can plug in their own id generation 
scheme. I'd appreciate it if anyone could help review the patch (and thanks to 
Sijie first of all).
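   To illustrate what a client-side setting for a custom id generator could 
look like: the interface LedgerIdGen, the factory and the config key 
ledgerIdGeneratorClass below are all hypothetical, and the actual names in the 
BOOKKEEPER-438 patch and the follow-up JIRA may differ.

```java
import java.util.Properties;

/** Hypothetical pluggable interface; the real one may look different. */
interface LedgerIdGen {
    long nextLedgerId();
}

/** Trivial in-process implementation, for illustration only (not globally unique). */
class LocalCounterIdGenerator implements LedgerIdGen {
    private long next;

    @Override
    public synchronized long nextLedgerId() {
        return next++;
    }
}

final class LedgerIdGenFactory {
    // Hypothetical config key.
    static final String GENERATOR_CLASS_KEY = "ledgerIdGeneratorClass";

    private LedgerIdGenFactory() {}

    /** Instantiates the generator class named in the config through its no-arg constructor. */
    static LedgerIdGen fromConfig(Properties conf) {
        String className = conf.getProperty(GENERATOR_CLASS_KEY);
        if (className == null) {
            throw new IllegalArgumentException("missing " + GENERATOR_CLASS_KEY);
        }
        try {
            return Class.forName(className)
                    .asSubclass(LedgerIdGen.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(GENERATOR_CLASS_KEY, LocalCounterIdGenerator.class.getName());
        LedgerIdGen gen = fromConfig(conf);
        System.out.println(gen.nextLedgerId()); // 0
    }
}
```

Loading by class name keeps the client free of compile-time dependencies on 
any particular scheme, so either proposal above could be shipped as a 
separate implementation.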

--------------I'm a split line for 64 bits ledger id metadata management-----------------------------

   HierarchicalLedgerManager uses a 2-4-4 split of the current 10-character 
ledger id: e.g., ledger 0000000001 is split into 3 parts, 00, 0000 and 0001, 
and stored at the ZooKeeper path "(ledgersRootPath)/00/0000/L0001". So each 
znode has at most 10000 children, which avoids errors during garbage collection 
caused by overly long lists of children.
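   For reference, the 2-4-4 layout described above can be reproduced in a few 
lines. This is a sketch written from the description (hypothetical class 
HierarchicalPathSketch), not code taken from HierarchicalLedgerManager; it 
rejects ids that need more than 10 decimal digits, which is exactly why the 
scheme cannot cover 64-bit ids.

```java
import java.util.Locale;

final class HierarchicalPathSketch {
    private HierarchicalPathSketch() {}

    /** Builds "(root)/XX/XXXX/LXXXX" from a ledger id of at most 10 decimal digits. */
    static String ledgerPath(String ledgersRootPath, long ledgerId) {
        if (ledgerId < 0 || ledgerId > 9_999_999_999L) {
            throw new IllegalArgumentException("2-4-4 only covers 10 decimal digits: " + ledgerId);
        }
        String id = String.format(Locale.ROOT, "%010d", ledgerId); // zero-pad to 10 characters
        return ledgersRootPath
                + "/" + id.substring(0, 2)  // first level: at most 100 children under the root
                + "/" + id.substring(2, 6)  // second level: at most 10000 children per parent
                + "/L" + id.substring(6);   // leaf level: at most 10000 ledgers per parent
    }

    public static void main(String[] args) {
        System.out.println(ledgerPath("/ledgers", 1L));          // /ledgers/00/0000/L0001
        System.out.println(ledgerPath("/ledgers", 1234567890L)); // /ledgers/12/3456/L7890
    }
}
```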
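   Once the ledger id space is enlarged to 64 bits, this layout no longer 
works: a 64-bit ledger id does not fit in 10 decimal characters, so managing 
metadata for large ledger ids becomes a real problem.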

   My idea is to split the ledger id into digits of radix 2^13 = 8192 and 
organize them as a radix tree. For example, for ledger ids 2, 5 and 
41093 (= 5 x 8192 + 133), the tree in ZooKeeper would be:
         (ledger id root)
            /      \
        2 (meta)   5 (meta)
                     \
                  133 (meta)
   So there will be at most 8192 children under each znode, and the depth is 
at most ceil(64/13) = 5.
   Note that inner znodes also store metadata (ledger 5 above is both a ledger 
and the parent of 133), so if ledger ids are generated incrementally, the depth 
of the radix tree only grows as needed. Ideally, it can handle all 2^64 ledger 
ids.
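   A sketch of the proposed radix-8192 mapping from a ledger id to a znode 
path. One assumption goes beyond the proposal: the id is treated as an unsigned 
64-bit value (via Long.divideUnsigned and Long.remainderUnsigned), so that all 
2^64 ids, including those a Java long shows as negative, map to a path of at 
most 5 levels below the root. RadixLedgerPathSketch is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class RadixLedgerPathSketch {
    static final int RADIX_BITS = 13;
    static final long RADIX = 1L << RADIX_BITS; // at most 8192 children per znode
    static final int MAX_DEPTH = (64 + RADIX_BITS - 1) / RADIX_BITS; // ceil(64/13) = 5

    private RadixLedgerPathSketch() {}

    /** Base-8192 digits of the id (unsigned), most significant first, no leading zeros. */
    static List<Long> digits(long ledgerId) {
        Deque<Long> stack = new ArrayDeque<>();
        long rest = ledgerId;
        do {
            stack.push(Long.remainderUnsigned(rest, RADIX)); // least significant digit first
            rest = Long.divideUnsigned(rest, RADIX);
        } while (rest != 0);
        return new ArrayList<>(stack); // iteration order of the deque = most significant first
    }

    /** One znode level per digit; inner znodes are ledgers too. */
    static String ledgerPath(String rootPath, long ledgerId) {
        StringBuilder path = new StringBuilder(rootPath);
        for (long digit : digits(ledgerId)) {
            path.append('/').append(digit);
        }
        return path.toString();
    }

    public static void main(String[] args) {
        System.out.println(ledgerPath("/ledgers", 2L));     // /ledgers/2
        System.out.println(ledgerPath("/ledgers", 41093L)); // /ledgers/5/133
        System.out.println(digits(-1L).size());            // 5 (the largest unsigned id)
    }
}
```

For the example tree, ledger 5 lives at "/ledgers/5" and ledger 41093 at 
"/ledgers/5/133", so ledger 5's znode is also the parent of ledger 41093's 
znode, as in the diagram.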

   Speaking of metadata, I would like to share a test result from the past two 
days. For HierarchicalLedgerManager, we observed that a ledger's metadata 
consumes 700+ bytes in ZooKeeper, possibly because LedgerMetadata.serialize() 
uses a plain-text format. But the data stored in the ledger id node is only 
300+ bytes, and I guess the extra space is taken by the overhead of the inner 
hierarchical nodes. What's more, a topic consumes 2 KB of memory with only 1 
subscriber and no publishes: there is no metadata for topic ownership (since we 
now use consistent hashing for topic ownership), and the metadata for 
subscription and persistence is only 8 bytes each. I'll investigate further and 
then start a new thread on it.


Best,
Jiannan
