Hello all,
Currently, ledger id generation relies on ZooKeeper (persistent or ephemeral)
sequential nodes to produce a globally unique id. In detail:
- FlatLedgerManager requires one write to ZooKeeper.
- HierarchicalLedgerManager and MSLedgerManagerFactory use the same approach,
which involves a write and a delete operation on ZooKeeper.
This id generation process is too heavy, since all we want is a globally
unique id. In addition, the JIRA
BOOKKEEPER-421<https://issues.apache.org/jira/browse/BOOKKEEPER-421> shows that
the current ledger id space is limited to 32 bits by the cversion (int type) of
the ZooKeeper node. So we need to enlarge the ledger id space to 64 bits.
This raises two questions:
1. How do we generate a 64-bit globally unique id?
2. How do we maintain the metadata for 64-bit ledger ids in ZooKeeper?
(Clearly, the current 2-4-4 split of the ledger id is not suitable; see
HierarchicalLedgerManager.)
--------------I'm a split line for 64-bit ledger id generation--------------
For 64-bit globally unique id generation, Flavio, Ivan, Sijie and I discussed
the options by mail. Here are the two proposals:
1. Let the client generate the id itself (proposed by Ivan): use the ZooKeeper
session id as the unique part, and have the client maintain a counter in
memory. The id would then be {session_id}{counter}.
2. Batch id generation (proposed by Jiannan): use a ZooKeeper znode as a counter
to track generated ids. The client asks ZooKeeper for a counter range; after
that, id generation proceeds locally without contacting ZooKeeper until the
range is used up.
For proposal 1, performance would be very good since ids are generated entirely
locally. But Sijie has one concern: "in reality, it seems that it
doesn't work. zookeeper session id is long, while ledger id is long, you could
not put session id as part of ledger id. otherwise, it would cause id
conflict..".
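To make Sijie's concern concrete, here is a minimal Java sketch. The 48/16 bit layout and the class name SessionIdPacking are hypothetical and only for illustration; nobody proposed this exact layout. The point is that the session id is itself a 64-bit long, so any {session_id}{counter} layout that fits in a long has to drop some session id bits, and two distinct sessions can then collide:

```java
// Illustration only: the 48/16 split below is an assumed layout, not a proposal.
final class SessionIdPacking {
    static final int COUNTER_BITS = 16;
    static final long COUNTER_MASK = (1L << COUNTER_BITS) - 1;

    // Packs the low 48 bits of the session id with a 16-bit counter.
    // The top 16 bits of the session id are shifted out and lost.
    static long pack(long sessionId, int counter) {
        return (sessionId << COUNTER_BITS) | (counter & COUNTER_MASK);
    }

    public static void main(String[] args) {
        long sessionA = 0x0001000000000042L;
        long sessionB = 0x0002000000000042L; // differs from sessionA only in the top 16 bits
        // Both lines print the same value: two different sessions, one ledger id.
        System.out.println(Long.toHexString(pack(sessionA, 7)));
        System.out.println(Long.toHexString(pack(sessionB, 7)));
    }
}
```

Whether such a collision is likely in practice depends on which session id bits actually vary, which this sketch does not model.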
Flavio and Ivan then suggested that we could simply use a procedure similar to
the one ZooKeeper uses to generate and increment session ids. But Sijie pointed
out that the ZooKeeper procedure includes the current system timestamp, which
may exhaust the 64-bit id space quickly. Flavio is also thinking of reusing
ledger identifiers, and he notes that there are three scenarios if we reuse a
ledger identifier:
1- The previous ledger still exists and its metadata is stored. In this
case, we can detect it when trying to create the metadata for the new ledger;
2- The previous ledger has been fully deleted (metadata + ledger
fragments);
3- Metadata for the previous ledger has been deleted, but the ledger
fragments haven't.
Flavio: "Case 1 can be easily detected, while case 2 causes no problem at
all. Case 3 is the problematic one, but I can't remember whether it can happen
or not given the way we do garbage collection currently. I need to review how
we do it, but in the case scenario 3 can happen, we could have the ledger
writers using different master keys, which would cause the bookie to return an
error when trying to write to a ledger that already exists."
Proposal 2 still requires access to ZooKeeper, but the write frequency can be
made quite low by choosing a large batch size (e.g. 10000).
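A minimal sketch of the batch idea, in Java. The class BatchIdGenerator and the rangeAllocator callback are names made up for this example and are not taken from the patch. The shared counter is modeled by an in-memory AtomicLong as a stand-in; in a real implementation the allocator would presumably be an atomic update of a counter znode in ZooKeeper:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongUnaryOperator;

// Hypothetical sketch: hands out ids locally and only calls the shared
// allocator once per batch.
final class BatchIdGenerator {
    // Takes a batch size, reserves that many ids, returns the first reserved id.
    private final LongUnaryOperator rangeAllocator;
    private final long batchSize;
    private long next;  // next id to hand out
    private long limit; // exclusive end of the current range; next == limit means "empty"

    BatchIdGenerator(LongUnaryOperator rangeAllocator, long batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.rangeAllocator = rangeAllocator;
        this.batchSize = batchSize;
    }

    synchronized long nextId() {
        if (next == limit) {
            // Current range exhausted (or never allocated): reserve a new one.
            next = rangeAllocator.applyAsLong(batchSize);
            limit = next + batchSize;
        }
        return next++;
    }

    public static void main(String[] args) {
        AtomicLong sharedCounter = new AtomicLong(); // stand-in for the counter znode
        LongUnaryOperator allocator = sharedCounter::getAndAdd;
        BatchIdGenerator clientA = new BatchIdGenerator(allocator, 10000);
        BatchIdGenerator clientB = new BatchIdGenerator(allocator, 10000);
        System.out.println(clientA.nextId()); // first id of client A's range
        System.out.println(clientB.nextId()); // first id of client B's range
    }
}
```

Ids are unique across clients and monotonic within one client, but not globally ordered by creation time, and the unused remainder of a range is lost when a client exits.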
In summary, proposal 1 aims to generate a UUID/GUID-like id in a 64-bit space,
but the possibility of conflicts must be taken into account, and if the
generated ids are not monotonic we have to handle case 3 listed above.
Proposal 2 has no problem generating monotonic ids quickly, but the process
involves ZooKeeper.
By the way, I've submitted a patch in
BOOKKEEPER-438<https://issues.apache.org/jira/browse/BOOKKEEPER-438> to move
ledger id generation out of LedgerManager, and I'll add a configuration setting
in another JIRA so that BookKeeper clients can plug in their own id generation
scheme. I'd appreciate it if anyone could help review the patch (thanks to
Sijie first).
--------------I'm a split line for 64-bit ledger id metadata management--------------
HierarchicalLedgerManager uses a 2-4-4 scheme to split the current 10-character
ledger id. E.g. ledger 0000000001 is split into 3 parts, 00, 0000 and 0001, and
stored at the ZooKeeper path "(ledgersRootPath)/00/0000/L0001". So each znode
holds at most 10000 ledgers, which avoids errors during garbage collection due
to lists of children that are too long.
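For reference, the 2-4-4 split described above can be sketched in Java as follows. This is my own illustration of the path layout, not the actual HierarchicalLedgerManager code, and the class and method names are made up:

```java
// Illustration of the 2-4-4 path layout; not the real HierarchicalLedgerManager code.
final class HierarchicalPath {
    static String ledgerPath(String ledgersRootPath, long ledgerId) {
        if (ledgerId < 0 || ledgerId > 9_999_999_999L) {
            // The 2-4-4 layout only covers ids that fit in 10 decimal digits.
            throw new IllegalArgumentException("ledger id out of range: " + ledgerId);
        }
        String id = String.format("%010d", ledgerId); // zero-pad to 10 characters
        return ledgersRootPath
                + "/" + id.substring(0, 2)
                + "/" + id.substring(2, 6)
                + "/L" + id.substring(6, 10);
    }

    public static void main(String[] args) {
        // "/ledgers" is just an example root path.
        System.out.println(ledgerPath("/ledgers", 1L)); // /ledgers/00/0000/L0001
    }
}
```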
Once we enlarge the ledger id space to 64 bits, managing the metadata for such
large ledger ids becomes a real problem.
My idea is to split the ledger id into digits of radix 2^13 = 8192 and store
them as a radix tree. For example, with ledger ids 2, 5 and 41093
(== 5 x 8192 + 133), the tree in ZooKeeper would be:
         (ledger id root)
          /            \
     2 (meta)        5 (meta)
                        \
                      133 (meta)
So there will be at most 8192 children under each znode, and the depth is at
most ceil(64/13) = 5.
Note that inner znodes also record metadata, so if ledger ids are generated in
increasing order, the depth of this radix tree only grows as needed. I believe
it can in principle handle all 2^64 ledger ids.
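A rough Java sketch of how a ledger id could be mapped to its znode path under this scheme. The class and method names are hypothetical, the path is relative to the ledger id root, and the id is treated as an unsigned 64-bit value:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the radix-8192 path layout proposed above.
final class RadixLedgerPath {
    static final int BITS = 13;                // 2^13 = 8192 children per znode at most
    static final long MASK = (1L << BITS) - 1; // low 13 bits

    // Returns the path below the ledger id root, most significant digit first.
    static String toPath(long ledgerId) {
        Deque<String> parts = new ArrayDeque<>();
        long rest = ledgerId;
        do {
            parts.addFirst(Long.toString(rest & MASK)); // lowest remaining radix-8192 digit
            rest >>>= BITS;                             // unsigned shift: id is a 64-bit pattern
        } while (rest != 0);
        return "/" + String.join("/", parts);
    }

    public static void main(String[] args) {
        System.out.println(toPath(2L));     // /2
        System.out.println(toPath(5L));     // /5
        System.out.println(toPath(41093L)); // /5/133
    }
}
```

Because no leading zero digits are emitted, small ids stay close to the root, and the largest 64-bit pattern needs 5 path components.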
Speaking of metadata, I would like to share a test result from the last two
days. With HierarchicalLedgerManager, we observe that the metadata of one
ledger consumes 700+ bytes in ZooKeeper, possibly because
LedgerMetadata.serialize() uses a pure text format. However, the data stored in
the ledger id node is only 300+ bytes, so I guess the extra space is overhead
from the inner hierarchical nodes. What's more, a topic consumes 2k of memory
with only 1 subscriber and no publishing: there is no metadata for topic
ownership (since we now use consistent hashing for topic ownership), and the
metadata for subscription and for persistence is 8 bytes each. I'll investigate
more and then start a new thread on it.
Best,
Jiannan