On Aug 2, 2017 6:54 PM, "Yiming Zang" <yz...@twitter.com.invalid> wrote:
I was looking into RackawareEnsemblePlacementPolicyImpl.java, and it turns out to me that the current implementation of RackawareEnsemblePlacementPolicy only enforces at least 2 different racks, it doesn't care about ensemble size, write quorum size or ack quorum size. For example, imagine we have the follow rack diversity (r1 means rack1): =========== BK1: r1 BK2: r2 BK3: r3 BK4: r3 BK5: r3 =========== Now we're creating a ledger with (3,3,2), in which ensemble size and write quorum size are both 2. My expected behavior is the ensemble must contains BK1 and BK2, and would choose the third bookie from BK3~BK5. However, in fact, the ensemble violate that rule, which can be BK2, BK3, BK4, where two of the bookies share the same rack. I've also added a test case to validate this theory. Question: Is this behavior expected? Shall we use write quorum size as rack diversity so that we can spread the ledger across more racks? I think that was the expected behavior. The idea was choosing bookies at the same rack as many as possible and have replicas at a remote rack for availability. Because typically inter-rack latency will be higher than intra-rack. So the rack devesity enforcement isn't aligned with write quorum size. I think we can introduce a setting in rack aware placement policy like 'rack.diversity.enforced'. if it is true, the number of racks will be min (num of total racks, write quorum size). Sijie There core logic is here, where it use "~" to represent a different rack than the previous one: // pick nodes by racks, to ensure there is at least two racks per write quorum. for (int i = 0; i < ensembleSize; i++) { String curRack; if (null == prevNode) { if ((null == localNode) || localNode.getNetworkLocation().equals(NetworkTopology.DEFAULT_RACK)) { curRack = NodeBase.ROOT; } else { curRack = localNode.getNetworkLocation(); } } else { curRack = "~" + prevNode.getNetworkLocation(); } prevNode = selectFromNetworkLocation(curRack, excludeNodes, ensemble, ensemble); } ArrayList<BookieSocketAddress> bookieList = ensemble.toList(); if (ensembleSize != bookieList.size()) { LOG.error("Not enough {} bookies are available to form an ensemble : {}.", ensembleSize, bookieList); throw new BKNotEnoughBookiesException(); } return bookieList;