On Aug 2, 2017 6:54 PM, "Yiming Zang" <yz...@twitter.com.invalid> wrote:

I was looking into RackawareEnsemblePlacementPolicyImpl.java, and it turns
out to me that the current implementation of
RackawareEnsemblePlacementPolicy only enforces at least 2 different racks,
it doesn't care about ensemble size, write quorum size or ack quorum size.

For example, imagine we have the follow rack diversity (r1 means rack1):

===========
BK1: r1
BK2: r2
BK3: r3
BK4: r3
BK5: r3
===========

Now we're creating a ledger with (3,3,2), in which ensemble size and write
quorum size are both 2. My expected behavior is the ensemble must contains
BK1 and BK2, and would choose the third bookie from BK3~BK5. However, in
fact, the ensemble violate that rule, which can be BK2, BK3, BK4, where two
of the bookies share the same rack.

I've also added a test case to validate this theory.

Question: Is this behavior expected? Shall we use write quorum size as rack
diversity so that we can spread the ledger across more racks?


I think that was the expected behavior. The idea was choosing bookies at
the same rack as many as possible and have replicas at a remote rack for
availability. Because typically inter-rack latency will be higher than
intra-rack. So the rack devesity enforcement isn't aligned with write
quorum size.

I think we can introduce a setting in rack aware placement policy like
'rack.diversity.enforced'. if it is true, the number of racks will be min
(num of total racks, write quorum size).

Sijie


There core logic is here, where it use "~" to represent a different rack
than the previous one:

// pick nodes by racks, to ensure there is at least two racks per write
quorum.
for (int i = 0; i < ensembleSize; i++) {
    String curRack;
    if (null == prevNode) {
        if ((null == localNode) ||

localNode.getNetworkLocation().equals(NetworkTopology.DEFAULT_RACK)) {
            curRack = NodeBase.ROOT;
        } else {
            curRack = localNode.getNetworkLocation();
        }
    } else {
        curRack = "~" + prevNode.getNetworkLocation();
    }
    prevNode = selectFromNetworkLocation(curRack, excludeNodes,
ensemble, ensemble);
}
ArrayList<BookieSocketAddress> bookieList = ensemble.toList();
if (ensembleSize != bookieList.size()) {
    LOG.error("Not enough {} bookies are available to form an ensemble :
{}.",
              ensembleSize, bookieList);
    throw new BKNotEnoughBookiesException();
}
return bookieList;

Reply via email to