On Tue, Jul 29, 2014 at 5:50 PM, Bela Ban <b...@redhat.com> wrote:
>
> On 29/07/14 16:42, Dan Berindei wrote:
> > Have you tried regular optimistic/pessimistic transactions as well?
>
> Yes, in my first impl. but since I'm making only 1 change per request, I
> thought a TX is overkill.
>
You are using txs with TO, right?

> > They *should* have fewer issues with the OOB thread pool than non-tx
> > mode, and I'm quite curious how they stack against TO in such a large
> > cluster.
>
> Why would they have fewer issues with the thread pools ? AIUI, a TX
> involves 2 RPCs (PREPARE-COMMIT/ROLLBACK) compared to one when not using
> TXs. And we're sync anyway...

Actually, 2 sync RPCs (prepare + commit) and 1 async RPC (tx completion
notification). But we only keep the user thread busy across RPCs (unless
L1 is enabled), so we actually need fewer OOB/remote threads.

> > On Tue, Jul 29, 2014 at 5:38 PM, Bela Ban <b...@redhat.com> wrote:
> >
> >     Following up on my own email, I changed the config to use Pedro's
> >     excellent total order implementation:
> >
> >     <transaction transactionMode="TRANSACTIONAL"
> >         transactionProtocol="TOTAL_ORDER" lockingMode="OPTIMISTIC"
> >         useEagerLocking="true" eagerLockSingleNode="true">
> >         <recovery enabled="false"/>
> >
> >     With 100 nodes and 25 requester threads/node, I did NOT run into any
> >     locking issues !
> >
> >     I could even go up to 200 requester threads/node and the perf was ~
> >     7'000-8'000 requests/sec/node. Not too bad !
> >
> >     This really validates the concept of lockless total-order
> >     dissemination of TXs; for the first time, this has been tested on a
> >     large(r) scale (previously only on 25 nodes) and IT WORKS ! :-)
> >
> >     I still believe we should implement my suggested solution for non-TO
> >     configs, but short of configuring thread pools of 1000 threads or
> >     higher, I hope TO will allow me to finally test a 500 node Infinispan
> >     cluster !
> >
> >
> >     On 29/07/14 15:56, Bela Ban wrote:
> > > Hi guys,
> > >
> > > sorry for the long post, but I do think I ran into an important problem
> > > and we need to fix it ... :-)
> > >
> > > I've spent the last couple of days running the IspnPerfTest [1] perftest
> > > on Google Compute Engine (GCE), and I've run into a problem with
> > > Infinispan. It is a design problem and can be mitigated by sizing thread
> > > pools correctly, but cannot be eliminated entirely.
> > >
> > >
> > > Symptom:
> > > --------
> > > IspnPerfTest has every node in a cluster perform 20'000 requests on keys
> > > in range [1..20000].
> > >
> > > 80% of the requests are reads and 20% writes.
> > >
> > > By default, we have 25 requester threads per node and 100 nodes in a
> > > cluster, so a total of 2500 requester threads.
> > >
> > > The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:
> > >
> > > <namedCache name="clusteredCache">
> > >     <clustering mode="distribution">
> > >         <stateTransfer awaitInitialTransfer="true"/>
> > >         <hash numOwners="2"/>
> > >         <sync replTimeout="20000"/>
> > >     </clustering>
> > >
> > >     <transaction transactionMode="NON_TRANSACTIONAL"
> > >                  useEagerLocking="true"
> > >                  eagerLockSingleNode="true" />
> > >     <locking lockAcquisitionTimeout="5000" concurrencyLevel="1000"
> > >              isolationLevel="READ_COMMITTED" useLockStriping="false" />
> > > </namedCache>
> > >
> > > It has 2 owners, a lock acquisition timeout of 5s and a repl timeout of
> > > 20s. Lock striping is off, so we have 1 lock per key.
> > >
> > > When I run the test, I always get errors like those below:
> > >
> > > org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock
> > > after [10 seconds] on key [19386] for requestor [Thread[invoker-3,5,main]]!
> > > Lock held by [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]
> > >
> > > and
> > >
> > > org.infinispan.util.concurrent.TimeoutException: Node m8.1 timed out
> > >
> > >
> > > Investigation:
> > > ------------
> > > When I looked at UNICAST3, I saw a lot of missing messages on the
> > > receive side and unacked messages on the send side. This caused me to
> > > look into the (mainly OOB) thread pools and - voila - maxed out !
> > >
> > > I learned from Pedro that the Infinispan internal thread pool (with a
> > > default of 32 threads) can be configured, so I increased it to 300 and
> > > increased the OOB pools as well.
> > >
> > > This mitigated the problem somewhat, but when I increased the requester
> > > threads to 100, I had the same problem again. Apparently, the Infinispan
> > > internal thread pool uses a rejection policy of "run" and thus uses the
> > > JGroups (OOB) thread when exhausted.
> > >
> > > I learned (from Pedro and Mircea) that GETs and PUTs work as follows in
> > > dist-sync / 2 owners:
> > > - GETs are sent to the primary and backup owners and the first response
> > >   received is returned to the caller. No locks are acquired, so GETs
> > >   shouldn't cause problems.
> > >
> > > - A PUT(K) is sent to the primary owner of K
> > > - The primary owner
> > >   (1) locks K
> > >   (2) updates the backup owner synchronously *while holding the lock*
> > >   (3) releases the lock
> > >
> > >
> > > Hypothesis
> > > ----------
> > > (2) above is done while holding the lock. The sync update of the backup
> > > owner is done with the lock held to guarantee that the primary and
> > > backup owner of K have the same values for K.
> > >
> > > However, the sync update *inside the lock scope* slows things down (can
> > > it also lead to deadlocks?); there's the risk that the request is
> > > dropped due to a full incoming thread pool, or that the response is not
> > > received because of the same, or that the locking at the backup owner
> > > blocks for some time.
> > >
> > > If we have many threads modifying the same key, then we have a backlog
> > > of locking work against that key. Say we have 100 requester threads and
> > > a 100 node cluster. This means that we have 10'000 threads accessing
> > > keys; with 2'000 writers there's a big chance that some writers pick the
> > > same key at the same time.
> > >
> > > For example, if we have 100 threads accessing key K and it takes 3ms to
> > > replicate K to the backup owner, then the last of the 100 threads waits
> > > ~300ms before it gets a chance to lock K on the primary owner and
> > > replicate it as well.
> > >
> > > Just a small hiccup in sending the PUT to the primary owner, sending the
> > > modification to the backup owner, waiting for the response, or GC, and
> > > the delay will quickly become bigger.
> > >
> > >
> > > Verification
> > > ----------
> > > To verify the above, I set numOwners to 1. This means that the primary
> > > owner of K does *not* send the modification to the backup owner, it only
> > > locks K, modifies K and unlocks K again.
> > >
> > > I ran the IspnPerfTest again on 100 nodes, with 25 requesters, and NO
> > > PROBLEM !
> > >
> > > I then increased the requesters to 100, 150 and 200 and the test
> > > completed flawlessly ! Performance was around *40'000 requests per node
> > > per sec* on 4-core boxes !
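To make the backlog effect above concrete, here is a small self-contained
Java sketch of the write path in steps (1)-(3), with the sync backup RPC
replaced by a 3 ms sleep. This is not Infinispan code -- all names
(putOnPrimary, replicateToBackup, STORE) are invented for illustration --
it only shows how writers of the same key serialize behind the primary's
per-key lock, so the last of 100 writers waits roughly 100 x 3 ms before it
even gets to replicate:

import java.util.Map;
import java.util.concurrent.*;
import java.util.concurrent.locks.ReentrantLock;

public class PrimaryOwnerBacklogDemo {
    // one lock per key, as with lock striping disabled
    static final Map<Integer, ReentrantLock> LOCKS = new ConcurrentHashMap<>();
    static final Map<Integer, String> STORE = new ConcurrentHashMap<>();

    // simulated sync RPC to the backup owner, ~3 ms round trip
    static void replicateToBackup(int key, String value) throws InterruptedException {
        Thread.sleep(3);
    }

    // the pattern from steps (1)-(3): replicate *inside* the lock scope
    static void putOnPrimary(int key, String value) throws InterruptedException {
        ReentrantLock lock = LOCKS.computeIfAbsent(key, k -> new ReentrantLock());
        if (!lock.tryLock(5, TimeUnit.SECONDS))        // lockAcquisitionTimeout=5s
            throw new RuntimeException("Unable to acquire lock on key " + key);
        try {
            STORE.put(key, value);                     // (2a) apply locally
            replicateToBackup(key, value);             // (2b) sync RPC while holding the lock
        } finally {
            lock.unlock();                             // (3) release the lock
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(100);
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++)                  // 100 writers, same key
            pool.submit(() -> { putOnPrimary(42, "v"); return null; });
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.printf("last writer done after ~%d ms%n",
                          (System.nanoTime() - start) / 1_000_000);  // ~300 ms
    }
}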
> > >
> > > Root cause
> > > ---------
> > > *******************
> > > The root cause is the sync RPC of K to the backup owner(s) of K while
> > > the primary owner holds the lock for K.
> > > *******************
> > >
> > > This causes a backlog of threads waiting for the lock and that backlog
> > > can grow to exhaust the thread pools. First the Infinispan internal
> > > thread pool, then the JGroups OOB thread pool. The latter causes
> > > retransmissions to get dropped, which compounds the problem...
> > >
> > >
> > > Goal
> > > ----
> > > The goal is to make sure that primary and backup owner(s) of K have the
> > > same value for K.
> > >
> > > Simply sending the modification to the backup owner(s) asynchronously
> > > won't guarantee this, as modification messages might get processed out
> > > of order as they're OOB !
> > >
> > >
> > > Suggested solution
> > > ----------------
> > > The modification RPC needs to be invoked *outside of the lock scope*:
> > > - lock K
> > > - modify K
> > > - unlock K
> > > - send modification to backup owner(s) // outside the lock scope
> > >
> > > The primary owner puts the modification of K into a queue from where a
> > > separate thread/task removes it. The thread then invokes the PUT(K) on
> > > the backup owner(s).
> > >
> > > The queue has the modified keys in FIFO order, so the modifications
> > > arrive at the backup owner(s) in the right order.
> > >
> > > This requires that the way GET is implemented changes slightly: instead
> > > of invoking a GET on all owners of K, we only invoke it on the primary
> > > owner, then the next-in-line etc.
> > >
> > > The reason for this is that the backup owner(s) may not yet have
> > > received the modification of K.
> > >
> > > This is a better impl anyway (we discussed this before) because it
> > > generates less traffic; in the normal case, all but one of the GET
> > > requests are unnecessary.
> > >
> > >
> > > Improvement
> > > -----------
> > > The above solution can be simplified and even made more efficient.
> > > Re-using concepts from IRAC [2], we can simply store the modified *keys*
> > > in the modification queue. The modification replication thread removes
> > > the key, gets the current value and invokes a PUT/REMOVE on the backup
> > > owner(s).
> > >
> > > Even better: a key is only ever added *once*, so if we have [5,2,17,3],
> > > adding key 2 is a no-op because the processing of key 2 (in second
> > > position in the queue) will fetch the up-to-date value anyway !
> > >
> > >
> > > Misc
> > > ----
> > > - Could we possibly use total order to send the updates in TO ?
> > >   TBD (Pedro?)
> > >
> > >
> > > Thoughts ?
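For what it's worth, the "Improvement" above could look roughly like the
sketch below -- a minimal illustration only, with invented names (store,
backupRpc, onModification), not an actual Infinispan patch. The keys go into
a FIFO queue, a companion set makes re-adding a pending key a no-op, and a
single replication thread reads the *current* value and pushes it to the
backup owner(s) outside any lock scope:

import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BiConsumer;

public class KeyReplicationQueue<K, V> {
    private final BlockingQueue<K> queue = new LinkedBlockingQueue<>();
    private final Set<K> pending = ConcurrentHashMap.newKeySet();
    private final Map<K, V> store;                    // the primary owner's local data
    private final BiConsumer<K, V> backupRpc;         // PUT/REMOVE sent to the backup owner(s)

    public KeyReplicationQueue(Map<K, V> store, BiConsumer<K, V> backupRpc) {
        this.store = store;
        this.backupRpc = backupRpc;
        Thread drainer = new Thread(this::drain, "backup-replicator");
        drainer.setDaemon(true);
        drainer.start();
    }

    // Called by the primary owner right after "lock K; modify K; unlock K".
    // If K is already queued, this is a no-op: the pending entry will pick up
    // the newest value anyway when it is processed.
    public void onModification(K key) {
        if (pending.add(key))
            queue.add(key);
    }

    private void drain() {
        try {
            while (true) {
                K key = queue.take();                 // FIFO order of first modification
                pending.remove(key);                  // allow the key to be re-queued from now on
                V value = store.get(key);             // read the *current* value
                backupRpc.accept(key, value);         // replicate outside any lock scope
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();       // shut down
        }
    }
}

A real implementation would of course also have to deal with a failed backup
RPC and with ownership changing during a rebalance, which this sketch ignores.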
> > >
> > > [1] https://github.com/belaban/IspnPerfTest
> > > [2] https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering
> >
> > --
> > Bela Ban, JGroups lead (http://www.jgroups.org)
>
> --
> Bela Ban, JGroups lead (http://www.jgroups.org)
>
_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev