On Wed, Jul 30, 2014 at 12:00 PM, Pedro Ruivo <pe...@infinispan.org> wrote:
> On 07/30/2014 09:02 AM, Dan Berindei wrote:
> >
> > if your proposal is only meant to apply to non-tx caches, you are right you don't have to worry about multiple primary owners... most of the time. But when the primary owner changes, then you do have 2 primary owners (if the new primary owner installs the new topology first), and you do need to coordinate between the 2.
>
> I think it is the same for transactional cache. I.e. the commands wait for the transaction data from the new topology to be installed. In the non-tx caches, the old primary owner will send the next "sequence number" to the new primary owner and only after that, the new primary owner starts to give the orders.

I'm not sure that's related, commands that wait for a newer topology do not block a thread since the ISPN-3527 fix.

> Otherwise, I can implement a total order version for non-tx caches and all the write serialization would be done in JGroups and Infinispan only has to apply the updates as soon as they are delivered.

Right, that sounds quite interesting. But you'd also need a less-blocking state transfer ;)

> > Slightly related: we also considered generating a version number on the client for consistency when the HotRod client retries after a primary owner failure [1]. But the clients can't create a monotonic sequence number, so we couldn't use that version number for this.
> >
> > [1] https://issues.jboss.org/browse/ISPN-2956
> >
> > Also I don't see it as an alternative to TOA, I rather expect it to work nicely together: when TOA is enabled you could trust the originating sequence source rather than generate a per-entry sequence, and in neither case you need to actually use a Lock. I haven't thought how the sequences would need to interact (if they need), but they seem complementary to resolve different aspects, and also both benefit from the same cleanup and basic structure.
> >
> > We don't acquire locks at all on the backup owners - either in tx or non-tx caches. If state transfer is in progress, we use ConcurrentHashMap.compute() to store tracking information, which uses a synchronized block, so I suppose we do acquire locks. I assume your proposal would require a DataContainer.compute() or something similar on the backups, to ensure that the version check and the replacement are atomic.
> >
> > I still think TOA does what you want for tx caches. Your proposal would only work for non-tx caches, so you couldn't use them together.
> >
> > >> Another aspect is that the "user thread" on the primary owner needs to wait (at least until we improve further) and only proceed after ACK from backup nodes, but this is better modelled through a state machine. (Also discussed in Farnborough).
> >
> > > To be clear, I don't think keeping the user thread on the originator blocked until we have the write confirmations from all the backups is a problem - a sync operation has to block, and it also serves to rate-limit user operations.
> >
> > There are better ways to rate-limit than to make all operations slow; we don't need to block a thread, we need to react on the reply from the backup owners. You still have an inherent rate-limit in the outgoing packet queues: if these fill up, then and only then it's nice to introduce some back pressure.
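To make the compute()-based point above concrete, here is a minimal sketch of how a backup owner could apply a primary-assigned sequence number with ConcurrentHashMap.compute(), so the version check and the replacement happen as one atomic step. VersionedValue, applyFromPrimary() and the field names are made up for illustration; they are not Infinispan API:

    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical versioned entry; not an Infinispan class, just for illustration.
    final class VersionedValue {
        final long version;   // sequence number assigned by the primary owner
        final Object value;

        VersionedValue(long version, Object value) {
            this.version = version;
            this.value = value;
        }
    }

    final class BackupStore {

        private final ConcurrentHashMap<Object, VersionedValue> data = new ConcurrentHashMap<>();

        // Applies an update coming from the primary owner. The version check and
        // the replacement happen inside compute(), so a re-ordered (stale) update
        // can never overwrite a newer one, and no explicit per-key lock is taken.
        void applyFromPrimary(Object key, long version, Object value) {
            data.compute(key, (k, current) ->
                    (current == null || current.version < version)
                            ? new VersionedValue(version, value)
                            : current);   // stale update: keep what we already have
        }
    }

This is essentially the guarantee a DataContainer.compute()-style hook on the backups would give, just shown against a plain map.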
> >
> > Sorry, you got me confused when you called the thread on the primary owner a "user thread". I agree that internal stuff can and should be asynchronous, callback based, but the user still has to see a synchronous blocking operation.
> >
> > > The problem appears when the originator is not the primary owner, and the thread blocking for backup ACKs is from the remote-executor pool (or OOB, when the remote-executor pool is exhausted).
> >
> > Not following. I guess this is out of scope now that I clarified the proposed solution is only to be applied between primary and backups?
> >
> > Yeah, I was just trying to clarify that there is no danger of exhausting the remote executor/OOB thread pools when the originator of the write command is the primary owner (as it happens in the HotRod server).
> >
> > >> It's also conceptually linked to:
> > >> - https://issues.jboss.org/browse/ISPN-1599
> > >> As you need to separate the locks of entries from the effective user-facing lock, at least to implement transactions on top of this model.
> >
> > > I think we fixed ISPN-1599 when we changed passivation to use DataContainer.compute(). WDYT Pedro, is there anything else you'd like to do in the scope of ISPN-1599?
> >
> > >> I expect this to improve performance in a very significant way, but it's getting embarrassing that it's still not done; at the next face to face meeting we should also reserve some time for retrospective sessions.
> >
> > > Implementing the state machine-based interceptor stack may give us a performance boost, but I'm much more certain that it's a very complex, high risk task... and we don't have a stable test suite yet :)
> >
> > Cleaning up and removing some complexity such as TooManyExecutorsException might help to get it stable, and keep it there :) BTW it was quite stable for me until you changed the JGroups UDP default configuration.
> >
> > Do you really use UDP to run the tests? The default is TCP, but maybe some tests don't use TestCacheManagerFactory...
> >
> > I was just aligning our configs with Bela's recommendations: MERGE3 instead of MERGE2 and the removal of UFC in TCP stacks. If they cause problems on your machine, you should make more noise :)
> >
> > Dan
> >
> > Sanne
> >
> > >> Sanne
> > >>
> > >> On 29 July 2014 15:50, Bela Ban <b...@redhat.com> wrote:
> > >> >
> > >> > On 29/07/14 16:42, Dan Berindei wrote:
> > >> >> Have you tried regular optimistic/pessimistic transactions as well?
> > >> >
> > >> > Yes, in my first impl. but since I'm making only 1 change per request, I thought a TX is overkill.
> > >> >
> > >> >> They *should* have fewer issues with the OOB thread pool than non-tx mode, and I'm quite curious how they stack against TO in such a large cluster.
> > >> >
> > >> > Why would they have fewer issues with the thread pools ? AIUI, a TX involves 2 RPCs (PREPARE-COMMIT/ROLLBACK) compared to one when not using TXs. And we're sync anyway...
> > >> >
> > >> >> On Tue, Jul 29, 2014 at 5:38 PM, Bela Ban <b...@redhat.com> wrote:
> > >> >>
> > >> >> Following up on my own email, I changed the config to use Pedro's excellent total order implementation:
> > >> >>
> > >> >> <transaction transactionMode="TRANSACTIONAL" transactionProtocol="TOTAL_ORDER"
> > >> >>              lockingMode="OPTIMISTIC" useEagerLocking="true" eagerLockSingleNode="true">
> > >> >>    <recovery enabled="false"/>
> > >> >> </transaction>
> > >> >>
> > >> >> With 100 nodes and 25 requester threads/node, I did NOT run into any locking issues !
> > >> >>
> > >> >> I could even go up to 200 requester threads/node and the perf was ~ 7'000-8'000 requests/sec/node. Not too bad !
> > >> >>
> > >> >> This really validates the concept of lockless total-order dissemination of TXs; for the first time, this has been tested on a large(r) scale (previously only on 25 nodes) and IT WORKS ! :-)
> > >> >>
> > >> >> I still believe we should implement my suggested solution for non-TO configs, but short of configuring thread pools of 1000 threads or higher, I hope TO will allow me to finally test a 500 node Infinispan cluster !
> > >> >>
> > >> >> On 29/07/14 15:56, Bela Ban wrote:
> > >> >> > Hi guys,
> > >> >> >
> > >> >> > sorry for the long post, but I do think I ran into an important problem and we need to fix it ... :-)
> > >> >> >
> > >> >> > I've spent the last couple of days running the IspnPerfTest [1] perftest on Google Compute Engine (GCE), and I've run into a problem with Infinispan. It is a design problem and can be mitigated by sizing thread pools correctly, but cannot be eliminated entirely.
> > >> >> >
> > >> >> > Symptom:
> > >> >> > --------
> > >> >> > IspnPerfTest has every node in a cluster perform 20'000 requests on keys in range [1..20000].
> > >> >> >
> > >> >> > 80% of the requests are reads and 20% writes.
> > >> >> >
> > >> >> > By default, we have 25 requester threads per node and 100 nodes in a cluster, so a total of 2500 requester threads.
> > >> >> >
> > >> >> > The cache used is NON-TRANSACTIONAL / dist-sync / 2 owners:
> > >> >> >
> > >> >> > <namedCache name="clusteredCache">
> > >> >> >    <clustering mode="distribution">
> > >> >> >       <stateTransfer awaitInitialTransfer="true"/>
> > >> >> >       <hash numOwners="2"/>
> > >> >> >       <sync replTimeout="20000"/>
> > >> >> >    </clustering>
> > >> >> >
> > >> >> >    <transaction transactionMode="NON_TRANSACTIONAL" useEagerLocking="true" eagerLockSingleNode="true" />
> > >> >> >    <locking lockAcquisitionTimeout="5000" concurrencyLevel="1000" isolationLevel="READ_COMMITTED" useLockStriping="false" />
> > >> >> > </namedCache>
> > >> >> >
> > >> >> > It has 2 owners, a lock acquisition timeout of 5s and a repl timeout of 20s. Lock striping is off, so we have 1 lock per key.
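For readers who prefer the programmatic API, a rough equivalent of the quoted cache configuration; the method names are from the ConfigurationBuilder API as I recall it for that Infinispan generation, so treat them as approximate rather than authoritative:

    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.Configuration;
    import org.infinispan.configuration.cache.ConfigurationBuilder;
    import org.infinispan.transaction.TransactionMode;
    import org.infinispan.util.concurrent.IsolationLevel;

    public class ClusteredCacheConfig {

        // Rough programmatic equivalent of the XML above (6.x/7.x-era builder API;
        // method names approximate).
        static Configuration clusteredCache() {
            ConfigurationBuilder b = new ConfigurationBuilder();
            b.clustering().cacheMode(CacheMode.DIST_SYNC);              // dist-sync
            b.clustering().hash().numOwners(2);                         // 2 owners
            b.clustering().stateTransfer().awaitInitialTransfer(true);
            b.clustering().sync().replTimeout(20000);                   // 20s repl timeout
            b.transaction().transactionMode(TransactionMode.NON_TRANSACTIONAL);
            b.locking().lockAcquisitionTimeout(5000)                    // 5s lock timeout
                       .concurrencyLevel(1000)
                       .isolationLevel(IsolationLevel.READ_COMMITTED)
                       .useLockStriping(false);                         // 1 lock per key
            return b.build();
        }
    }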
> > >> >> >
> > >> >> > When I run the test, I always get errors like those below:
> > >> >> >
> > >> >> > org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [10 seconds] on key [19386] for requestor [Thread[invoker-3,5,main]]! Lock held by [Thread[OOB-194,ispn-perf-test,m5.1,5,main]]
> > >> >> >
> > >> >> > and
> > >> >> >
> > >> >> > org.infinispan.util.concurrent.TimeoutException: Node m8.1 timed out
> > >> >> >
> > >> >> > Investigation:
> > >> >> > ------------
> > >> >> > When I looked at UNICAST3, I saw a lot of missing messages on the receive side and unacked messages on the send side. This caused me to look into the (mainly OOB) thread pools and - voila - maxed out !
> > >> >> >
> > >> >> > I learned from Pedro that the Infinispan internal thread pool (with a default of 32 threads) can be configured, so I increased it to 300 and increased the OOB pools as well.
> > >> >> >
> > >> >> > This mitigated the problem somewhat, but when I increased the requester threads to 100, I had the same problem again. Apparently, the Infinispan internal thread pool uses a rejection policy of "run" and thus uses the JGroups (OOB) thread when exhausted.
> > >> >> >
> > >> >> > I learned (from Pedro and Mircea) that GETs and PUTs work as follows in dist-sync / 2 owners:
> > >> >> > - GETs are sent to the primary and backup owners and the first response received is returned to the caller. No locks are acquired, so GETs shouldn't cause problems.
> > >> >> >
> > >> >> > - A PUT(K) is sent to the primary owner of K
> > >> >> > - The primary owner
> > >> >> >      (1) locks K
> > >> >> >      (2) updates the backup owner synchronously *while holding the lock*
> > >> >> >      (3) releases the lock
> > >> >> >
> > >> >> > Hypothesis
> > >> >> > ----------
> > >> >> > (2) above is done while holding the lock. The sync update of the backup owner is done with the lock held to guarantee that the primary and backup owner of K have the same values for K.
> > >> >> >
> > >> >> > However, the sync update *inside the lock scope* slows things down (can it also lead to deadlocks?); there's the risk that the request is dropped due to a full incoming thread pool, or that the response is not received because of the same, or that the locking at the backup owner blocks for some time.
> > >> >> >
> > >> >> > If we have many threads modifying the same key, then we have a backlog of locking work against that key. Say we have 100 requester threads and a 100 node cluster. This means that we have 10'000 threads accessing keys; with 2'000 writers there's a big chance that some writers pick the same key at the same time.
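To make step (2) easier to see, a schematic sketch of the primary-owner PUT handling described above; the classes and the RPC helper are placeholders, not the actual Infinispan code path. The point is that the sync RPC to the backup happens while the per-key lock is still held, so every other writer of K queues up behind that RPC:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    // Schematic of the current primary-owner PUT flow; illustrative only.
    final class PrimaryOwnerCurrentFlow {

        private final Map<Object, Object> store = new ConcurrentHashMap<>();
        private final Map<Object, Lock> locks = new ConcurrentHashMap<>();

        Object put(Object key, Object value) {
            Lock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
            lock.lock();                                   // (1) lock K
            try {
                Object previous = store.put(key, value);  // modify K locally
                syncRpcToBackups(key, value);              // (2) sync RPC to the backup owner(s),
                                                           //     *while still holding the lock*
                return previous;
            } finally {
                lock.unlock();                             // (3) release the lock
            }
        }

        private void syncRpcToBackups(Object key, Object value) {
            // placeholder: blocks until the backup owner(s) acknowledge the update
            // (or until the replication timeout expires)
        }
    }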
> > >> >> >
> > >> >> > For example, if we have 100 threads accessing key K and it takes 3ms to replicate K to the backup owner, then the last of the 100 threads waits ~300ms before it gets a chance to lock K on the primary owner and replicate it as well.
> > >> >> >
> > >> >> > Just a small hiccup in sending the PUT to the primary owner, sending the modification to the backup owner, waiting for the response, or GC, and the delay will quickly become bigger.
> > >> >> >
> > >> >> > Verification
> > >> >> > ----------
> > >> >> > To verify the above, I set numOwners to 1. This means that the primary owner of K does *not* send the modification to the backup owner, it only locks K, modifies K and unlocks K again.
> > >> >> >
> > >> >> > I ran the IspnPerfTest again on 100 nodes, with 25 requesters, and NO PROBLEM !
> > >> >> >
> > >> >> > I then increased the requesters to 100, 150 and 200 and the test completed flawlessly ! Performance was around *40'000 requests per node per sec* on 4-core boxes !
> > >> >> >
> > >> >> > Root cause
> > >> >> > ---------
> > >> >> > *******************
> > >> >> > The root cause is the sync RPC of K to the backup owner(s) of K while the primary owner holds the lock for K.
> > >> >> > *******************
> > >> >> >
> > >> >> > This causes a backlog of threads waiting for the lock and that backlog can grow to exhaust the thread pools. First the Infinispan internal thread pool, then the JGroups OOB thread pool. The latter causes retransmissions to get dropped, which compounds the problem...
> > >> >> >
> > >> >> > Goal
> > >> >> > ----
> > >> >> > The goal is to make sure that primary and backup owner(s) of K have the same value for K.
> > >> >> >
> > >> >> > Simply sending the modification to the backup owner(s) asynchronously won't guarantee this, as modification messages might get processed out of order as they're OOB !
> > >> >> >
> > >> >> > Suggested solution
> > >> >> > ----------------
> > >> >> > The modification RPC needs to be invoked *outside of the lock scope*:
> > >> >> > - lock K
> > >> >> > - modify K
> > >> >> > - unlock K
> > >> >> > - send modification to backup owner(s) // outside the lock scope
> > >> >> >
> > >> >> > The primary owner puts the modification of K into a queue from where a separate thread/task removes it. The thread then invokes the PUT(K) on the backup owner(s).
> > >> >> >
> > >> >> > The queue has the modified keys in FIFO order, so the modifications arrive at the backup owner(s) in the right order.
> > >> >> >
> > >> >> > This requires that the way GET is implemented changes slightly: instead of invoking a GET on all owners of K, we only invoke it on the primary owner, then the next-in-line etc.
> > >> >> >
> > >> >> > The reason for this is that the backup owner(s) may not yet have received the modification of K.
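A minimal sketch of that suggested flow, under the same caveat that all names are illustrative and the RPC is a stub: only the local write happens inside the lock scope, and a single replication thread drains a FIFO queue and performs the backup RPC without holding the key lock.

    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    // Sketch of the suggested change: the backup RPC moves out of the lock scope
    // and onto a dedicated replication thread fed by a FIFO queue.
    final class PrimaryOwnerSuggestedFlow {

        private static final class Modification {
            final Object key;
            final Object value;
            Modification(Object key, Object value) { this.key = key; this.value = value; }
        }

        private final Map<Object, Object> store = new ConcurrentHashMap<>();
        private final Map<Object, Lock> locks = new ConcurrentHashMap<>();
        private final BlockingQueue<Modification> queue = new LinkedBlockingQueue<>();

        PrimaryOwnerSuggestedFlow() {
            Thread replicator = new Thread(this::replicationLoop, "backup-replicator");
            replicator.setDaemon(true);
            replicator.start();
        }

        Object put(Object key, Object value) {
            Lock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
            lock.lock();                                   // lock K
            try {
                Object previous = store.put(key, value);  // modify K
                queue.add(new Modification(key, value));   // hand off to the replicator
                                                           // (non-blocking, keeps FIFO order)
                return previous;
            } finally {
                lock.unlock();                             // unlock K
            }
        }

        private void replicationLoop() {
            try {
                while (true) {
                    Modification mod = queue.take();       // FIFO order is preserved
                    putOnBackups(mod.key, mod.value);      // sync RPC, but no key lock is held here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        private void putOnBackups(Object key, Object value) {
            // placeholder for the PUT(K) RPC to the backup owner(s)
        }
    }

The one small deviation from the bullet list above is that the enqueue itself is done while the lock is still held; it is non-blocking, and doing it inside the lock scope keeps the queue in the same order as the local updates. The key-based improvement described next removes even that dependency.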
> > >> >> >
> > >> >> > This is a better impl anyway (we discussed this before) because it generates less traffic; in the normal case, all but one of the GET requests are unnecessary.
> > >> >> >
> > >> >> > Improvement
> > >> >> > -----------
> > >> >> > The above solution can be simplified and even made more efficient. Re-using concepts from IRAC [2], we can simply store the modified *keys* in the modification queue. The modification replication thread removes the key, gets the current value and invokes a PUT/REMOVE on the backup owner(s).
> > >> >> >
> > >> >> > Even better: a key is only ever added *once*, so if we have [5,2,17,3], adding key 2 is a no-op because the processing of key 2 (in second position in the queue) will fetch the up-to-date value anyway !
> > >> >> >
> > >> >> > Misc
> > >> >> > ----
> > >> >> > - Could we possibly use total order to send the updates in TO ? TBD (Pedro?)
> > >> >> >
> > >> >> > Thoughts ?
> > >> >> >
> > >> >> > [1] https://github.com/belaban/IspnPerfTest
> > >> >> > [2] https://github.com/infinispan/infinispan/wiki/RAC:-Reliable-Asynchronous-Clustering
> > >> >>
> > >> >> --
> > >> >> Bela Ban, JGroups lead (http://www.jgroups.org)
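And a sketch of that key-based variant: the queue holds only keys, a key is queued at most once, and the replication thread reads the current value when it reaches the key, so later writes to an already-queued key are folded into one backup update. The names are illustrative and not taken from the IRAC design; the RPC methods are placeholders:

    import java.util.Iterator;
    import java.util.LinkedHashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch of the "Improvement": a deduplicating, insertion-ordered key queue.
    final class KeyReplicationQueue {

        private final Set<Object> pendingKeys = new LinkedHashSet<>();       // FIFO + dedup
        private final Map<Object, Object> store = new ConcurrentHashMap<>(); // local data (placeholder)

        /** Called by the primary owner after it has applied a write locally. */
        synchronized void enqueue(Object key) {
            // add() returns false if the key is already pending: the queued entry
            // will pick up the newest value anyway when it is processed.
            if (pendingKeys.add(key)) {
                notifyAll();                                 // wake the replication thread
            }
        }

        /** Runs on the dedicated replication thread. */
        void replicationLoop() throws InterruptedException {
            while (true) {
                Object key;
                synchronized (this) {
                    while (pendingKeys.isEmpty()) {
                        wait();
                    }
                    Iterator<Object> it = pendingKeys.iterator();
                    key = it.next();                         // oldest pending key
                    it.remove();
                }
                Object current = store.get(key);             // read the *current* value
                if (current == null) {
                    removeOnBackups(key);                    // the key was removed in the meantime
                } else {
                    putOnBackups(key, current);              // PUT with the up-to-date value
                }
            }
        }

        private void putOnBackups(Object key, Object value) { /* RPC placeholder */ }

        private void removeOnBackups(Object key) { /* RPC placeholder */ }
    }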
_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev