On 11/25/2015 04:43 PM, Dan Berindei wrote:
> On Wed, Nov 25, 2015 at 5:15 PM, Radim Vansa <[email protected]> wrote:
>> On 11/25/2015 03:24 PM, Pedro Ruivo wrote:
>>> On 11/25/2015 01:20 PM, Radim Vansa wrote:
>>>> On 11/25/2015 12:07 PM, Sanne Grinovero wrote:
>>>>> On 25 November 2015 at 10:48, Pedro Ruivo <[email protected]> wrote:
>>>>>>> An alternative is to wait for all ACKs, but I think this could still
>>>>>>> be optimised in "triangle shape" too by having the Originator only
>>>>>>> wait for the ACKs from the non-primary replicas?
>>>>>>> So backup owners have to send a confirmation message to the
>>>>>>> Originator, while the Primary owner isn't expected to do so.
>>>>>> IMO, we should wait for all ACKs to keep our read design.
>>>> What exactly is our 'read design'?
>>> If we don't wait for all the ACKs, then we have to go to the primary
>>> owner for reads, even if the originator is a Backup owner.
>> I don't think so, but we probably have some miscommunication here. If
>> O = B, we still wait for the reply from B (which is local), which is
>> triggered by receiving an update from P (after applying the change
>> locally). So it goes:
>>
>> OB(application thread) [cache.put()] -(unordered)-> P(worker thread)
>> [applies update] -(ordered)-> OB(worker thread) [applies update]
>> -(in-VM)-> OB(application thread) [continues]
> In your example, O still has to receive a message from P with the
> previous value. The previous value may be included in the update sent
> by the primary, or it may be sent in a separate message, but O still
> has to receive the previous value somehow. Including the previous
> value in the backup update command is not necessary in general (except
> for FunctionalMap's commands, maybe?), so I'd rather use a separate
> message.

All right, in case we need the previous value it really makes sense to
send it to O directly.
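To make sure we mean the same flow, here is a toy sketch of the
originator's side (plain Java; sendToPrimary/ackFrom are hypothetical
transport stubs, not our actual code - just pinning down who waits for
whom):

    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    // Toy model of the triangle write path; not Infinispan internals.
    // O -> P is unordered, P -> B is FIFO-ordered, B -> O is a direct ack.
    final class TriangleOriginator {

        /** Completes once every backup has confirmed the update. */
        CompletableFuture<Void> put(String key, String value, List<String> backups) {
            sendToPrimary(key, value);                   // O -> P (unordered)
            CompletableFuture<?>[] acks = backups.stream()
                    .map(this::ackFrom)                  // B -> O (direct)
                    .toArray(CompletableFuture[]::new);
            return CompletableFuture.allOf(acks);        // no ack from P needed
        }

        void sendToPrimary(String key, String value) { /* transport stub */ }

        CompletableFuture<Void> ackFrom(String backup) {
            // A real transport would complete this future when the backup's
            // ack message arrives; here it is only a stub.
            return new CompletableFuture<>();
        }
    }

The O = B case is the same code: the local backup's ack is just the
in-VM hand-off after the worker thread applies the update, and when a
previous value is needed O would additionally wait for P's separate
reply.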
>>>> I think that the source of optimization is that once primary decides to
>>>> backup the operation, he can forget about it and unlock the entry. So,
>>>> we don't need any ACK from primary unless it's an exception/noop
>>>> notification (as with conditional ops). If primary waited for ACK from
>>>> backup, we wouldn't save anything.
>>> About the interaction between P -> B, you're right. We don't need to
>>> wait for the ACKs if the messages are sent in FIFO order (and JGroups
>>> guarantees that).
>>>
>>> About the O -> P, IMO, the Originator should wait for the reply from
>>> the Backup.
>> I was never claiming otherwise; O always needs to wait for the ACKs from
>> Bs - only then can it successfully report that the value has been written
>> on all owners. What does this have to do with O -> P?
> Right, this is the thing I should have brought up during the
> meeting... if we only wait for the ack from one B, then P can crash
> after we confirmed to the application but before all Bs have received
> the update message, and there will be nobody to retransmit/retry the
> command => inconsistency.

We've been mixing the N-owners and 2-owners cases here a bit, so let me
clarify: anytime I've written that an ack is expected from B, I meant
from all backups (but not necessarily from primary).

The case with more backups also shows that when a return value other
than 'true/false = applied/did not apply the update' is needed, we
should send the response directly from P, because we don't want to
relay it through all Bs (or pick one 'special' backup).

>>> At least, the Primary would be the only one who needs to return
>>> the previous value (if needed) and it can return whether the operation
>>> succeeded or not.
>> Simple success: no P -> O, B -> O (success)
>> Simple failure/non-modifying operation (as with putIfAbsent/functional
>> call): P -> O (failure/custom value), no B -> O
>> Previous/custom value (as with replace() or a functional call): P -> O
>> (previous/custom value), B -> O (success); the alternative is P -> B
>> (previous/custom value, new value) and B -> O (previous/custom value)
>> Exception on either P or B: send the exception to O
>> Lost/timed-out P -> B: O times out waiting for the ack from B and
>> throws an exception
> Like I said above, I would prefer it if P would send the previous
> value directly to O (if necessary). Otherwise yeah, I don't see any
> problem with O waiting for replies from P and Bs in parallel.

Agreed.

> We've talked several times about removing the replication timeout and
> assuming that a node will always reply in a timely manner to a
> command, unless it's not available. Maybe this time we'll really do it
> :)

That would make sense to me once we have true async calls implemented -
then, if you want a timeout-able operation, you would just do
cache.putAsync().get(myTimeout). But I don't promote async calls as long
as they consume a thread from a limited threadpool.
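To illustrate the timeout-able variant (a sketch assuming the
CompletableFuture-returning putAsync() from Infinispan 8; the wrapper
class and the 500 ms value are mine, for illustration only):

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import org.infinispan.commons.api.BasicCache;

    // The replication timeout moves out of the cache configuration and
    // becomes the caller's per-operation choice.
    final class TimeoutPut {
        static String putWithTimeout(BasicCache<String, String> cache)
                throws InterruptedException, ExecutionException {
            CompletableFuture<String> f = cache.putAsync("key", "value");
            try {
                return f.get(500, TimeUnit.MILLISECONDS); // caller-chosen timeout
            } catch (TimeoutException e) {
                // Only our wait timed out; the write may still complete on
                // the owners, so the caller must treat the outcome as unknown.
                throw new IllegalStateException("write not confirmed in time", e);
            }
        }
    }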
>>> This way, it would avoid forking the code for each type
>>> of command without any benefit (I'm thinking of sending the reply to
>>> the originator in parallel with the update message to the backups).
>> What forking of the code for each type do you mean? I see that there are
>> two branches depending on whether the command is going to be replicated
>> to B or not.
> I believe Pedro was talking about having P send the previous value
> directly to O, and so having different handling of replies on O based
> on whether we expect a previous value or not. I'm not that worried
> about it, one way to handle the difference would be to use
> ResponseMode.GET_ALL when the previous value is needed, and GET_NONE
> otherwise.

If the first implementation does not support omitting the simple ack
P -> O, that's fine. But when designing, please don't block the path for
a nice optimization. (See the sketch at the end of this mail, below the
quoted text.)

> Anyway, I think instead of jumping into implementation and fixing bugs
> as they pop up, this time it may be better to build a model and
> validate it first... then we can discuss changing details on the
> model, and checking them as well. I volunteered to do this with
> Knossos, we'll see how that goes (and when I'll have the time to
> actually work on it...)

No objections :) If you get any interesting results from model checking,
I am all ears.

Radim

> Dan
>
>> Radim
>>
>>>> The gains are:
>>>> * fewer hops (3 instead of 4 if O != P && O != B)
>>>> * fewer messages (the primary's ACK is transitive, implied by the ack
>>>>   from B)
>>>> * shorter lock times (no locking during the P -> B RPC)
>>>>
>>>>>> However, the
>>>>>> Originator needs to wait for the ACK from the Primary because of
>>>>>> conditional operations and the functional API.
>>>>> If the operation is successful, the Primary will have to let the
>>>>> secondaries know so these can reply to the Originator directly: that
>>>>> still saves a hop.
>>> As I said above: "I'm thinking of sending the reply to the originator
>>> in parallel with the update message to the backups"
>>>
>>>>>> In the first case, if the conditional operation fails, the Backups
>>>>>> are not bothered. In the latter case, we may need the return value
>>>>>> from the function.
>>>>> Right, for a failed or rejected operation the secondaries won't even
>>>>> know about it,
>>>>> so the Primary is in charge of letting the Originator know.
>>>>> Essentially you're highlighting that the Originator needs to wait for
>>>>> either the response from the secondaries (all of them?)
>>>>> or from the Primary.
>>>>>
>>>>>>> I suspect the tricky part is what happens when the Primary owner rules
>>>>>>> +1 to apply the change, but then the backup owners (all or some of
>>>>>>> them) somehow fail before letting the Originator know. The Originator
>>>>>>> in this case should seek confirmation about its operation's state
>>>>>>> (success?) with the Primary owner; this implies that the Primary owner
>>>>>>> needs to keep track of what it has applied and track failures too, and
>>>>>>> this log needs to be pruned.
>>>> Currently, in case of a lost (timed-out) ACK from B to P, we just report
>>>> an exception and don't care about synchronizing P and B - B may already
>>>> store the updated value.
>>>> So we don't have to care about a rollback on P if replication to B fails
>>>> either - we just report that it's broken, sorry.
>>>> A better consolidation API would be nice, though, something like
>>>> cache.getAllVersions().
>>>>
>>>> Radim
>>>>
>>>>>>> Sounds pretty nice, or am I missing other difficulties?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sanne
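PS: the ResponseMode split Dan mentions could look roughly like this (a
sketch only; needsPreviousValue is a hypothetical flag standing in for
the command's own metadata, e.g. isReturnValueExpected()):

    import org.jgroups.blocks.RequestOptions;
    import org.jgroups.blocks.ResponseMode;

    // Block for the primary's reply only when the command has to return
    // the previous/custom value; otherwise rely on the direct B -> O acks.
    final class ResponseModeChoice {
        static RequestOptions optionsFor(boolean needsPreviousValue, long timeoutMs) {
            ResponseMode mode = needsPreviousValue
                    ? ResponseMode.GET_ALL   // wait for P's reply
                    : ResponseMode.GET_NONE; // simple ack comes from the Bs
            return new RequestOptions(mode, timeoutMs);
        }
    }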
--
Radim Vansa <[email protected]>
JBoss Performance Team

_______________________________________________
infinispan-dev mailing list
[email protected]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
