> On Dec 9, 2016, at 10:08 AM, Randall Hauch <rha...@redhat.com> wrote:
> 
>> 
>> On Dec 9, 2016, at 3:13 AM, Radim Vansa <rva...@redhat.com 
>> <mailto:rva...@redhat.com>> wrote:
>> 
>> On 12/08/2016 10:13 AM, Gustavo Fernandes wrote:
>>> 
>>> I recently updated a proposal [1] based on several discussions we had 
>>> in the past that is essentially about introducing an event storage 
>>> mechanism (write ahead log) in order to improve reliability, failover 
>>> and "replayability" for the remote listeners, any feedback greatly 
>>> appreciated.
>> 
>> Hi Gustavo,
>> 
>> while I really like the pull-style architecture and reliable events, I 
>> see some problematic parts here:
>> 
>> 1) 'cache that would persist the events with a monotonically increasing id'
>> 
>> I assume that you mean globally (for all entries) monotonous. How will 
>> you obtain such ID? Currently, commands have unique IDs that are 
>> <Address, Long> where the number part is monotonous per node. That's 
>> easy to achieve. But introducing globally monotonous counter means that 
>> there will be a single contention point. (you can introduce another 
>> contention points by adding backups, but this is probably unnecessary as 
>> you can find out the last id from the indexed cache data). Per-segment 
>> monotonous would be probably more scalabe, though that increases complexity.
> 
> It is complicated, but one way to do this is to have one “primary” node 
> maintain the log and to have other replicate from it. The cluster does need 
> to use consensus to agree which is the primary, and to know which secondary 
> becomes the primary if the primary is failing. Consensus is not trivial, but 
> JGroups Raft (http://belaban.github.io/jgroups-raft/ 
> <http://belaban.github.io/jgroups-raft/>) may be an option. However, this 
> approach ensures that the replica logs are identical to the primary since 
> they are simply recording the primary’s log as-is. Of course, another 
> challenge is what happens during a failure of the primary log node, and can 
> any transactions be performed/completed while the primary is unavailable.
> 
> Another option is to have each node maintain their own log, and to have an 
> aggregator log that merges/combines the various logs into one. Not sure how 
> feasible it is to merge logs by getting rid of duplicates and determining a 
> total order, but if it is then it may have better fault tolerance 
> characteristics.
> 
> Of course, it is possible to have node-specific monotonic IDs. For example, 
> MySQL Global Transaction IDs (GTIDs) use a unique UUID for each node, and 
> then GTIDs consists of the node’s UUID plus a monotonically-increasing value 
> (e.g., “31fc48cd-ecd4-46ad-b0a9-f515fc9497c4:1001”). The transaction log 
> contains a mix of GTIDs, and MySQL replication uses a “GTID set” to describe 
> the ranges of transactions known by a server (e.g., 
> “u1:1-100,u2:1-10000,u3:3-5” where “u1”, “u2”, and “u3” are actually UUIDs). 
> So, when a MySQL replica connects, it says “I know about this GTID set", and 
> this tells the master where that client wants to start reading.

Emmanuel and I were talking offline. Another approach entirely is to have each 
node (optionally) write the changes it is making as a leader directly to Kafka, 
meaning that Kafka becomes the event log and delivery mechanism. Upon failure 
of that node, the node that becomes the new leader would write any of its 
events not already written by the former leader, and then continue writing new 
changes it is making as a leader. Thus, Infinispan would not be producing a 
single log with total order of all changes to a cache (which there isn’t one in 
Infinispan), but rather the total order of each key. (Kafka does this very 
nicely via topic partitions, where all changes for each key always get written 
to the same partition, and each partition has a total order.) This approach may 
still need separate “commit” events to reflect how Infinispan currently works 
internally.

Obviously Infinispan wouldn’t require this to be done, but when it’s enabled it 
might provide a much simpler way of capturing the history of changes to the 
events in an Infinispan cache. The HotRod client could consume the events 
directly from Kafka, or that could be left to a completely different 
client/utility. It does add a dependency on Kafka, but it means the Infinispan 
community doesn’t need to build much of the same functionality.

> 
>> 
>> 2) 'The write to the event log would be async in order to not affect 
>> normal data writes'
>> 
>> Who should write to the cache?
>> a) originator - what if originator crashes (despite the change has been 
>> added)? Besides, originator would have to do (async) RPC to primary 
>> owner (which will be the primary owner of the event, too).
>> b) primary owner - with triangle, primary does not really know if the 
>> change has been written on backup. Piggybacking that info won't be 
>> trivial - we don't want to send another message explicitly. But even if 
>> we get the confirmation, since the write to event cache is async, if the 
>> primary owner crashes before replicating the event to backup, we lost 
>> the event
>> c) all owners, but locally - that will require more complex 
>> reconciliation if the event did really happen on all surviving nodes or 
>> not. And backups could have some trouble to resolve order, too.
>> 
>> IIUC clustered listeners are called from primary owner before the change 
>> is really confirmed on backups (@Pedro correct me if I am wrong, 
>> please), but for this reliable event cache you need higher level of 
>> consistency.
> 
> This could be handled by writing a confirmation or “commit” event to the log 
> when the write is confirmed or the transaction is committed. Then, only those 
> confirmed events/transactions would be exposed to client listeners. This 
> requires some buffering, but this could be done in each HotRod client.
> 
>> 
>> 3) The log will also have to filter out retried operations (based on 
>> command ID - though this can be indexed, too). Though, I would prefer to 
>> see per-event command-id log to deal with retries properly.
> 
> IIUC, a “commit” event would work here, too.
> 
>> 
>> 4) Client should pull data, but I would keep push notifications that 
>> 'something happened' (throttled on server). There could be use case for 
>> rarely updated caches, and polling the servers would be excessive there.
> 
> IMO the clients should poll, but if the server has nothing to return it 
> blocks until there is something or until a timeout occurs. This makes it easy 
> for clients and actually reduces network traffic compared to constantly 
> polling.
> 
> BTW, a lot of this is replicating the functionality of Kafka, which is 
> already quite mature and feature rich. It’s actually possible to *embed* 
> Kafka to simplify operations, but I don’t think that’s recommended. And, it 
> introduces a very complex codebase that would need to be supported.
> 
>> 
>> Radim
>> 
>>> 
>>> 
>>> [1] 
>>> https://github.com/infinispan/infinispan/wiki/Remote-Listeners-improvement-proposal
>>>  
>>> <https://github.com/infinispan/infinispan/wiki/Remote-Listeners-improvement-proposal>
>>> 
>>> Thanks,
>>> Gustavo
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev@lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> 
>> 
>> -- 
>> Radim Vansa <rva...@redhat.com <mailto:rva...@redhat.com>>
>> JBoss Performance Team
>> 
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev@lists.jboss.org <mailto:infinispan-dev@lists.jboss.org>
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> 
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev@lists.jboss.org <mailto:infinispan-dev@lists.jboss.org>
> https://lists.jboss.org/mailman/listinfo/infinispan-dev 
> <https://lists.jboss.org/mailman/listinfo/infinispan-dev>
_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Reply via email to