My team uses akka-persistence 2.3.3 and akka-persistence-cassandra 0.3.1.
Recently, in production, we observed a View that did not appear to be
polling as expected. The application had been running for about 12 hours
(and has previously run for much longer without issue). Updates to
Cassandra did not propagate to the consuming application. The consumer did
not emit any error-level logging (in production, logging is set to error).
The application runs on multiple nodes. Restarting one application
instance fixed the issue (i.e. the View read all events on start-up and
continued polling as expected).

With limited instrumentation available, there is not much else that I can
state with certainty about the running instances with suspected broken
Views. The View actor is created with the default supervision strategy
(i.e. restart on exception), which should rule out the scenario that the
actor was stopped. We also confirmed in local tests that the actor is
restarted when an exception is thrown.

Our hypothesis is that a call to Cassandra via the
akka-persistence-cassandra journal never returned. There are several
issues related to the DataStax driver (e.g.
https://datastax-oss.atlassian.net/browse/JAVA-268) that might be at play
here. These issues appear to be resolved in driver version 2.0.4, while
akka-persistence-cassandra is compiled against 2.0.1. My team will upgrade
accordingly.

Assuming this is the issue, I want to voice my concern about how
akka-persistence handles journals that fail to return a response.
Following the code, akka.persistence.Recovery tells the journal to read:

    journal ! ReplayMessages(lastSequenceNr + 1L, toSnr, replayMax,
      processorId, self)

Then, based on the response type (success/failure), appropriate callbacks 
are invoked until ultimately in View, onReplayComplete() is invoked.  This 
function is responsible for scheduling the next polling attempt.  If the 
journal fails to respond, then the View never seeks to poll again because 
there is no timeout mechanism (that I am aware of).

If this reasoning holds water, would it make sense to consider adding a
timeout to the View to ensure it continues to attempt polling for
updates? It could also make sense to add a policy for reporting an
error when this stale condition is detected. I'm happy to think through
the proposed enhancements further should the hypothesis be validated.
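
To make the proposal concrete, here is a minimal sketch of the bookkeeping
I have in mind. It is a standalone model, not akka-persistence code: the
names PollWatchdog, replayTimeoutMillis, replayRequested, replayCompleted,
tick and staleReplays are all hypothetical, and timestamps are passed in
explicitly so the logic can be checked without an actor system. How this
would be wired into the View's internal state handling is an open question.

```scala
// Hypothetical model of a replay timeout; not part of akka-persistence.
sealed trait PollState
case object Idle extends PollState
final case class AwaitingReplay(startedAtMillis: Long) extends PollState

final class PollWatchdog(replayTimeoutMillis: Long) {
  private var state: PollState = Idle
  private var staleCount: Int = 0

  // Call when the replay request is sent to the journal.
  def replayRequested(nowMillis: Long): Unit =
    state = AwaitingReplay(nowMillis)

  // Call when the journal answers, whether with success or failure.
  def replayCompleted(): Unit =
    state = Idle

  // Call periodically. Returns true when an outstanding replay has
  // exceeded the timeout, i.e. the caller should report an error and
  // schedule a fresh polling attempt.
  def tick(nowMillis: Long): Boolean = state match {
    case AwaitingReplay(startedAt)
        if nowMillis - startedAt >= replayTimeoutMillis =>
      staleCount += 1
      state = Idle
      true
    case _ =>
      false
  }

  // Number of replays that were abandoned as stale so far.
  def staleReplays: Int = staleCount
}
```

A caller would invoke replayRequested when the replay request goes out,
replayCompleted from the existing success/failure callbacks, and tick from
a periodic timer. A true result from tick is the point at which an error
would be logged and the next poll scheduled.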
