We are experiencing an issue with our view change/recovery protocol during
the interactive mode of ResilientDB. This issue has been documented here
<https://github.com/apache/incubator-resilientdb/issues/128>. The summary
is provided below, and we will use this thread to explore and fix this
issue.

When a view change is forcibly induced on ResilientDB, two issues can
currently occur:

   1. After the view change occurs, a replica does not continue the
   transaction it was performing when the view change occurred.
   2. After the view change occurs, a replica is unable to
   receive/recognize transaction messages being sent towards it.

From testing, type 1 most likely arises when the view change messages are
received and processed. They interrupt a process that is midway, and the
program is unsure where to continue. The problem goes away when
start_kv_service.sh is rerun, so it is likely a runtime timing issue.

Type 2 is most likely an issue with the ports or memory. What I have
noticed is that once a view change runs into type 2, every subsequent view
change attempted in that computer session also results in a type 2 error.
Meanwhile, if the view change works fine the first time, all subsequent
start_kv_service.sh runs in that computer session (with the code unchanged)
avoid the type 2 issue.
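
To test the port hypothesis for type 2, one option is to check whether the
replica ports are still held by a leftover process before rerunning the
service. The following is a minimal sketch in Python, not part of
ResilientDB; the helper name port_is_free is hypothetical, and the port
numbers to check have to be taken from the local deployment config.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now.

    Hypothetical diagnostic helper: a False result means some process
    (possibly a replica left over from an earlier run) still holds the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True
```

If a port reports as busy in a session that keeps hitting type 2, that
would point toward stale processes or sockets rather than the protocol
logic.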

There is also an issue where sometimes, after a view change completes, 30+
prepare and commit messages are collected for the next transaction, and
transactions that were never sent are logged. As a result, the executed
count and the collected prepare message count are both higher than they
should be.
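
To tell whether the extra prepare and commit messages are duplicates, it
may help to count distinct senders per (view, sequence number, phase) and
compare that with the raw count. The sketch below is only a diagnostic aid
in Python; the record fields (view, seq, type, sender) and the function
name are assumptions, not ResilientDB's actual log format or API.

```python
from collections import defaultdict


def count_unique_votes(messages, num_replicas):
    """Summarize collected messages per (view, seq, type).

    `messages` is an iterable of dicts with the assumed keys
    "view", "seq", "type" and "sender". A raw count above
    `num_replicas` cannot come from one vote per replica, so it is
    flagged as suspicious.
    """
    senders = defaultdict(set)
    raw = defaultdict(int)
    for m in messages:
        key = (m["view"], m["seq"], m["type"])
        senders[key].add(m["sender"])
        raw[key] += 1

    report = {}
    for key, ids in senders.items():
        report[key] = {
            "raw": raw[key],          # messages collected, duplicates included
            "unique": len(ids),       # distinct replicas that sent them
            "suspicious": raw[key] > num_replicas,
        }
    return report
```

If the unique count stays at or below the number of replicas while the raw
count is 30+, the same messages are being counted more than once.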


---
Best Regards,
Mohammad Sadoghi, PhD
Associate Professor
Exploratory Systems Lab (ExpoLab)
Department of Computer Science
University of California, Davis

ExpoLab: https://expolab.org/
ResilientDB: https://resilientdb.com/
Phone: 914-319-7937
