We are experiencing an issue with our view change/recovery protocol during the interactive mode of ResilientDB. This issue has been documented here <https://github.com/apache/incubator-resilientdb/issues/128>. The summary is provided below, and we will use this thread to explore and fix this issue.
When a view change is forcibly induced on the ResilientDB system, two issues can currently occur:

1. After the view change occurs, a replica does not continue the transaction it was performing when the view change occurred.
2. After the view change occurs, a replica is unable to receive/recognize transaction messages being sent towards it.

From testing, type 1 is most likely triggered when the view change messages are received and take effect: they interrupt some process midway, and the program is unsure where to continue. This problem goes away if start_kv_service.sh is rerun, so it is likely a runtime timing issue.

Type 2 is most likely an issue with the ports or memory. What I have noticed is that once a view change runs into type 2, every view change I try afterwards in that computer session also results in a type 2 error. Meanwhile, if the view change works fine the first time, all subsequent start_kv_service.sh runs in that computer session (with the code unchanged) avoid the type 2 issue.

There is also an issue where sometimes, after a view change is done, 30+ prepare and commit messages are collected for the next transaction, and transactions that were never sent are logged, resulting in higher executed and prepare-message counts than expected.

---
Best Regards,
Mohammad Sadoghi, PhD
Associate Professor
Exploratory Systems Lab (ExpoLab)
Department of Computer Science
University of California, Davis
ExpoLab: https://expolab.org/
ResilientDB: https://resilientdb.com/
Phone: 914-319-7937