Hi,

I was the developer of the view-change module. When a view change occurs, a
replica switches to view-change mode and stops processing all messages
except those related to the view change. I therefore think the two issues
you mentioned are actually what the replicas are supposed to do during a
view change. If you see it differently, please provide a more detailed
description of the two issues.
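
To make the intended behavior concrete, here is a minimal Python sketch of
the filtering described above. It is not the ResilientDB code: the names
(MsgType, Replica, on_message, start_view_change) are hypothetical, and it
assumes a replica leaves view-change mode as soon as it accepts a NewView
message.

```python
from enum import Enum, auto


class MsgType(Enum):
    # Hypothetical message kinds, for illustration only.
    CLIENT_REQUEST = auto()
    PRE_PREPARE = auto()
    PREPARE = auto()
    COMMIT = auto()
    VIEW_CHANGE = auto()
    NEW_VIEW = auto()


# The only message kinds a replica handles while in view-change mode.
VIEW_CHANGE_TYPES = {MsgType.VIEW_CHANGE, MsgType.NEW_VIEW}


class Replica:
    def __init__(self):
        self.in_view_change = False
        self.view = 0
        self.processed = []  # messages that were handled
        self.dropped = []    # messages ignored during the view change

    def start_view_change(self):
        self.in_view_change = True

    def on_message(self, msg_type):
        """Return True if the message was processed, False if it was ignored."""
        if self.in_view_change and msg_type not in VIEW_CHANGE_TYPES:
            self.dropped.append(msg_type)
            return False
        self.processed.append(msg_type)
        if msg_type is MsgType.NEW_VIEW:
            # Assumption: accepting NewView ends view-change mode.
            self.view += 1
            self.in_view_change = False
        return True
```

Under these assumptions, a prepare or client request that arrives between
start_view_change() and the NewView message is ignored, which from the
outside would look like both of the reported symptoms.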

Although I don’t think the two issues you mentioned are bugs, the
unexpected number of collected prepare and commit messages is very likely a
bug. One possible cause is that, when broadcasting prepare and commit
messages after receiving a valid NewView message, the replicas fail to set
the “seq” field of the messages correctly. That is only a hypothesis; we
need to examine the logs to find the actual cause.
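
As a rough illustration of this hypothesis only (not the actual code path),
the Python sketch below shows how a wrong "seq" could inflate the count for
one transaction. All names (rebroadcast, count_by_seq) and all numbers (4
replicas, 8 pending requests, next sequence number 18) are made up, and it
assumes the collector tallies messages per sequence number.

```python
from collections import Counter


def rebroadcast(replica_ids, pending_seqs, next_seq, stamp_correctly):
    """Build the prepare messages each replica re-sends after NewView.

    With stamp_correctly=False, every message is stamped with next_seq
    instead of the sequence number of the request it belongs to.
    """
    msgs = []
    for rid in replica_ids:
        for seq in pending_seqs:
            msgs.append({"sender": rid, "seq": seq if stamp_correctly else next_seq})
    return msgs


def count_by_seq(msgs):
    """Tally collected messages per sequence number."""
    return Counter(m["seq"] for m in msgs)


replicas = [0, 1, 2, 3]        # illustrative: 4 replicas
pending = list(range(10, 18))  # illustrative: 8 requests pending at view change
next_seq = 18                  # illustrative: seq of the next transaction

good = count_by_seq(rebroadcast(replicas, pending, next_seq, True))
bad = count_by_seq(rebroadcast(replicas, pending, next_seq, False))

print(good[10], good[next_seq])  # 4 0
print(bad[next_seq])             # 32
```

With correct stamping each sequence number collects one message per
replica; with the wrong stamp they all pile up on a single sequence number.
Whether this is what really happens is exactly what the logs should tell
us.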


On Thu, Jan 11, 2024 at 9:28 PM Mohammad Sadoghi <mo.sado...@expolab.org>
wrote:

> We are experiencing an issue with our view change/recovery protocol during
> the interactive mode of ResilientDB. This issue has been documented here
> <https://github.com/apache/incubator-resilientdb/issues/128>. The summary
> is provided below, and we will use this thread to explore and fix this
> issue.
>
> When forcefully inducing a view change on the ResilientDB system, two issues
> can currently occur:
>
>    1. After the view change occurs, a replica does not continue the
>    transaction it was performing when the view change occurred.
>    2. After the view change occurs, a replica is unable to
>    receive/recognize transaction messages being sent towards it.
>
> From testing, type 1 most likely arises when the view change messages are
> received and processed. They interrupt some process midway, and the program
> is unsure where to continue. The problem goes away if start_kv_service.sh
> is rerun, so it is likely a runtime timing issue.
>
> Type 2 is most likely an issue with the ports or memory. What I have
> noticed is that once a view change runs into type 2, every view change I
> try afterwards in that computer session also results in a type 2 error.
> Meanwhile, if the view change works fine the first time, all subsequent
> start_kv_service.sh runs in that computer session (with the code
> unchanged) avoid the type 2 issue.
>
> There is also an issue where sometimes, after a view change is done, 30+
> prepare and commit messages are collected for the next transaction, and
> transactions that were never sent are logged, resulting in higher executed
> and collected-prepare-message counts than expected.
>
>
> ---
> Best Regards,
> Mohammad Sadoghi, PhD
> Associate Professor
> Exploratory Systems Lab (ExpoLab)
> Department of Computer Science
> University of California, Davis
>
> ExpoLab: https://expolab.org/
> ResilientDB: https://resilientdb.com/
> Phone: 914-319-7937
>


-- 
Best Regards,
Dakai Kang, PhD Student
Exploratory Systems Lab (ExpoLab)
Department of Computer Science
University of California, Davis
