Hi,

We are looking at how ZooKeeper handles silent data corruption caused by
faults in disks and in the file systems layered on top of them [1,2].

We set up a 3-node ZooKeeper cluster and introduced silent data corruption
into different blocks of the on-disk files. In all cases, ZooKeeper was
able to detect the corruption in the log file using checksums.
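
To make the detection step concrete, here is a minimal sketch of
checksum-verified log records and a single-bit corruption. The record
layout (8-byte checksum, 4-byte length, payload) and the helper names
pack_record, read_records and flip_bit are hypothetical and are not
ZooKeeper's actual on-disk format or API; Adler-32 is used only as an
example checksum.

```python
import struct
import zlib

# Hypothetical record header: 8-byte checksum, 4-byte payload length.
HEADER = struct.Struct(">qi")


def pack_record(payload: bytes) -> bytes:
    """Frame a payload with an Adler-32 checksum and its length."""
    return HEADER.pack(zlib.adler32(payload), len(payload)) + payload


def read_records(log: bytes):
    """Yield payloads in order; raise IOError on a checksum mismatch."""
    offset = 0
    while offset < len(log):
        checksum, length = HEADER.unpack_from(log, offset)
        offset += HEADER.size
        payload = log[offset:offset + length]
        offset += length
        if zlib.adler32(payload) != checksum:
            raise IOError("checksum mismatch before offset %d" % offset)
        yield payload


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of data with one bit flipped (a silent corruption)."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)
```

Reading an intact log returns every payload, while reading a log whose
payload has one flipped bit raises an error at the damaged record. The
sketch only detects the damage; it does not attempt any repair.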

However, on detecting a corruption, the ZooKeeper node on which the
corruption occurred crashes instead of trying to repair the corrupted data
automatically from the replicas. Why does ZooKeeper not repair the
corrupted entry from the replicas? What is the reason for this design
decision? Any insights on this would be appreciated.

[1] https://research.cs.wisc.edu/wind/Publications/zfs-corruption-fast10.pdf
[2] http://www.cs.toronto.edu/~bianca/papers/fast08.pdf

Thanks,
Aishwarya
