Hello,

I recently started expanding my Riak cluster and found that handoffs were being retried continuously for one partition.

Here are logs from two nodes: https://gist.github.com/vshabanov/41282e622479fbe81974

The most interesting parts of the logs are:

"Handoff receiver for partition ... exited abnormally after processing 2860338 objects: {{badarg,[{erlang,binary_to_term,..."

and

"bad argument in call to erlang:binary_to_term(<<131,104,...."

Both nodes are running Riak 1.3.2 (the old one was previously running 1.3.1).

When I printed the corrupted binary, I found that it corresponds to a single value. When I tried to "get" that value, the read succeeded, but the node holding the corrupted copy showed the same binary_to_term error. When I tried to delete the corrupted value, the request timed out.

The machines have ECC memory and use the ZFS filesystem (which doesn't report any checksum failures), so I doubt the data was silently corrupted on disk. The LOG file of the corresponding LevelDB partition doesn't show any errors. However, there is a lost/BLOCKS.bad file in this partition (7 KB, created more than a month ago, and it doesn't appear to contain the corrupted value).

For now I've stopped handoffs using "riak-admin transfer-limit 0".

Why was the value corrupted? Is there any way to remove it or fix it?
