Corrupted Erlang binary term inside LevelDB

Vladimir Shabanov Wed, 24 Jul 2013 18:42:13 -0700

Hello,

I recently started expanding my Riak cluster and found that handoffs
for one partition are being retried continuously.


Here are the logs from the two nodes:
https://gist.github.com/vshabanov/41282e622479fbe81974

The most interesting parts of the logs are
"Handoff receiver for partition ... exited abnormally after processing
2860338 objects: {{badarg,[{erlang,binary_to_term,..."
and
"bad argument in call to erlang:binary_to_term(<<131,104,...."

Both nodes are running Riak 1.3.2 (the old one was previously running 1.3.1).
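
For readers decoding the log excerpt above: the leading bytes 131,104 match
the start of Erlang's external term format, where 131 is the version byte
and 104 is the tag for a small tuple. A header that looks valid would
suggest the damage lies further into the binary, but the log is truncated,
so the excerpt alone cannot confirm that. Below is a minimal Python sketch
of inspecting such a header. The function name describe_term_header and the
sample bytes are hypothetical, not taken from the logs, and only a handful
of tags are listed.

```python
# A few tags from Erlang's external term format (a subset, for illustration).
VERSION_MAGIC = 131
TAG_NAMES = {
    97: "SMALL_INTEGER_EXT",
    98: "INTEGER_EXT",
    104: "SMALL_TUPLE_EXT",
    105: "LARGE_TUPLE_EXT",
    106: "NIL_EXT",
    107: "STRING_EXT",
    108: "LIST_EXT",
    109: "BINARY_EXT",
}


def describe_term_header(data: bytes) -> str:
    """Describe the first bytes of an external-term-format binary.

    Hypothetical helper: it only looks at the header and does not
    validate the rest of the term.
    """
    if len(data) < 2:
        return "too short to contain a version byte and a tag"
    if data[0] != VERSION_MAGIC:
        return f"unexpected version byte {data[0]} (expected 131)"
    tag = data[1]
    if tag == 104:
        # SMALL_TUPLE_EXT is followed by a one-byte arity.
        if len(data) < 3:
            return "SMALL_TUPLE_EXT, arity byte missing"
        return f"SMALL_TUPLE_EXT, arity {data[2]}"
    return TAG_NAMES.get(tag, "tag not in this sketch's table")


# Made-up sample: the tuple {1, 2} encoded as two small integers.
sample = bytes([131, 104, 2, 97, 1, 97, 2])
print(describe_term_header(sample))
```

A check like this only shows whether the prefix is plausible. Finding the
exact point where decoding fails would still require running
binary_to_term on the full binary.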


When I printed the corrupted binary string, I found that it corresponds to
a single value.

When I tried to "get" it, the read succeeded, but the node holding the
corrupted value showed the same binary_to_term error.

When I tried to delete the corrupted value, I got a timeout.


I'm running machines with ECC memory and the ZFS filesystem (which doesn't
report any checksum failures), so I doubt the data was silently corrupted
on disk.

The LOG file from the corresponding LevelDB partition doesn't show any
errors. But there is a lost/BLOCKS.bad file in this partition (7 KB, created
more than a month ago, and it doesn't appear to contain the corrupted value).

For now I've stopped handoffs using "riak-admin transfer-limit 0".

Why was the value corrupted? Is there any way to remove it or fix it?

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
