Vladimir,

You have one of two problems: either there is something in one of the existing .sst files causing compaction to crash, or there is some type of threading problem that crashes during the hash/bloom creation. I am going to assume an .sst file problem for now.
The last time you crashed, only 6 vnodes were actively compacting. Their directory names start as follows:

21694
23977
473846
65082
7478777
7707137

I leave it to you to copy and paste the full directory names.

The first step is to find which directory is crashing. Use the instructions here

https://gist.github.com/gburd/b88aee6da7fee81dc036

BUT instead of "eleveldb:repair", use "eleveldb:open". Wait five minutes after each open to see whether that vnode's compactions run to completion without crashing. If one crashes, that is where the problem exists. If none of them crash, then there is a threading / race condition … and I will have to think on how to diagnose.

Matthew

On Jun 12, 2013, at 6:09 PM, Vladimir Shabanov <[email protected]> wrote:

> Unfortunately I don't see any 'Compaction error' in my logs.
>
> Should I run eleveldb:repair() on all partitions? And if so, how can I run a
> function for all files in a directory in Erlang? Running the function manually
> for all partitions is too tiresome.
>
>
> 2013/6/13 Matthew Von-Maszewski <[email protected]>
> Vladimir,
>
> Also, my colleague sent this to me after my first email. This is roughly the
> plan I intend to follow … if the LOG file gives us a direct pointer to a file
> corruption:
>
> https://gist.github.com/gburd/b88aee6da7fee81dc036
>
> However, your crash point suggests we might have to do a bit more work to
> isolate the bad input file. But I would be happy to be wrong and the above
> work as is.
>
> Matthew
>
>
> On Jun 12, 2013, at 5:13 PM, Matthew Von-Maszewski <[email protected]> wrote:
>
>> Vladimir,
>>
>> I asked around the Basho chat room, and you have a crash that has never been
>> seen. This should be interesting.
>>
>> The crash is happening during a compaction, specifically during the creation
>> of the bloom filter for a new .sst file.
>> Maybe we can isolate the old file
>> that is feeding this compaction and move it out of the way for further
>> debugging … and get you running while the debugging happens off-line.
>>
>> Would you tar/zip the following files (changing the paths as appropriate for
>> your system):
>>
>> tar -czf vladimir_LOGs.tgz /var/lib/riak/leveldb/*/LOG*
>>
>> and your app.config file.
>>
>> I will see if I can determine where the bad input file resides and help you
>> get back running. Then we can decide how to look deeper for root cause.
>>
>> Matthew
>>
>>
>> On Jun 12, 2013, at 4:02 PM, Vladimir Shabanov <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I have a cluster of 8 Riak-1.3.1 nodes. Recently one of my nodes silently
>>> crashed. Nothing unusual was reported in the logs.
>>>
>>> When I tried to start the node again, it worked for a few seconds and then
>>> silently crashed again. I ran 'riak console' and saw "Segmentation fault".
>>>
>>> gdb with the dumped core shows:
>>>
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  0x00007f162547fa30 in MurmurHash64A(void const*, int, unsigned int) ()
>>>    from /tank/riak-1.3.1/lib/eleveldb-1.3.0/priv/eleveldb.so
>>>
>>> The backtrace shows that it happens somewhere in LevelDB compaction.
>>>
>>> (gdb) bt
>>> #0  0x00007f162547fa30 in MurmurHash64A(void const*, int, unsigned int) ()
>>>    from /tank/riak-1.3.1/lib/eleveldb-1.3.0/priv/eleveldb.so
>>> #1  0x00007f162547833c in leveldb::(anonymous namespace)::BloomFilterPolicy2::CreateFilter(leveldb::Slice const*, int, std::string*) const ()
>>>    from /tank/riak-1.3.1/lib/eleveldb-1.3.0/priv/eleveldb.so
>>> #2  0x00007f162548382d in leveldb::FilterBlockBuilder::GenerateFilter() ()
>>>    from /tank/riak-1.3.1/lib/eleveldb-1.3.0/priv/eleveldb.so
>>> #3  0x00007f1625483a58 in leveldb::FilterBlockBuilder::StartBlock(unsigned long) ()
>>>    from /tank/riak-1.3.1/lib/eleveldb-1.3.0/priv/eleveldb.so
>>> #4  0x00007f1625475175 in leveldb::TableBuilder::Flush() ()
>>>    from /tank/riak-1.3.1/lib/eleveldb-1.3.0/priv/eleveldb.so
>>> #5  0x00007f1625475395 in leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&) ()
>>>    from /tank/riak-1.3.1/lib/eleveldb-1.3.0/priv/eleveldb.so
>>> #6  0x00007f162545b561 in leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*) ()
>>>    from /tank/riak-1.3.1/lib/eleveldb-1.3.0/priv/eleveldb.so
>>> #7  0x00007f162545bd3b in leveldb::DBImpl::BackgroundCompaction() ()
>>>    from /tank/riak-1.3.1/lib/eleveldb-1.3.0/priv/eleveldb.so
>>> #8  0x00007f162545ca5d in leveldb::DBImpl::BackgroundCall() ()
>>>    from /tank/riak-1.3.1/lib/eleveldb-1.3.0/priv/eleveldb.so
>>> #9  0x00007f162547bb38 in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) ()
>>>    from /tank/riak-1.3.1/lib/eleveldb-1.3.0/priv/eleveldb.so
>>> #10 0x00007f163366ab50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
>>> #11 0x00007f16331aca7d in clone () from /lib/x86_64-linux-gnu/libc.so.6
>>> #12 0x0000000000000000 in ?? ()
>>>
>>> gdb output in gist:
>>> https://gist.github.com/vshabanov/5768546
>>>
>>> Why is this happening, and how can I bring the node back to life?
>>> _______________________________________________
>>> riak-users mailing list
>>> [email protected]
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>
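Matthew's open-one-vnode-at-a-time procedure, together with Vladimir's question about applying a function to every partition directory, could be sketched in the Erlang shell roughly as follows. This is a sketch under assumptions, not the gist's exact steps: it assumes Riak is stopped, eleveldb is on the code path (e.g. `erl -pa /tank/riak-1.3.1/lib/eleveldb-1.3.0/ebin`), and the data_root below matches app.config.

```erlang
%% All LevelDB partition directories (adjust path to your data_root).
Dirs = filelib:wildcard("/var/lib/riak/leveldb/*").

%% Vnodes whose LOG mentions a compaction error, if any were recorded.
Suspects = [D || D <- Dirs,
                 case file:read_file(filename:join(D, "LOG")) of
                     {ok, Log} ->
                         binary:match(Log, <<"Compaction error">>) =/= nomatch;
                     _ ->
                         false
                 end].

%% Open each vnode in turn, then wait so background compactions can run;
%% the directory whose open/compaction crashes the VM holds the bad .sst.
lists:foreach(
  fun(D) ->
          io:format("opening ~s~n", [D]),
          {ok, _Ref} = eleveldb:open(D, [{create_if_missing, false}]),
          timer:sleep(timer:minutes(5))
  end,
  Dirs).
```

The five-minute sleep mirrors Matthew's suggested wait; the open handles are deliberately kept alive for the whole run so compactions can proceed in the background.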
