Arun,

You are running out of RAM for the leveldb AAE.  There are several ways to fix 
that:

- reduce the memory allocated to bitcask
- add more memory per server
- add more servers with the same amount of memory
- reduce the ring size from 64 to 8, and rebuild data within the cluster from 
scratch
- lie to leveldb and give it a bigger-than-real memory setting in riak.conf 
(sketched below):
        leveldb.maximum_memory=8G
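
For reference, a minimal riak.conf sketch of those last two options (the 
values are illustrative, not recommendations, and ring_size only takes effect 
on a cluster whose data is rebuilt from scratch):

        ring_size = 8
        ## absolute cap, as above ...
        leveldb.maximum_memory = 8GB
        ## ... or, if you prefer, as a percentage of the node's RAM:
        ## leveldb.maximum_memory.percent = 70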


The key LOG lines are:

Options.total_leveldb_mem: 2,901,766,963    <-- this is the total memory 
    assigned to ALL of leveldb, but only 20% of it goes to AAE vnodes

File cache size: 5833527      <-- the first vnode says, cool, enough memory for me
Block cache size: 7930679     <-- ditto

  ... but as more vnodes start:

File cache size: 0            <-- things are just not going to work well
Block cache size: 0
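
Illustrative arithmetic (assuming, just for this example, that all 64 AAE 
vnodes land on one node; divide by your node count otherwise):

        2,901,766,963 * 0.20  =  ~580 MB shared by all AAE vnodes
        580 MB / 64 vnodes    =  ~9 MB each, and the early starters grab 
                                 theirs first, leaving zeros for the rest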

There are no actual file system error messages in your LOG files.  That 
supports the theory that the real problem is memory unhappiness, not disk 
corruption.
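
To double-check that across every AAE vnode's LOG (the path assumes a default 
package install; adjust for your platform_data_dir):

        find /var/lib/riak/anti_entropy -name LOG | xargs grep -H "cache size"
        find /var/lib/riak/anti_entropy -name LOG | xargs grep -Hi error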

Matthew


> On Feb 14, 2017, at 3:34 PM, Arun Rajagopalan <[email protected]> 
> wrote:
> 
> Hi Matthew, Magnus
> 
> I have attached the log files for your review
> 
> Thanks
> Arun
> 
> 
> On Tue, Feb 14, 2017 at 11:55 AM, Matthew Von-Maszewski <[email protected]> wrote:
> Arun,
> 
> The AAE code uses leveldb for its storage of anti-entropy data, no matter 
> which backend holds the user data.  Therefore the error below suggests 
> corruption within leveldb files (which is not impossible, but becoming really 
> rare except with bad hardware or full disks).
> 
> Before wiping out the AAE directory, you should copy the LOG file within it.  
> There are likely more useful error messages within that file ... maybe put 
> the file in Dropbox or zip and attach it to a reply for us to review.
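> 
> Something like this would do it (the path assumes a default package install; 
> substitute the partition id from your console.log):
> 
>         cp /var/lib/riak/anti_entropy/<partition_id>/LOG ~/aae-LOG-<partition_id>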
> 
> Matthew
> 
>> On Feb 14, 2017, at 10:42 AM, Magnus Kessler <[email protected]> wrote:
>> 
>> On 14 February 2017 at 14:46, Arun Rajagopalan <[email protected]> wrote:
>> Hi Magnus
>> 
>> Riak crashes on startup when I have a truncated bitcask file.
>> 
>> It also crashes when the AAE files are bad, I think.  Example below:
>> 
>> 2017-02-13 21:18:30 =CRASH REPORT====
>>   crasher:
>>     initial call: riak_kv_index_hashtree:init/1
>>     pid: <0.6037.0>
>>     registered_name: []
>>     exception exit: {{{badmatch,{error,{db_open,"Corruption: truncated record at end of file"}}},
>>       [{hashtree,new_segment_store,2,[{file,"src/hashtree.erl"},{line,675}]},
>>        {hashtree,new,2,[{file,"src/hashtree.erl"},{line,246}]},
>>        {riak_kv_index_hashtree,do_new_tree,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,610}]},
>>        {lists,foldl,3,[{file,"lists.erl"},{line,1248}]},
>>        {riak_kv_index_hashtree,init_trees,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,474}]},
>>        {riak_kv_index_hashtree,init,1,[{file,"src/riak_kv_index_hashtree.erl"},{line,268}]},
>>        {gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},
>>        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]},
>>      [{gen_server,init_it,6,[{file,"gen_server.erl"},{line,328}]},
>>       {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
>>     ancestors: [<0.715.0>,riak_core_vnode_sup,riak_core_sup,<0.160.0>]
>>     messages: []
>>     links: []
>>     dictionary: []
>>     trap_exit: false
>>     status: running
>>     heap_size: 1598
>>     stack_size: 27
>>     reductions: 889
>>   neighbours:
>> 
>> 
>> Regards
>> Arun
>> 
>> 
>> Hi Arun,
>> 
>> The crash log you provided shows that there is a corrupted file in the AAE 
>> (anti_entropy) backend. Entries in console.log should have more information 
>> about which partition is affected. Please post output from the affected node 
>> at around 2017-02-13T21:18:30. As this is AAE data, it is safe to remove the 
>> directory named after the affected partition from the anti_entropy 
>> directory before restarting the node. You may find that more than one 
>> partition is affected; the next one will only be encountered on the next 
>> attempted restart. If so, simply identify the next partition in the same 
>> way and remove it, too, until the node starts up successfully again.
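>> 
>> For example, something along these lines (the paths assume a default 
>> package install; the partition id comes from the console.log entries above):
>> 
>>     grep -i corruption /var/log/riak/console.log
>>     riak stop
>>     mv /var/lib/riak/anti_entropy/<partition_id> /tmp/
>>     riak start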
>> 
>> Is there a reason why the nodes aren't shut down in the regular way?
>> 
>> Kind Regards,
>> 
>> Magnus
>> 
>> 
>> 
>> -- 
>> Magnus Kessler
>> Client Services Engineer
>> Basho Technologies Limited
>> 
>> Registered Office - 8 Lincoln’s Inn Fields London WC2A 3BP Reg 07970431
> 
> 
> <aaeLOG.tar>

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
