Still, doesn't that failure point to a typical overload of Riak's use of 
mochiglobal (i.e., the code_server needing to lock all Erlang schedulers)?  I 
understand that running more than one node on a single machine is not a 
realistic deployment.  However, I don't see why it would cause errors, unless 
Riak was unable to handle the incoming requests.
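
(For anyone retracing the test described downthread: the ingest path is 
ordinary HTTP against Riak, with the two secondary indexes passed as 
x-riak-index-* headers on each write. A minimal node.js sketch of building 
such a request follows; the host, port, bucket, and index names are 
illustrative assumptions, not taken from the thread.)

```javascript
// Sketch: build an HTTP request description for storing one log event in
// Riak with two secondary indexes (2i). Nothing here is sent over the wire.
function buildInsertRequest(bucket, event) {
  return {
    host: '127.0.0.1',
    port: 8098,                // default Riak HTTP port
    method: 'POST',            // POST with no key lets Riak generate the id
    path: '/buckets/' + encodeURIComponent(bucket) + '/keys',
    headers: {
      'content-type': 'application/json',
      // 2i entries travel as headers: _bin for string, _int for integer
      'x-riak-index-session_bin': event.sessionId,
      'x-riak-index-timestamp_int': String(event.timestamp)
    },
    body: JSON.stringify(event)
  };
}

// Example (illustrative values):
var req = buildInsertRequest('log_events', {
  sessionId: 'abc123',
  timestamp: 1349126400,
  message: 'user logged in'
});
```

Pass an object like `req` to node's `http.request` to actually issue the 
write; retries on overload would wrap that call.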

On 10/01/2012 01:54 PM, Alexander Sicular wrote:
> Any time you overload one box you run into all sorts of I/O dreck; screw with 
> your conf files and mess with your versions and you have too many variables 
> in the mix to get anything meaningful out of what you were trying to do. 
> Since this is a test, just tear the whole thing down and start clean. 
>
> If you want to dev test your app, just use one node and dial the n_val down to 
> one in app.config. That setting isn't actually there by default, so you'll 
> have to add it manually to the riak_core section, like so (with some other stuff):
>
> {default_bucket_props, [{n_val,1},
>    {allow_mult,false},
>    {last_write_wins,false},
>    {precommit, []},
>    {postcommit, []},
>    {chash_keyfun, {riak_core_util, chash_std_keyfun}}
> ]}  
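>
> A rough sketch of how that block nests inside app.config (other entries 
> elided; values illustrative, so check them against your own file):
>
> ```erlang
> %% app.config (sketch; only the relevant nesting is shown)
> [
>  {riak_core, [
>          %% ...other riak_core settings...
>          {default_bucket_props, [{n_val, 1},
>                                  {allow_mult, false}]}
>         ]},
>  {riak_kv, [
>          %% ...storage_backend and friends stay as they are...
>         ]}
> ].
> ```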
>
> (Hey Basho people, that stuff should be in the app.config file by default. 
> Making people go fish for it and figure out how and where to add this stuff 
> is kinda unnecessary. Here is an example of a great conf file with everything 
> you can conf and a whole bunch of docs: 
> https://github.com/antirez/redis/blob/unstable/redis.conf ).
>
> If you want to performance test your app make your dev system as similar to 
> your prod system as possible and knock it out.
>
>
> -Alexander Sicular
>
> @siculars
>
> On Oct 1, 2012, at 4:30 PM, Callixte Cauchois wrote:
>
>> Thank you, but can you explain a bit more?
>> I mean, I understand why it is a bad thing with regard to reliability and in 
>> case of hardware issues. But does it also have an impact on the behaviour 
>> when the hardware is performing correctly and the load on the machines is 
>> the same?
>>
>> On Mon, Oct 1, 2012 at 1:25 PM, Alexander Sicular <[email protected] 
>> <mailto:[email protected]>> wrote:
>>
>>     Inline.
>>
>>     -Alexander Sicular
>>
>>     @siculars
>>
>>     On Oct 1, 2012, at 3:23 PM, Callixte Cauchois wrote:
>>
>>     > Hi there,
>>     >
>>     > so, I am currently evaluating Riak to see how it can fit in our 
>> platform. To do so I have set up a cluster of 4 nodes on SmartOS, all of 
>> them on the same physical box.
>>
>>     Mistake. Just stop here. Everything else doesn't matter. Do not put all 
>> your virtual machines (riak nodes) on one physical machine. Put em on 
>> different physical machines. Fix the config files and try again.
>>
>>     > I then built a simple application in node.js that gets log events from 
>> our production system through a RabbitMQ queue and stores them in my cluster. 
>> I let Riak generate the ids, but I have added two secondary indices to be 
>> able to more easily retrieve all the log events that belong to a single 
>> session.
>>     > Everything was going fine: events coming in at around 130 messages per 
>> second are easily ingested by Riak. When I stop it and then restart it, there 
>> is a bit of an issue, as the events are read from the queue at 1500 messages 
>> per second and the insertion times go up, so I need some retries to actually 
>> store everything.
>>     > I wanted to tweak the LevelDB params to increase the throughput. To do 
>> so, I first upgraded from 1.1.6 to 1.2.0. I chose what I thought was the 
>> safest way: node by node, I had them leave the cluster, then upgraded, then 
>> joined again. During the whole process I kept inserting.
>>     > It went quite well. But when I ran some queries using 2i, I got 
>> errors and realized that for two of my four nodes, I had forgotten to put 
>> eLevelDB back as the default engine. As soon as I ran that query, everything 
>> went haywire: a lot of inserts failed, and some nodes were not reachable via 
>> the ping URL.
>>     > I changed the default engine and restarted those nodes; nothing 
>> changed. I tried to make them leave the cluster; after two days, they are 
>> still leaving. riak-admin transfers says that a lot of transfers need to 
>> occur, but the system is stuck: the numbers there do not change.
>>     >
>>     > I guess I have done several things wrong. It is test data, so it 
>> doesn't really matter if I lose data or have to restart from scratch, but I 
>> want to understand what went wrong and how I could have fixed it. Or whether 
>> I can even recover from here now.
>>     >
>>     > Thank you.
>>     > C.
>>     > _______________________________________________
>>     > riak-users mailing list
>>     > [email protected] <mailto:[email protected]>
>>     > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>
>
