Re: Upgrade from 1.3.1 to 1.4.2 => high IO

Simon Effenberg Wed, 11 Dec 2013 04:47:29 -0800

Sorry, I forgot half of it..

seffenberg@kriak46-1:~$ free -m
             total       used       free     shared    buffers     cached
Mem:         23999      23759        239          0        184      16183
-/+ buffers/cache:       7391      16607
Swap:            0          0          0
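
For readers of the free -m output above: the "-/+ buffers/cache" row is derived from the "Mem:" row and shows how much memory applications actually hold once buffers and page cache are counted as reclaimable. A minimal Python sketch of that arithmetic with the numbers above follows. The function name is only illustrative, and the one-MB differences from the printed row are presumably rounding inside free.

```python
def buffers_cache_row(used, free, buffers, cached):
    """Return (used, free) in MB with buffers and page cache counted as free."""
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable


# Values copied from the "Mem:" row of the free -m output.
app_used, app_free = buffers_cache_row(used=23759, free=239, buffers=184, cached=16183)
print(app_used, app_free)  # 7392 16606
```

So roughly 7.4 GB is held by processes and about 16 GB is page cache, with no swap configured at all.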


We have 12 servers..
datadir on the compacted servers (1.4.2) ~ 765 GB

AAE is enabled.
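
A rough sketch of the kind of spreadsheet arithmetic these numbers feed into, assuming the 256 partitions are spread evenly over the 12 servers. The per-file memory cost (4 MiB here) is a placeholder chosen purely for illustration, not a value from this thread or from the Riak documentation, and the extra leveldb instances that AAE keeps are not counted at all.

```python
import math

MIB = 1024 * 1024


def vnodes_per_node(ring_size, nodes):
    """Return (min, max) vnodes per node if partitions are spread evenly."""
    return ring_size // nodes, math.ceil(ring_size / nodes)


def open_file_cache_bytes(vnodes, max_open_files, bytes_per_file):
    """Upper bound on file-cache memory for all vnodes on one node.

    bytes_per_file is an ASSUMED cost per open-file slot, not a Riak constant.
    """
    return vnodes * max_open_files * bytes_per_file


low, high = vnodes_per_node(ring_size=256, nodes=12)
estimate = open_file_cache_bytes(high, max_open_files=100, bytes_per_file=4 * MIB)
print(low, high, estimate // MIB)  # 21 22 8800
```

Under that assumed per-file cost, the file cache alone could approach 8.8 GB on a node with 22 vnodes, which is worth comparing against the 23999 MB of RAM reported above.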

I attached app.config and vm.args.

Cheers
Simon

On Wed, 11 Dec 2013 07:33:31 -0500
Matthew Von-Maszewski <[email protected]> wrote:

> Ok, I am now suspecting that your servers are either using swap space (which 
> is slow) or your leveldb file cache is thrashing (opening and closing 
> multiple files per request).
> 
> How many servers do you have and do you use Riak's active anti-entropy 
> feature?  I am going to plug all of this into a spreadsheet. 
> 
> Matthew Von-Maszewski
> 
> 
> On Dec 11, 2013, at 7:09, Simon Effenberg <[email protected]> wrote:
> 
> > Hi Matthew
> > 
> > Memory: 23999 MB
> > 
> > ring_creation_size, 256
> > max_open_files, 100
> > 
> > riak-admin status:
> > 
> > memory_total : 276001360
> > memory_processes : 191506322
> > memory_processes_used : 191439568
> > memory_system : 84495038
> > memory_atom : 686993
> > memory_atom_used : 686560
> > memory_binary : 21965352
> > memory_code : 11332732
> > memory_ets : 10823528
> > 
> > Thanks for looking!
> > 
> > Cheers
> > Simon
> > 
> > 
> > 
> > On Wed, 11 Dec 2013 06:44:42 -0500
> > Matthew Von-Maszewski <[email protected]> wrote:
> > 
> >> I need to ask other developers as they arrive for the new day.  Does not 
> >> make sense to me.
> >> 
> >> How many nodes do you have?  How much RAM do you have in each node?  What 
> >> are your settings for max_open_files and cache_size in the app.config 
> >> file?  Maybe this is as simple as leveldb using too much RAM in 1.4.  The 
> > >> memory accounting for max_open_files changed in 1.4.
> >> 
> >> Matthew Von-Maszewski
> >> 
> >> 
> >> On Dec 11, 2013, at 6:28, Simon Effenberg <[email protected]> 
> >> wrote:
> >> 
> >>> Hi Matthew,
> >>> 
> > >>> it took around 11 hours for the first node to finish the compaction. The
> > >>> second node has already been running for 12 hours and is still compacting.
> > >>> 
> > >>> Besides that, I am puzzled that the put FSM time on the new 1.4.2 host
> > >>> (after the compaction) is much higher than on an old 1.3.1 host. Both are
> > >>> serving in the cluster right now; a third node is doing the
> > >>> compaction/upgrade while it is in the cluster but not directly
> > >>> accessible, because it is out of the load balancer:
> >>> 
> >>> 1.4.2:
> >>> 
> >>> node_put_fsm_time_mean : 2208050
> >>> node_put_fsm_time_median : 39231
> >>> node_put_fsm_time_95 : 17400382
> >>> node_put_fsm_time_99 : 50965752
> >>> node_put_fsm_time_100 : 59537762
> >>> node_put_fsm_active : 5
> >>> node_put_fsm_active_60s : 364
> >>> node_put_fsm_in_rate : 5
> >>> node_put_fsm_out_rate : 3
> >>> node_put_fsm_rejected : 0
> >>> node_put_fsm_rejected_60s : 0
> >>> node_put_fsm_rejected_total : 0
> >>> 
> >>> 
> >>> 1.3.1:
> >>> 
> >>> node_put_fsm_time_mean : 5036
> >>> node_put_fsm_time_median : 1614
> >>> node_put_fsm_time_95 : 8789
> >>> node_put_fsm_time_99 : 38258
> >>> node_put_fsm_time_100 : 384372
> >>> 
> >>> 
> > >>> Any clue why this could be?
> >>> 
> >>> Cheers
> >>> Simon
> >>> 
> >>> On Tue, 10 Dec 2013 17:21:07 +0100
> >>> Simon Effenberg <[email protected]> wrote:
> >>> 
> >>>> Hi Matthew,
> >>>> 
> >>>> thanks!.. that answers my questions!
> >>>> 
> >>>> Cheers
> >>>> Simon
> >>>> 
> >>>> On Tue, 10 Dec 2013 11:08:32 -0500
> >>>> Matthew Von-Maszewski <[email protected]> wrote:
> >>>> 
> > >>>>> 2i is not my expertise, so I had to discuss your concerns with another 
> >>>>> Basho developer.  He says:
> >>>>> 
> >>>>> Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk 
> >>>>> format.  You must wait for all nodes to update if you desire to use the 
> >>>>> new 2i query.  The 2i data will properly write/update on both 1.3 and 
> >>>>> 1.4 machines during the migration.
> >>>>> 
> >>>>> Does that answer your question?
> >>>>> 
> >>>>> 
> >>>>> And yes, you might see available disk space increase during the upgrade 
> >>>>> compactions if your dataset contains numerous delete "tombstones".  The 
> >>>>> Riak 2.0 code includes a new feature called "aggressive delete" for 
> >>>>> leveldb.  This feature is more proactive in pushing delete tombstones 
> >>>>> through the levels to free up disk space much more quickly (especially 
> >>>>> if you perform block deletes every now and then).
> >>>>> 
> >>>>> Matthew
> >>>>> 
> >>>>> 
> >>>>> On Dec 10, 2013, at 10:44 AM, Simon Effenberg 
> >>>>> <[email protected]> wrote:
> >>>>> 
> >>>>>> Hi Matthew,
> >>>>>> 
> >>>>>> see inline..
> >>>>>> 
> >>>>>> On Tue, 10 Dec 2013 10:38:03 -0500
> >>>>>> Matthew Von-Maszewski <[email protected]> wrote:
> >>>>>> 
> >>>>>>> The sad truth is that you are not the first to see this problem.  And 
> >>>>>>> yes, it has to do with your 950GB per node dataset.  And no, nothing 
> >>>>>>> to do but sit through it at this time.
> >>>>>>> 
> >>>>>>> While I did extensive testing around upgrade times before shipping 
> >>>>>>> 1.4, apparently there are data configurations I did not anticipate.  
> >>>>>>> You are likely seeing a cascade where a shift of one file from 
> >>>>>>> level-1 to level-2 is causing a shift of another file from level-2 to 
> >>>>>>> level-3, which causes a level-3 file to shift to level-4, etc … then 
> >>>>>>> the next file shifts from level-1.
> >>>>>>> 
> >>>>>>> The bright side of this pain is that you will end up with better 
> >>>>>>> write throughput once all the compaction ends.
> >>>>>> 
> > >>>>>> I have to deal with that.. but my problem now is that, if I do this
> > >>>>>> node by node, it looks like 2i searches aren't possible while 1.3 and
> > >>>>>> 1.4 nodes both exist in the cluster. Is there any problem that would lead
> > >>>>>> to a 2i repair marathon, or can I simply wait some hours for each
> > >>>>>> node until all merges are done before I upgrade the next one? (2i
> > >>>>>> searches can fail for some time.. the app has no problem with
> > >>>>>> that, but are new inserts with 2i indices processed successfully, or do
> > >>>>>> I have to do the 2i repair?)
> >>>>>> 
> >>>>>> /s
> >>>>>> 
> > >>>>>> one other good thing: saving disk space is an advantage ;)..
> >>>>>> 
> >>>>>> 
> >>>>>>> 
> >>>>>>> Riak 2.0's leveldb has code to prevent/reduce compaction cascades, 
> >>>>>>> but that is not going to help you today.
> >>>>>>> 
> >>>>>>> Matthew
> >>>>>>> 
> >>>>>>> On Dec 10, 2013, at 10:26 AM, Simon Effenberg 
> >>>>>>> <[email protected]> wrote:
> >>>>>>> 
> >>>>>>>> Hi @list,
> >>>>>>>> 
> > >>>>>>>> I'm trying to upgrade our Riak cluster from 1.3.1 to 1.4.2 .. after
> > >>>>>>>> upgrading the first node (out of 12), this node seems to do many merges.
> > >>>>>>>> The sst_* directories change in size "rapidly" and the node has
> > >>>>>>>> a disk utilization of 100% all the time.
> >>>>>>>> 
> >>>>>>>> I know that there is something like that:
> >>>>>>>> 
> >>>>>>>> "The first execution of 1.4.0 leveldb using a 1.3.x or 1.2.x dataset
> >>>>>>>> will initiate an automatic conversion that could pause the startup of
> >>>>>>>> each node by 3 to 7 minutes. The leveldb data in "level #1" is being
> >>>>>>>> adjusted such that "level #1" can operate as an overlapped data level
> >>>>>>>> instead of as a sorted data level. The conversion is simply the
> > >>>>>>>> reduction of the number of files in "level #1" to being less than eight
> > >>>>>>>> via normal compaction of data from "level #1" into "level #2". This is
> > >>>>>>>> a one time conversion."
> >>>>>>>> 
> > >>>>>>>> but what I see looks much more invasive than described here, or it has
> > >>>>>>>> nothing to do with the merges I'm (probably) seeing.
> >>>>>>>> 
> >>>>>>>> Is this "normal" behavior or could I do anything about it?
> >>>>>>>> 
> > >>>>>>>> At the moment I'm stuck with the upgrade procedure because this high
> > >>>>>>>> IO load would probably lead to high response times.
> >>>>>>>> 
> >>>>>>>> Also we have a lot of data (per node ~950 GB).
> >>>>>>>> 
> >>>>>>>> Cheers
> >>>>>>>> Simon
> >>>>>>>> 
> >>>>>>>> _______________________________________________
> >>>>>>>> riak-users mailing list
> >>>>>>>> [email protected]
> >>>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> >>>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> -- 
> >>>>>> Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> >>>>>> Fon:     + 49-(0)30-8109 - 7173
> >>>>>> Fax:     + 49-(0)30-8109 - 7131
> >>>>>> 
> >>>>>> Mail:     [email protected]
> >>>>>> Web:    www.mobile.de
> >>>>>> 
> >>>>>> Marktplatz 1 | 14532 Europarc Dreilinden | Germany
> >>>>>> 
> >>>>>> 
> >>>>>> Geschäftsführer: Malte Krüger
> >>>>>> HRB Nr.: 18517 P, Amtsgericht Potsdam
> >>>>>> Sitz der Gesellschaft: Kleinmachnow 
> >>>>> 
> >>>> 
> >>>> 
> >>>> -- 
> >>>> Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> >>>> Fon:     + 49-(0)30-8109 - 7173
> >>>> Fax:     + 49-(0)30-8109 - 7131
> >>>> 
> >>>> Mail:     [email protected]
> >>>> Web:    www.mobile.de
> >>>> 
> >>>> Marktplatz 1 | 14532 Europarc Dreilinden | Germany
> >>>> 
> >>>> 
> >>>> Geschäftsführer: Malte Krüger
> >>>> HRB Nr.: 18517 P, Amtsgericht Potsdam
> >>>> Sitz der Gesellschaft: Kleinmachnow 
> >>>> 
> >>>> _______________________________________________
> >>>> riak-users mailing list
> >>>> [email protected]
> >>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> >>> 
> >>> 
> >>> -- 
> >>> Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> >>> Fon:     + 49-(0)30-8109 - 7173
> >>> Fax:     + 49-(0)30-8109 - 7131
> >>> 
> >>> Mail:     [email protected]
> >>> Web:    www.mobile.de
> >>> 
> >>> Marktplatz 1 | 14532 Europarc Dreilinden | Germany
> >>> 
> >>> 
> >>> Geschäftsführer: Malte Krüger
> >>> HRB Nr.: 18517 P, Amtsgericht Potsdam
> >>> Sitz der Gesellschaft: Kleinmachnow 
> > 
> > 


-- 
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon:     + 49-(0)30-8109 - 7173
Fax:     + 49-(0)30-8109 - 7131

Mail:     [email protected]
Web:    www.mobile.de

Marktplatz 1 | 14532 Europarc Dreilinden | Germany


Geschäftsführer: Malte Krüger
HRB Nr.: 18517 P, Amtsgericht Potsdam
Sitz der Gesellschaft: Kleinmachnow

app.config
Description: Binary data

vm.args
Description: Binary data

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
