Also some side notes:

"top" looks even better on the new 1.4.2 machines than on the 1.3.1
ones, and disk IO utilization is roughly the same (around 33%)..
but:

95th percentile of response time for get (averaged over all nodes):
  before upgrade: 29ms
  after upgrade: almost the same

95th percentile of response time for put (averaged over all nodes):
  before upgrade: 60ms
  after upgrade: 1548ms

but that is only because 2 of the 12 nodes are on 1.4.2 and those are
really slow (~17000ms).
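
In case anyone wants to reproduce this, here is roughly where the
per-node numbers come from (a sketch; riak-admin reports these timings
in microseconds, the averaging/conversion to ms happens in our
monitoring, and the host names and ssh setup are specific to our
environment):

for host in kriak46-{1..12}; do
  ssh "$host" riak-admin status \
    | egrep 'node_(get|put)_fsm_time_95' \
    | sed "s/^/$host: /"
done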
Cheers,
Simon
On Wed, 11 Dec 2013 13:45:56 +0100
Simon Effenberg <[email protected]> wrote:
> Sorry, I forgot half of it..
>
> seffenberg@kriak46-1:~$ free -m
>              total       used       free     shared    buffers     cached
> Mem:         23999      23759        239          0        184      16183
> -/+ buffers/cache:       7391      16607
> Swap:            0          0          0
>
> We have 12 servers.
> The data dir on the already-compacted (1.4.2) servers is ~765 GB.
>
> AAE is enabled.
>
> I attached app.config and vm.args.
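>
> In case the relevant bits are useful without opening the attachments,
> this is roughly how I pull them out (a sketch; /etc/riak is where the
> Debian packages put app.config on our boxes, adjust the path as needed):
>
> egrep 'ring_creation_size|max_open_files|cache_size|anti_entropy' \
>   /etc/riak/app.config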
>
> Cheers
> Simon
>
> On Wed, 11 Dec 2013 07:33:31 -0500
> Matthew Von-Maszewski <[email protected]> wrote:
>
> > Ok, I am now suspecting that your servers are either using swap space
> > (which is slow) or your leveldb file cache is thrashing (opening and
> > closing multiple files per request).
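> >
> > A quick way to check both (a sketch; the Erlang VM process name and the
> > .sst data file extension are assumptions from a standard install):
> >
> > vmstat 1 5     # persistent si/so > 0 means the node is swapping
> > sudo lsof -p "$(pgrep -f beam.smp)" | grep -c '\.sst'   # open data files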
> >
> > How many servers do you have and do you use Riak's active anti-entropy
> > feature? I am going to plug all of this into a spreadsheet.
> >
> > Matthew Von-Maszewski
> >
> >
> > On Dec 11, 2013, at 7:09, Simon Effenberg <[email protected]> wrote:
> >
> > > Hi Matthew
> > >
> > > Memory: 23999 MB
> > >
> > > ring_creation_size, 256
> > > max_open_files, 100
> > >
> > > riak-admin status:
> > >
> > > memory_total : 276001360
> > > memory_processes : 191506322
> > > memory_processes_used : 191439568
> > > memory_system : 84495038
> > > memory_atom : 686993
> > > memory_atom_used : 686560
> > > memory_binary : 21965352
> > > memory_code : 11332732
> > > memory_ets : 10823528
> > >
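> > > (These memory_* values are in bytes, so memory_total of 276001360 is
> > > only about 263 MB for the Erlang VM itself.)
> > >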
> > > Thanks for looking!
> > >
> > > Cheers
> > > Simon
> > >
> > >
> > >
> > > On Wed, 11 Dec 2013 06:44:42 -0500
> > > Matthew Von-Maszewski <[email protected]> wrote:
> > >
> > >> I need to ask the other developers as they arrive for the new day. This
> > >> does not make sense to me.
> > >>
> > >> How many nodes do you have? How much RAM do you have in each node?
> > >> What are your settings for max_open_files and cache_size in the
> > >> app.config file? Maybe this is as simple as leveldb using too much RAM
> > >> in 1.4: the memory accounting for max_open_files changed in 1.4.
> > >>
> > >> Matthew Von-Maszewski
> > >>
> > >>
> > >> On Dec 11, 2013, at 6:28, Simon Effenberg <[email protected]>
> > >> wrote:
> > >>
> > >>> Hi Matthew,
> > >>>
> > >>> it took around 11 hours for the first node to finish the compaction. The
> > >>> second node has been running for 12 hours already and is still compacting.
> > >>>
> > >>> Besides that, I'm wondering why the put FSM times on the new 1.4.2 host
> > >>> are much higher (after the compaction) than on an old 1.3.1 host (both
> > >>> are serving the cluster right now, and another node is doing the
> > >>> compaction/upgrade while it is in the cluster but not directly
> > >>> accessible because it has been taken out of the load balancer):
> > >>>
> > >>> 1.4.2:
> > >>>
> > >>> node_put_fsm_time_mean : 2208050
> > >>> node_put_fsm_time_median : 39231
> > >>> node_put_fsm_time_95 : 17400382
> > >>> node_put_fsm_time_99 : 50965752
> > >>> node_put_fsm_time_100 : 59537762
> > >>> node_put_fsm_active : 5
> > >>> node_put_fsm_active_60s : 364
> > >>> node_put_fsm_in_rate : 5
> > >>> node_put_fsm_out_rate : 3
> > >>> node_put_fsm_rejected : 0
> > >>> node_put_fsm_rejected_60s : 0
> > >>> node_put_fsm_rejected_total : 0
> > >>>
> > >>>
> > >>> 1.3.1:
> > >>>
> > >>> node_put_fsm_time_mean : 5036
> > >>> node_put_fsm_time_median : 1614
> > >>> node_put_fsm_time_95 : 8789
> > >>> node_put_fsm_time_99 : 38258
> > >>> node_put_fsm_time_100 : 384372
> > >>>
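> > >>> (If I read these correctly as microseconds, which is what riak-admin
> > >>> status uses, that is a mean put time of roughly 2.2 s and a 95th
> > >>> percentile of ~17.4 s on the 1.4.2 node, versus ~5 ms and ~8.8 ms on
> > >>> the 1.3.1 node.)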
> > >>>
> > >>> Any clue why this could be?
> > >>>
> > >>> Cheers
> > >>> Simon
> > >>>
> > >>> On Tue, 10 Dec 2013 17:21:07 +0100
> > >>> Simon Effenberg <[email protected]> wrote:
> > >>>
> > >>>> Hi Matthew,
> > >>>>
> > >>>> thanks!.. that answers my questions!
> > >>>>
> > >>>> Cheers
> > >>>> Simon
> > >>>>
> > >>>> On Tue, 10 Dec 2013 11:08:32 -0500
> > >>>> Matthew Von-Maszewski <[email protected]> wrote:
> > >>>>
> > >>>>> 2i is not my expertise, so I had to discuss your concerns with another
> > >>>>> Basho developer. He says:
> > >>>>>
> > >>>>> Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk
> > >>>>> format. You must wait for all nodes to update if you desire to use
> > >>>>> the new 2i query. The 2i data will properly write/update on both 1.3
> > >>>>> and 1.4 machines during the migration.
> > >>>>>
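> > >>>>> As an illustration of the new query side (a sketch, assuming the
> > >>>>> stock HTTP interface on port 8098 and an example bucket/index), 1.4
> > >>>>> adds paginated 2i queries along these lines:
> > >>>>>
> > >>>>> curl 'http://localhost:8098/buckets/mybucket/index/field_bin/value?max_results=100'
> > >>>>>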
> > >>>>> Does that answer your question?
> > >>>>>
> > >>>>>
> > >>>>> And yes, you might see available disk space increase during the
> > >>>>> upgrade compactions if your dataset contains numerous delete
> > >>>>> "tombstones". The Riak 2.0 code includes a new feature called
> > >>>>> "aggressive delete" for leveldb. This feature is more proactive in
> > >>>>> pushing delete tombstones through the levels to free up disk space
> > >>>>> much more quickly (especially if you perform block deletes every now
> > >>>>> and then).
> > >>>>>
> > >>>>> Matthew
> > >>>>>
> > >>>>>
> > >>>>> On Dec 10, 2013, at 10:44 AM, Simon Effenberg
> > >>>>> <[email protected]> wrote:
> > >>>>>
> > >>>>>> Hi Matthew,
> > >>>>>>
> > >>>>>> see inline..
> > >>>>>>
> > >>>>>> On Tue, 10 Dec 2013 10:38:03 -0500
> > >>>>>> Matthew Von-Maszewski <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> The sad truth is that you are not the first to see this problem.
> > >>>>>>> And yes, it has to do with your 950GB per node dataset. And no,
> > >>>>>>> nothing to do but sit through it at this time.
> > >>>>>>>
> > >>>>>>> While I did extensive testing around upgrade times before shipping
> > >>>>>>> 1.4, apparently there are data configurations I did not anticipate.
> > >>>>>>> You are likely seeing a cascade where a shift of one file from
> > >>>>>>> level-1 to level-2 is causing a shift of another file from level-2
> > >>>>>>> to level-3, which causes a level-3 file to shift to level-4, etc …
> > >>>>>>> then the next file shifts from level-1.
> > >>>>>>>
> > >>>>>>> The bright side of this pain is that you will end up with better
> > >>>>>>> write throughput once all the compaction ends.
> > >>>>>>
> > >>>>>> I can deal with that.. but my problem now is: if I'm doing this node
> > >>>>>> by node, it looks like 2i searches aren't possible while 1.3 and 1.4
> > >>>>>> nodes coexist in the cluster. Is there anything that would force me
> > >>>>>> into a 2i repair marathon, or can I simply wait a few hours for each
> > >>>>>> node until all merges are done before I upgrade the next one? (2i
> > >>>>>> searches can fail for some time.. the app isn't having problems with
> > >>>>>> that, but are new inserts with 2i indices processed successfully, or
> > >>>>>> do I have to do the 2i repair?)
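> > >>>>>>
> > >>>>>> For reference, my fallback plan if the indexes do end up inconsistent
> > >>>>>> would be the repair command that came with 1.4, run per node (a
> > >>>>>> sketch, to be checked against the docs):
> > >>>>>>
> > >>>>>> riak-admin repair-2i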
> > >>>>>>
> > >>>>>> /s
> > >>>>>>
> > >>>>>> One other good thing: saving disk space is a nice side benefit ;)..
> > >>>>>>
> > >>>>>>
> > >>>>>>>
> > >>>>>>> Riak 2.0's leveldb has code to prevent/reduce compaction cascades,
> > >>>>>>> but that is not going to help you today.
> > >>>>>>>
> > >>>>>>> Matthew
> > >>>>>>>
> > >>>>>>> On Dec 10, 2013, at 10:26 AM, Simon Effenberg
> > >>>>>>> <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>>> Hi @list,
> > >>>>>>>>
> > >>>>>>>> I'm trying to upgrade our Riak cluster from 1.3.1 to 1.4.2. After
> > >>>>>>>> upgrading the first node (out of 12), this node seems to be doing
> > >>>>>>>> many merges: the sst_* directories change in size "rapidly" and the
> > >>>>>>>> node sits at 100% disk utilization all the time.
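> > >>>>>>>>
> > >>>>>>>> This is how I'm watching it (a sketch; the leveldb data dir path is
> > >>>>>>>> an assumption from our setup, adjust as needed):
> > >>>>>>>>
> > >>>>>>>> iostat -x 5                                  # disk %util
> > >>>>>>>> watch -n 60 du -sh /var/lib/riak/leveldb/*   # per-vnode dir sizes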
> > >>>>>>>>
> > >>>>>>>> I know there is this note (from the release notes, I think):
> > >>>>>>>>
> > >>>>>>>> "The first execution of 1.4.0 leveldb using a 1.3.x or 1.2.x
> > >>>>>>>> dataset
> > >>>>>>>> will initiate an automatic conversion that could pause the startup
> > >>>>>>>> of
> > >>>>>>>> each node by 3 to 7 minutes. The leveldb data in "level #1" is
> > >>>>>>>> being
> > >>>>>>>> adjusted such that "level #1" can operate as an overlapped data
> > >>>>>>>> level
> > >>>>>>>> instead of as a sorted data level. The conversion is simply the
> > >>>>>>>> reduction of the number of files in "level #1" to being less than
> > >>>>>>>> eight
> > >>>>>>>> via normal compaction of data from "level #1" into "level #2".
> > >>>>>>>> This is
> > >>>>>>>> a one time conversion."
> > >>>>>>>>
> > >>>>>>>> but it looks much more invasive than explained there, or it may not
> > >>>>>>>> have anything to do with the merges I seem to be seeing.
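> > >>>>>>>>
> > >>>>>>>> One way I could try to see whether this one-time level-1 conversion
> > >>>>>>>> is what is actually running (a sketch; the data dir path and the
> > >>>>>>>> assumption that the sst_1 directories correspond to "level #1" are
> > >>>>>>>> mine, adjust as needed):
> > >>>>>>>>
> > >>>>>>>> # the conversion should drive these counts below eight
> > >>>>>>>> for d in /var/lib/riak/leveldb/*/sst_1; do
> > >>>>>>>>   echo "$d: $(ls "$d" | wc -l)"
> > >>>>>>>> done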
> > >>>>>>>>
> > >>>>>>>> Is this "normal" behavior or could I do anything about it?
> > >>>>>>>>
> > >>>>>>>> At the moment I'm stuck in the upgrade procedure because this high
> > >>>>>>>> IO load would probably lead to high response times.
> > >>>>>>>>
> > >>>>>>>> Also, we have a lot of data (~950 GB per node).
> > >>>>>>>>
> > >>>>>>>> Cheers
> > >>>>>>>> Simon
> > >>>>>>>>
--
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon: + 49-(0)30-8109 - 7173
Fax: + 49-(0)30-8109 - 7131
Mail: [email protected]
Web: www.mobile.de
Marktplatz 1 | 14532 Europarc Dreilinden | Germany
Managing Director: Malte Krüger
Commercial Register No. (HRB): 18517 P, Amtsgericht Potsdam
Registered office: Kleinmachnow
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com