Oh, and at the moment they are waiting for some handoffs and I see errors in the log files:

2013-12-11 13:41:47.948 UTC [error] <0.7157.24>@riak_core_handoff_sender:start_fold:269 hinted_handoff transfer of riak_kv_vnode from '[email protected]' 468137243207554840987117797979434404733540892672

I remember that somebody else had this as well, and if I recall correctly it disappeared after the full upgrade was done. But at the moment it's hard to think about upgrading everything at once (~12 hours of 100% disk utilization on all 12 nodes would lead to really slow puts/gets).

What can I do?
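The only concrete thing I found in the list archives so far is kicking the handoff retries by hand from "riak attach". This is just an untested sketch on my side, and I don't know whether it is safe while the cluster is still mixed 1.3.1/1.4.2, so please correct me:

  %% untested sketch: run from "riak attach" on a node that is waiting to hand off
  riak_core_vnode_manager:force_handoffs().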
Cheers,
Simon

PS: transfers output:

'[email protected]' waiting to handoff 17 partitions
'[email protected]' waiting to handoff 19 partitions

(these are the 1.4.2 nodes)

On Wed, 11 Dec 2013 14:39:58 +0100 Simon Effenberg <[email protected]> wrote:

> Also some side notes:
>
> "top" looks even better on the new 1.4.2 machines than on the 1.3.1 machines. IO utilization of the disks is mostly the same (around 33%).
>
> But:
>
> 95th percentile of response time for get (avg over all nodes):
> before upgrade: 29ms
> after upgrade: almost the same
>
> 95th percentile of response time for put (avg over all nodes):
> before upgrade: 60ms
> after upgrade: 1548ms
> (but this is only because 2 of the 12 nodes are on 1.4.2 and those are really slow (17000ms))
>
> Cheers,
> Simon
>
> On Wed, 11 Dec 2013 13:45:56 +0100 Simon Effenberg <[email protected]> wrote:
>
> > Sorry, I forgot half of it:
> >
> > seffenberg@kriak46-1:~$ free -m
> >              total       used       free     shared    buffers     cached
> > Mem:         23999      23759        239          0        184      16183
> > -/+ buffers/cache:       7391      16607
> > Swap:            0          0          0
> >
> > We have 12 servers.
> > datadir on the compacted servers (1.4.2): ~765 GB
> >
> > AAE is enabled.
> >
> > I attached app.config and vm.args.
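> >
> > Rough per-node math from my side (probably too naive, and I don't know how 1.4 accounts for the leveldb file cache, so please correct me):
> >
> >   256 partitions / 12 nodes        ~= 21-22 vnodes per node
> >   ~21 vnodes * max_open_files 100  ~= 2100 files leveldb may keep open per node
> >   RAM per node                      = 24 GB (of which ~16 GB is page cache right now)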
> > Cheers
> > Simon
> >
> > On Wed, 11 Dec 2013 07:33:31 -0500 Matthew Von-Maszewski <[email protected]> wrote:
> >
> > > Ok, I am now suspecting that your servers are either using swap space (which is slow) or your leveldb file cache is thrashing (opening and closing multiple files per request).
> > >
> > > How many servers do you have, and do you use Riak's active anti-entropy feature? I am going to plug all of this into a spreadsheet.
> > >
> > > Matthew Von-Maszewski
> > >
> > > On Dec 11, 2013, at 7:09, Simon Effenberg <[email protected]> wrote:
> > >
> > > > Hi Matthew,
> > > >
> > > > Memory: 23999 MB
> > > >
> > > > ring_creation_size, 256
> > > > max_open_files, 100
> > > >
> > > > riak-admin status:
> > > >
> > > > memory_total : 276001360
> > > > memory_processes : 191506322
> > > > memory_processes_used : 191439568
> > > > memory_system : 84495038
> > > > memory_atom : 686993
> > > > memory_atom_used : 686560
> > > > memory_binary : 21965352
> > > > memory_code : 11332732
> > > > memory_ets : 10823528
> > > >
> > > > Thanks for looking!
> > > >
> > > > Cheers
> > > > Simon
> > > >
> > > > On Wed, 11 Dec 2013 06:44:42 -0500 Matthew Von-Maszewski <[email protected]> wrote:
> > > >
> > > >> I need to ask other developers as they arrive for the new day. This does not make sense to me.
> > > >>
> > > >> How many nodes do you have? How much RAM do you have in each node? What are your settings for max_open_files and cache_size in the app.config file? Maybe this is as simple as leveldb using too much RAM in 1.4. The memory accounting for max_open_files changed in 1.4.
> > > >>
> > > >> Matthew Von-Maszewski
> > > >>
> > > >> On Dec 11, 2013, at 6:28, Simon Effenberg <[email protected]> wrote:
> > > >>
> > > >>> Hi Matthew,
> > > >>>
> > > >>> it took around 11 hours for the first node to finish the compaction. The second node has been running for 12 hours already and is still compacting.
> > > >>>
> > > >>> Besides that, I wonder why the put FSM times on the new 1.4.2 host are so much higher (after the compaction) than on an old 1.3.1 host (both are serving in the cluster right now, and another node is doing the compaction/upgrade while in the cluster but not directly reachable because it is out of the load balancer):
> > > >>>
> > > >>> 1.4.2:
> > > >>>
> > > >>> node_put_fsm_time_mean : 2208050
> > > >>> node_put_fsm_time_median : 39231
> > > >>> node_put_fsm_time_95 : 17400382
> > > >>> node_put_fsm_time_99 : 50965752
> > > >>> node_put_fsm_time_100 : 59537762
> > > >>> node_put_fsm_active : 5
> > > >>> node_put_fsm_active_60s : 364
> > > >>> node_put_fsm_in_rate : 5
> > > >>> node_put_fsm_out_rate : 3
> > > >>> node_put_fsm_rejected : 0
> > > >>> node_put_fsm_rejected_60s : 0
> > > >>> node_put_fsm_rejected_total : 0
> > > >>>
> > > >>> 1.3.1:
> > > >>>
> > > >>> node_put_fsm_time_mean : 5036
> > > >>> node_put_fsm_time_median : 1614
> > > >>> node_put_fsm_time_95 : 8789
> > > >>> node_put_fsm_time_99 : 38258
> > > >>> node_put_fsm_time_100 : 384372
> > > >>>
> > > >>> Any clue why this could be?
> > > >>>
> > > >>> Cheers
> > > >>> Simon
> > > >>>
> > > >>> On Tue, 10 Dec 2013 17:21:07 +0100 Simon Effenberg <[email protected]> wrote:
> > > >>>
> > > >>>> Hi Matthew,
> > > >>>>
> > > >>>> Thanks! That answers my questions.
> > > >>>>
> > > >>>> Cheers
> > > >>>> Simon
> > > >>>>
> > > >>>> On Tue, 10 Dec 2013 11:08:32 -0500 Matthew Von-Maszewski <[email protected]> wrote:
> > > >>>>
> > > >>>>> 2i is not my expertise, so I had to discuss your concerns with another Basho developer. He says:
> > > >>>>>
> > > >>>>> Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk format. You must wait for all nodes to update if you desire to use the new 2i query. The 2i data will properly write/update on both 1.3 and 1.4 machines during the migration.
> > > >>>>>
> > > >>>>> Does that answer your question?
> > > >>>>>
> > > >>>>> And yes, you might see available disk space increase during the upgrade compactions if your dataset contains numerous delete "tombstones". The Riak 2.0 code includes a new feature called "aggressive delete" for leveldb. This feature is more proactive in pushing delete tombstones through the levels to free up disk space much more quickly (especially if you perform block deletes every now and then).
> > > >>>>>
> > > >>>>> Matthew
> > > >>>>>
> > > >>>>> On Dec 10, 2013, at 10:44 AM, Simon Effenberg <[email protected]> wrote:
> > > >>>>>
> > > >>>>>> Hi Matthew,
> > > >>>>>>
> > > >>>>>> see inline..
> > > >>>>>>
> > > >>>>>> On Tue, 10 Dec 2013 10:38:03 -0500 Matthew Von-Maszewski <[email protected]> wrote:
> > > >>>>>>
> > > >>>>>>> The sad truth is that you are not the first to see this problem. And yes, it has to do with your 950GB per node dataset. And no, there is nothing to do but sit through it at this time.
> > > >>>>>>>
> > > >>>>>>> While I did extensive testing around upgrade times before shipping 1.4, apparently there are data configurations I did not anticipate. You are likely seeing a cascade where a shift of one file from level-1 to level-2 is causing a shift of another file from level-2 to level-3, which causes a level-3 file to shift to level-4, etc., then the next file shifts from level-1.
> > > >>>>>>>
> > > >>>>>>> The bright side of this pain is that you will end up with better write throughput once all the compaction ends.
> > > >>>>>>
> > > >>>>>> I have to deal with that.. but my problem now is: if I'm doing this node by node, it looks like 2i searches aren't possible while 1.3 and 1.4 nodes exist in the cluster. Is there any problem that leads me to a 2i repair marathon, or could I simply wait a few hours for each node until all merges are done before I upgrade the next one? (2i searches can fail for some time, the app can cope with that, but are new inserts with 2i indices processed successfully, or do I have to do the 2i repair?)
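> > > >>>>>>
> > > >>>>>> If a repair does turn out to be necessary afterwards, I assume it would be the per-partition 2i repair via riak-admin, something like the line below. I have not verified that this command is available in 1.4.2, so treat it as a placeholder:
> > > >>>>>>
> > > >>>>>>   riak-admin repair-2i   # intended to repair the 2i data for the partitions on this node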
> > > >>>>>>
> > > >>>>>> /s
> > > >>>>>>
> > > >>>>>> One other good thing: saving disk space is an advantage ;)
> > > >>>>>>
> > > >>>>>>> Riak 2.0's leveldb has code to prevent/reduce compaction cascades, but that is not going to help you today.
> > > >>>>>>>
> > > >>>>>>> Matthew
> > > >>>>>>>
> > > >>>>>>> On Dec 10, 2013, at 10:26 AM, Simon Effenberg <[email protected]> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi @list,
> > > >>>>>>>>
> > > >>>>>>>> I'm trying to upgrade our Riak cluster from 1.3.1 to 1.4.2. After upgrading the first node (out of 12), this node seems to do many merges. The sst_* directories change in size "rapidly", and the node has a disk utilization of 100% all the time.
> > > >>>>>>>>
> > > >>>>>>>> I know that there is something like this:
> > > >>>>>>>>
> > > >>>>>>>> "The first execution of 1.4.0 leveldb using a 1.3.x or 1.2.x dataset will initiate an automatic conversion that could pause the startup of each node by 3 to 7 minutes. The leveldb data in "level #1" is being adjusted such that "level #1" can operate as an overlapped data level instead of as a sorted data level. The conversion is simply the reduction of the number of files in "level #1" to being less than eight via normal compaction of data from "level #1" into "level #2". This is a one time conversion."
> > > >>>>>>>>
> > > >>>>>>>> But it looks much more invasive than explained there, or it doesn't have anything to do with the merges I'm (probably) seeing.
> > > >>>>>>>>
> > > >>>>>>>> Is this "normal" behavior, or could I do anything about it?
> > > >>>>>>>>
> > > >>>>>>>> At the moment I'm stuck with the upgrade procedure, because this high IO load would probably lead to high response times.
> > > >>>>>>>>
> > > >>>>>>>> Also we have a lot of data (per node ~950 GB).
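> > > >>>>>>>>
> > > >>>>>>>> (In case it is useful: I'm watching the compaction progress with a plain du over the sst_* directories, roughly like the line below. The path assumes the default data_root of /var/lib/riak/leveldb, so adjust it to your setup.)
> > > >>>>>>>>
> > > >>>>>>>>   du -sh /var/lib/riak/leveldb/*/sst_*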
> > > >>>>>>>>
> > > >>>>>>>> Cheers
> > > >>>>>>>> Simon

--
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon: +49-(0)30-8109-7173
Fax: +49-(0)30-8109-7131

Mail: [email protected]
Web: www.mobile.de

Marktplatz 1 | 14532 Europarc Dreilinden | Germany

Geschäftsführer: Malte Krüger
HRB Nr.: 18517 P, Amtsgericht Potsdam
Sitz der Gesellschaft: Kleinmachnow

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
