Hi Matthew, thanks for all your time and work. See inline for answers.
On Wed, 11 Dec 2013 09:17:32 -0500
Matthew Von-Maszewski <[email protected]> wrote:

> The real Riak developers have arrived on-line for the day. They are telling
> me that all of your problems are likely due to the extended upgrade times,
> and yes there is a known issue with handoff between 1.3 and 1.4. They also
> say everything should calm down after all nodes are upgraded.
>
> I will review your system settings now and see if there is something that
> might make the other machines upgrade quicker. So three more questions:
>
> - what is the average size of your keys

Bucket names are between 5 and 15 characters (we only have ~10 buckets), and
key names are normally something like 26iesj:hovh7egz.

> - what is the average size of your value (data stored)

I have to guess, but the mean (as reported by Riak) is 12 KB and the 95th
percentile is at 75 KB. In theory we have a limit of 1 MB (larger values get
split up), but thanks to siblings (we have two buckets with allow_mult) we
sometimes also see up to 7 MB, which gets reduced again (it comes from a new
feature in our app that issues too many parallel writes within 15 ms).

> - in regular use, are your keys accessed randomly across their entire range,
> or do they contain a date component which clusters older, less used keys

Normally we don't search but retrieve objects by key name. We have data up to
6 months old, and we mostly access the new/active/hot data, not the old
entries. Besides this, we have a job doing a 2i query every 5 minutes and
another one doing so maybe once an hour; neither works while the upgrade is
ongoing (2i isn't working in the mixed cluster).
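For reference, the 5-minute job is a plain 2i range query over HTTP, roughly
like the sketch below (bucket, index name, and range are placeholders here,
not our real ones). It only returns the matching keys; the objects are then
fetched by key as usual:

  # hypothetical example of the kind of 2i range query our job runs
  curl 'http://localhost:8098/buckets/mybucket/index/created_at_int/20131201/20131211'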
Cheers,
Simon

> Matthew
>
>
> On Dec 11, 2013, at 8:43 AM, Simon Effenberg <[email protected]>
> wrote:
>
> > Oh and at the moment they are waiting for some handoffs and I see
> > errors in the logfiles:
> >
> > 2013-12-11 13:41:47.948 UTC [error]
> > <0.7157.24>@riak_core_handoff_sender:start_fold:269 hinted_handoff
> > transfer of riak_kv_vnode from '[email protected]'
> > 468137243207554840987117797979434404733540892672
> >
> > but I remember that somebody else had this as well, and if I recall
> > correctly it disappeared after the full upgrade was done.. but at the
> > moment it's hard to think about upgrading everything at once..
> > (~12 hours of 100% disk utilization on all 12 nodes will lead to really
> > slow puts/gets)
> >
> > What can I do?
> >
> > Cheers
> > Simon
> >
> > PS: transfers output:
> >
> > '[email protected]' waiting to handoff 17 partitions
> > '[email protected]' waiting to handoff 19 partitions
> >
> > (these are the 1.4.2 nodes)
> >
> >
> > On Wed, 11 Dec 2013 14:39:58 +0100
> > Simon Effenberg <[email protected]> wrote:
> >
> >> Also some side notes:
> >>
> >> "top" looks even better on the new 1.4.2 than on the 1.3.1 machines..
> >> IO utilization of the disks is mostly the same (roughly 33%)..
> >>
> >> but
> >>
> >> 95th percentile of response time for get (avg over all nodes):
> >>   before upgrade: 29 ms
> >>   after upgrade: almost the same
> >>
> >> 95th percentile of response time for put (avg over all nodes):
> >>   before upgrade: 60 ms
> >>   after upgrade: 1548 ms
> >>   (but this is only because 2 of the 12 nodes are on 1.4.2 and are
> >>   really slow: 17000 ms)
> >>
> >> Cheers,
> >> Simon
> >>
> >> On Wed, 11 Dec 2013 13:45:56 +0100
> >> Simon Effenberg <[email protected]> wrote:
> >>
> >>> Sorry, I forgot half of it..
> >>>
> >>> seffenberg@kriak46-1:~$ free -m
> >>>              total       used       free     shared    buffers     cached
> >>> Mem:         23999      23759        239          0        184      16183
> >>> -/+ buffers/cache:       7391      16607
> >>> Swap:            0          0          0
> >>>
> >>> We have 12 servers.
> >>> datadir on the compacted servers (1.4.2): ~765 GB
> >>>
> >>> AAE is enabled.
> >>>
> >>> I attached app.config and vm.args.
> >>>
> >>> Cheers
> >>> Simon
> >>>
> >>> On Wed, 11 Dec 2013 07:33:31 -0500
> >>> Matthew Von-Maszewski <[email protected]> wrote:
> >>>
> >>>> Ok, I am now suspecting that your servers are either using swap space
> >>>> (which is slow) or your leveldb file cache is thrashing (opening and
> >>>> closing multiple files per request).
> >>>>
> >>>> How many servers do you have, and do you use Riak's active anti-entropy
> >>>> feature? I am going to plug all of this into a spreadsheet.
> >>>>
> >>>> Matthew Von-Maszewski
> >>>>
> >>>> On Dec 11, 2013, at 7:09, Simon Effenberg <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi Matthew
> >>>>>
> >>>>> Memory: 23999 MB
> >>>>>
> >>>>> ring_creation_size, 256
> >>>>> max_open_files, 100
> >>>>>
> >>>>> riak-admin status:
> >>>>>
> >>>>> memory_total : 276001360
> >>>>> memory_processes : 191506322
> >>>>> memory_processes_used : 191439568
> >>>>> memory_system : 84495038
> >>>>> memory_atom : 686993
> >>>>> memory_atom_used : 686560
> >>>>> memory_binary : 21965352
> >>>>> memory_code : 11332732
> >>>>> memory_ets : 10823528
> >>>>>
> >>>>> Thanks for looking!
> >>>>>
> >>>>> Cheers
> >>>>> Simon
> >>>>>
> >>>>> On Wed, 11 Dec 2013 06:44:42 -0500
> >>>>> Matthew Von-Maszewski <[email protected]> wrote:
> >>>>>
> >>>>>> I need to ask other developers as they arrive for the new day. This
> >>>>>> does not make sense to me.
> >>>>>>
> >>>>>> How many nodes do you have? How much RAM do you have in each node?
> >>>>>> What are your settings for max_open_files and cache_size in the
> >>>>>> app.config file? Maybe this is as simple as leveldb using too much
> >>>>>> RAM in 1.4. The memory accounting for max_open_files changed in 1.4.
> >>>>>>
> >>>>>> Matthew Von-Maszewski
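(In case the app.config attachment gets stripped on the list: the relevant
eleveldb section looks roughly like this. The data_root path and the
cache_size value are illustrative placeholders, not copied from our config;
max_open_files really is 100.)

  {eleveldb, [
      %% path is just a typical layout, adjust as needed
      {data_root, "/var/lib/riak/leveldb"},
      {max_open_files, 100},
      %% placeholder value, see the attached app.config for the real one
      {cache_size, 8388608}
  ]}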
> >>>>>>
> >>>>>> On Dec 11, 2013, at 6:28, Simon Effenberg <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Matthew,
> >>>>>>>
> >>>>>>> it took around 11 hours for the first node to finish the compaction.
> >>>>>>> The second node has been running for 12 hours already and is still
> >>>>>>> doing compaction.
> >>>>>>>
> >>>>>>> Besides that, I wonder why the fsm_put time on the new 1.4.2 host is
> >>>>>>> much higher (after the compaction) than on an old 1.3.1 host (both
> >>>>>>> are running in the cluster right now, and another one is doing the
> >>>>>>> compaction/upgrade while it is in the cluster but not directly
> >>>>>>> accessible because it was taken out of the load balancer):
> >>>>>>>
> >>>>>>> 1.4.2:
> >>>>>>>
> >>>>>>> node_put_fsm_time_mean : 2208050
> >>>>>>> node_put_fsm_time_median : 39231
> >>>>>>> node_put_fsm_time_95 : 17400382
> >>>>>>> node_put_fsm_time_99 : 50965752
> >>>>>>> node_put_fsm_time_100 : 59537762
> >>>>>>> node_put_fsm_active : 5
> >>>>>>> node_put_fsm_active_60s : 364
> >>>>>>> node_put_fsm_in_rate : 5
> >>>>>>> node_put_fsm_out_rate : 3
> >>>>>>> node_put_fsm_rejected : 0
> >>>>>>> node_put_fsm_rejected_60s : 0
> >>>>>>> node_put_fsm_rejected_total : 0
> >>>>>>>
> >>>>>>> 1.3.1:
> >>>>>>>
> >>>>>>> node_put_fsm_time_mean : 5036
> >>>>>>> node_put_fsm_time_median : 1614
> >>>>>>> node_put_fsm_time_95 : 8789
> >>>>>>> node_put_fsm_time_99 : 38258
> >>>>>>> node_put_fsm_time_100 : 384372
> >>>>>>>
> >>>>>>> Any clue why this could/should be?
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>> Simon
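(For anyone reproducing the comparison: the node_put_fsm_time_* values are in
microseconds, and they come straight out of riak-admin status on each node,
e.g.:

  riak-admin status | egrep 'node_put_fsm_time|node_put_fsm_active'
)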
> >>>>>>>
> >>>>>>> On Tue, 10 Dec 2013 17:21:07 +0100
> >>>>>>> Simon Effenberg <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi Matthew,
> >>>>>>>>
> >>>>>>>> thanks! That answers my questions!
> >>>>>>>>
> >>>>>>>> Cheers
> >>>>>>>> Simon
> >>>>>>>>
> >>>>>>>> On Tue, 10 Dec 2013 11:08:32 -0500
> >>>>>>>> Matthew Von-Maszewski <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> 2i is not my expertise, so I had to discuss your concerns with
> >>>>>>>>> another Basho developer. He says:
> >>>>>>>>>
> >>>>>>>>> Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk
> >>>>>>>>> format. You must wait for all nodes to update if you desire to use
> >>>>>>>>> the new 2i query. The 2i data will properly write/update on both
> >>>>>>>>> 1.3 and 1.4 machines during the migration.
> >>>>>>>>>
> >>>>>>>>> Does that answer your question?
> >>>>>>>>>
> >>>>>>>>> And yes, you might see available disk space increase during the
> >>>>>>>>> upgrade compactions if your dataset contains numerous delete
> >>>>>>>>> "tombstones". The Riak 2.0 code includes a new feature called
> >>>>>>>>> "aggressive delete" for leveldb. This feature is more proactive in
> >>>>>>>>> pushing delete tombstones through the levels to free up disk space
> >>>>>>>>> much more quickly (especially if you perform block deletes every
> >>>>>>>>> now and then).
> >>>>>>>>>
> >>>>>>>>> Matthew
> >>>>>>>>>
> >>>>>>>>> On Dec 10, 2013, at 10:44 AM, Simon Effenberg
> >>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Matthew,
> >>>>>>>>>>
> >>>>>>>>>> see inline..
> >>>>>>>>>>
> >>>>>>>>>> On Tue, 10 Dec 2013 10:38:03 -0500
> >>>>>>>>>> Matthew Von-Maszewski <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> The sad truth is that you are not the first to see this problem.
> >>>>>>>>>>> And yes, it has to do with your 950 GB per node dataset. And no,
> >>>>>>>>>>> there is nothing to do but sit through it at this time.
> >>>>>>>>>>>
> >>>>>>>>>>> While I did extensive testing around upgrade times before
> >>>>>>>>>>> shipping 1.4, apparently there are data configurations I did not
> >>>>>>>>>>> anticipate. You are likely seeing a cascade where a shift of one
> >>>>>>>>>>> file from level-1 to level-2 is causing a shift of another file
> >>>>>>>>>>> from level-2 to level-3, which causes a level-3 file to shift to
> >>>>>>>>>>> level-4, etc … then the next file shifts from level-1.
> >>>>>>>>>>>
> >>>>>>>>>>> The bright side of this pain is that you will end up with better
> >>>>>>>>>>> write throughput once all the compaction ends.
> >>>>>>>>>>
> >>>>>>>>>> I can deal with that.. but my problem now is: if I'm doing this
> >>>>>>>>>> node by node, it looks like 2i searches aren't possible while 1.3
> >>>>>>>>>> and 1.4 nodes coexist in the cluster. Is there any problem that
> >>>>>>>>>> would lead me to a 2i repair marathon, or can I simply wait some
> >>>>>>>>>> hours for each node until all merges are done before I upgrade
> >>>>>>>>>> the next one? (2i searches can fail for some time, the app can
> >>>>>>>>>> handle that, but are new inserts with 2i indices processed
> >>>>>>>>>> successfully, or do I have to run a 2i repair?)
> >>>>>>>>>>
> >>>>>>>>>> /s
> >>>>>>>>>>
> >>>>>>>>>> One other good thing: saving disk space is an advantage ;)..
> >>>>>>>>>>
> >>>>>>>>>>> Riak 2.0's leveldb has code to prevent/reduce compaction
> >>>>>>>>>>> cascades, but that is not going to help you today.
> >>>>>>>>>>>
> >>>>>>>>>>> Matthew
> >>>>>>>>>>>
> >>>>>>>>>>> On Dec 10, 2013, at 10:26 AM, Simon Effenberg
> >>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi @list,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm trying to upgrade our Riak cluster from 1.3.1 to 1.4.2.
> >>>>>>>>>>>> After upgrading the first node (out of 12), this node seems to
> >>>>>>>>>>>> do many merges. The sst_* directories change in size "rapidly"
> >>>>>>>>>>>> and the node has a disk utilization of 100% all the time.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I know the release notes mention something like this:
> >>>>>>>>>>>>
> >>>>>>>>>>>> "The first execution of 1.4.0 leveldb using a 1.3.x or 1.2.x
> >>>>>>>>>>>> dataset will initiate an automatic conversion that could pause
> >>>>>>>>>>>> the startup of each node by 3 to 7 minutes. The leveldb data in
> >>>>>>>>>>>> "level #1" is being adjusted such that "level #1" can operate
> >>>>>>>>>>>> as an overlapped data level instead of as a sorted data level.
> >>>>>>>>>>>> The conversion is simply the reduction of the number of files
> >>>>>>>>>>>> in "level #1" to being less than eight via normal compaction of
> >>>>>>>>>>>> data from "level #1" into "level #2". This is a one time
> >>>>>>>>>>>> conversion."
> >>>>>>>>>>>>
> >>>>>>>>>>>> but it looks much more invasive than explained there, or it may
> >>>>>>>>>>>> have nothing to do with the merges I'm (probably) seeing.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is this "normal" behavior, or can I do anything about it?
> >>>>>>>>>>>>
> >>>>>>>>>>>> At the moment I'm stuck in the upgrade procedure because this
> >>>>>>>>>>>> high IO load would probably lead to high response times.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Also, we have a lot of data (~950 GB per node).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers
> >>>>>>>>>>>> Simon
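In case it helps anyone following along: I'm watching the compaction progress
with the two commands below. The leveldb path matches our layout, so adjust
it as needed.

  # rough compaction progress: total size of all leveldb level directories
  du -sch /var/lib/riak/leveldb/*/sst_* | tail -n 1

  # disk utilization (the %util column), refreshed every 60 seconds
  iostat -x 60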
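And regarding the pending handoffs: if I recall correctly, 1.4 added a knob
to throttle handoff concurrency, which might at least soften the IO impact
while the mixed cluster settles (command from memory, please double-check
against the 1.4 docs):

  # limit concurrent handoffs on a single 1.4.2 node (example value)
  riak-admin transfer-limit [email protected] 1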
--
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon: + 49-(0)30-8109 - 7173
Fax: + 49-(0)30-8109 - 7131

Mail: [email protected]
Web: www.mobile.de

Marktplatz 1 | 14532 Europarc Dreilinden | Germany

Geschäftsführer: Malte Krüger
HRB Nr.: 18517 P, Amtsgericht Potsdam
Sitz der Gesellschaft: Kleinmachnow

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
