Also some side notes:

"top" looks even better on the new 1.4.2 machines than on the 1.3.1
ones, and disk IO utilization is roughly the same (around 33%)..
but:

95th percentile of response time for get (averaged over all nodes):
  before upgrade: 29ms
  after upgrade: almost the same

95th percentile of response time for put (averaged over all nodes):
  before upgrade: 60ms
  after upgrade: 1548ms

but that is only because 2 of the 12 nodes are on 1.4.2 and those are
really slow (~17000ms).
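
In case anyone wants to reproduce this, here is roughly where the
per-node numbers come from (a sketch; riak-admin reports these timings
in microseconds, the averaging/conversion to ms happens in our
monitoring, and the host names and ssh setup are specific to our
environment):

for host in kriak46-{1..12}; do
  ssh "$host" riak-admin status \
    | egrep 'node_(get|put)_fsm_time_95' \
    | sed "s/^/$host: /"
done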
Cheers,
Simon
On Wed, 11 Dec 2013 13:45:56 +0100
Simon Effenberg <[email protected]> wrote:
> Sorry, I forgot half of it..
>
> seffenberg@kriak46-1:~$ free -m
>              total       used       free     shared    buffers     cached
> Mem:         23999      23759        239          0        184      16183
> -/+ buffers/cache:       7391      16607
> Swap:            0          0          0
>
> We have 12 servers.
> The data dir on the already-compacted (1.4.2) servers is ~765 GB.
>
> AAE is enabled.
>
> I attached app.config and vm.args.
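>
> In case the relevant bits are useful without opening the attachments,
> this is roughly how I pull them out (a sketch; /etc/riak is where the
> Debian packages put app.config on our boxes, adjust the path as needed):
>
> egrep 'ring_creation_size|max_open_files|cache_size|anti_entropy' \
>   /etc/riak/app.config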
>
> Cheers
> Simon
>
> On Wed, 11 Dec 2013 07:33:31 -0500
> Matthew Von-Maszewski <[email protected]> wrote:
>
> > Ok, I am now suspecting that your servers are either using swap space
> > (which is slow) or your leveldb file cache is thrashing (opening and
> > closing multiple files per request).
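> >
> > A quick way to check both (a sketch; the Erlang VM process name and the
> > .sst data file extension are assumptions from a standard install):
> >
> > vmstat 1 5     # persistent si/so > 0 means the node is swapping
> > sudo lsof -p "$(pgrep -f beam.smp)" | grep -c '\.sst'   # open data files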
> >
> > How many servers do you have and do you use Riak's active anti-entropy
> > feature? I am going to plug all of this into a spreadsheet.
> >
> > Matthew Von-Maszewski
> >
> >
> > On Dec 11, 2013, at 7:09, Simon Effenberg <[email protected]> wrote:
> >
> > > Hi Matthew
> > >
> > > Memory: 23999 MB
> > >
> > > ring_creation_size, 256
> > > max_open_files, 100
> > >
> > > riak-admin status:
> > >
> > > memory_total : 276001360
> > > memory_processes : 191506322
> > > memory_processes_used : 191439568
> > > memory_system : 84495038
> > > memory_atom : 686993
> > > memory_atom_used : 686560
> > > memory_binary : 21965352
> > > memory_code : 11332732
> > > memory_ets : 10823528
> > >
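> > > (These memory_* values are in bytes, so memory_total of 276001360 is
> > > only about 263 MB for the Erlang VM itself.)
> > >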
> > > Thanks for looking!
> > >
> > > Cheers
> > > Simon
> > >
> > >
> > >
> > > On Wed, 11 Dec 2013 06:44:42 -0500
> > > Matthew Von-Maszewski <[email protected]> wrote:
> > >
> > >> I need to ask the other developers as they arrive for the new day. This
> > >> does not make sense to me.
> > >>
> > >> How many nodes do you have? How much RAM do you have in each node?
> > >> What are your settings for max_open_files and cache_size in the
> > >> app.config file? Maybe this is as simple as leveldb using too much RAM
> > >> in 1.4: the memory accounting for max_open_files changed in 1.4.
> > >>
> > >> Matthew Von-Maszewski
> > >>
> > >>
> > >> On Dec 11, 2013, at 6:28, Simon Effenberg <[email protected]>
> > >> wrote:
> > >>
> > >>> Hi Matthew,
> > >>>
> > >>> it took around 11 hours for the first node to finish the compaction. The
> > >>> second node has been running for 12 hours already and is still compacting.
> > >>>
> > >>> Besides that, I'm wondering why the put FSM times on the new 1.4.2 host
> > >>> are much higher (after the compaction) than on an old 1.3.1 host (both
> > >>> are serving the cluster right now, and another node is doing the
> > >>> compaction/upgrade while it is in the cluster but not directly
> > >>> accessible because it has been taken out of the load balancer):
> > >>>
> > >>> 1.4.2:
> > >>>
> > >>> node_put_fsm_time_mean : 2208050
> > >>> node_put_fsm_time_median : 39231
> > >>> node_put_fsm_time_95 : 17400382
> > >>> node_put_fsm_time_99 : 50965752
> > >>> node_put_fsm_time_100 : 59537762
> > >>> node_put_fsm_active : 5
> > >>> node_put_fsm_active_60s : 364
> > >>> node_put_fsm_in_rate : 5
> > >>> node_put_fsm_out_rate : 3
> > >>> node_put_fsm_rejected : 0
> > >>> node_put_fsm_rejected_60s : 0
> > >>> node_put_fsm_rejected_total : 0
> > >>>
> > >>>
> > >>> 1.3.1:
> > >>>
> > >>> node_put_fsm_time_mean : 5036
> > >>> node_put_fsm_time_median : 1614
> > >>> node_put_fsm_time_95 : 8789
> > >>> node_put_fsm_time_99 : 38258
> > >>> node_put_fsm_time_100 : 384372
> > >>>
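> > >>> (If I read these correctly as microseconds, which is what riak-admin
> > >>> status uses, that is a mean put time of roughly 2.2 s and a 95th
> > >>> percentile of ~17.4 s on the 1.4.2 node, versus ~5 ms and ~8.8 ms on
> > >>> the 1.3.1 node.)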
> > >>>
> > >>> Any clue why this could be?
> > >>>
> > >>> Cheers
> > >>> Simon
> > >>>
> > >>> On Tue, 10 Dec 2013 17:21:07 +0100
> > >>> Simon Effenberg <[email protected]> wrote:
> > >>>
> > >>>> Hi Matthew,
> > >>>>
> > >>>> thanks!.. that answers my questions!
> > >>>>
> > >>>> Cheers
> > >>>> Simon
> > >>>>
> > >>>> On Tue, 10 Dec 2013 11:08:32 -0500
> > >>>> Matthew Von-Maszewski <[email protected]> wrote:
> > >>>>
> > >>>>> 2i is not my expertise, so I had to discuss your concerns with another
> > >>>>> Basho developer. He says:
> > >>>>>
> > >>>>> Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk
> > >>>>> format. You must wait for all nodes to update if you desire to use
> > >>>>> the new 2i query. The 2i data will properly write/update on both 1.3
> > >>>>> and 1.4 machines during the migration.
> > >>>>>
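> > >>>>> As an illustration of the new query side (a sketch, assuming the
> > >>>>> stock HTTP interface on port 8098 and an example bucket/index), 1.4
> > >>>>> adds paginated 2i queries along these lines:
> > >>>>>
> > >>>>> curl 'http://localhost:8098/buckets/mybucket/index/field_bin/value?max_results=100'
> > >>>>>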
> > >>>>> Does that answer your question?
> > >>>>>
> > >>>>>
> > >>>>> And yes, you might see available disk space increase during the
> > >>>>> upgrade compactions if your dataset contains numerous delete
> > >>>>> "tombstones". The Riak 2.0 code includes a new feature called
> > >>>>> "aggressive delete" for leveldb. This feature is more proactive in
> > >>>>> pushing delete tombstones through the levels to free up disk space
> > >>>>> much more quickly (especially if you perform block deletes every now
> > >>>>> and then).
> > >>>>>
> > >>>>> Matthew
> > >>>>>
> > >>>>>
> > >>>>> On Dec 10, 2013, at 10:44 AM, Simon Effenberg
> > >>>>> <[email protected]> wrote:
> > >>>>>
> > >>>>>> Hi Matthew,
> > >>>>>>
> > >>>>>> see inline..
> > >>>>>>
> > >>>>>> On Tue, 10 Dec 2013 10:38:03 -0500
> > >>>>>> Matthew Von-Maszewski <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> The sad truth is that you are not the first to see this problem.
> > >>>>>>> And yes, it has to do with your 950GB per node dataset. And no,
> > >>>>>>> nothing to do but sit through it at this time.
> > >>>>>>>
> > >>>>>>> While I did extensive testing around upgrade times before shipping
> > >>>>>>> 1.4, apparently there are data configurations I did not anticipate.
> > >>>>>>> You are likely seeing a cascade where a shift of one file from
> > >>>>>>> level-1 to level-2 is causing a shift of another file from level-2
> > >>>>>>> to level-3, which causes a level-3 file to shift to level-4, etc …
> > >>>>>>> then the next file shifts from level-1.
> > >>>>>>>
> > >>>>>>> The bright side of this pain is that you will end up with better
> > >>>>>>> write throughput once all the compaction ends.
> > >>>>>>
> > >>>>>> I can deal with that.. but my problem now is: if I'm doing this node
> > >>>>>> by node, it looks like 2i searches aren't possible while 1.3 and 1.4
> > >>>>>> nodes coexist in the cluster. Is there anything that would force me
> > >>>>>> into a 2i repair marathon, or can I simply wait a few hours for each
> > >>>>>> node until all merges are done before I upgrade the next one? (2i
> > >>>>>> searches can fail for some time.. the app isn't having problems with
> > >>>>>> that, but are new inserts with 2i indices processed successfully, or
> > >>>>>> do I have to do the 2i repair?)
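> > >>>>>>
> > >>>>>> For reference, my fallback plan if the indexes do end up inconsistent
> > >>>>>> would be the repair command that came with 1.4, run per node (a
> > >>>>>> sketch, to be checked against the docs):
> > >>>>>>
> > >>>>>> riak-admin repair-2i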
> > >>>>>>
> > >>>>>> /s
> > >>>>>>
> > >>>>>> One other good thing: saving disk space is a nice side benefit ;)..
> > >>>>>>
> > >>>>>>
> > >>>>>>>
> > >>>>>>> Riak 2.0's leveldb has code to prevent/reduce compaction cascades,
> > >>>>>>> but that is not going to help you today.
> > >>>>>>>
> > >>>>>>> Matthew
> > >>>>>>>
> > >>>>>>> On Dec 10, 2013, at 10:26 AM, Simon Effenberg
> > >>>>>>> <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>>> Hi @list,
> > >>>>>>>>
> > >>>>>>>> I'm trying to upgrade our Riak cluster from 1.3.1 to 1.4.2. After
> > >>>>>>>> upgrading the first node (out of 12), this node seems to be doing
> > >>>>>>>> many merges: the sst_* directories change in size "rapidly" and the
> > >>>>>>>> node sits at 100% disk utilization all the time.
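> > >>>>>>>>
> > >>>>>>>> This is how I'm watching it (a sketch; the leveldb data dir path is
> > >>>>>>>> an assumption from our setup, adjust as needed):
> > >>>>>>>>
> > >>>>>>>> iostat -x 5                                  # disk %util
> > >>>>>>>> watch -n 60 du -sh /var/lib/riak/leveldb/*   # per-vnode dir sizes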
> > >>>>>>>>
> > >>>>>>>> I know there is this note (from the release notes, I think):
> > >>>>>>>>
> > >>>>>>>> "The first execution of 1.4.0 leveldb using a 1.3.x or 1.2.x
> > >>>>>>>> dataset
> > >>>>>>>> will initiate an automatic conversion that could pause the startup
> > >>>>>>>> of
> > >>>>>>>> each node by 3 to 7 minutes. The leveldb data in "level #1" is
> > >>>>>>>> being
> > >>>>>>>> adjusted such that "level #1" can operate as an overlapped data
> > >>>>>>>> level
> > >>>>>>>> instead of as a sorted data level. The conversion is simply the
> > >>>>>>>> reduction of the number of files in "level #1" to being less than
> > >>>>>>>> eight
> > >>>>>>>> via normal compaction of data from "level #1" into "level #2".
> > >>>>>>>> This is
> > >>>>>>>> a one time conversion."
> > >>>>>>>>
> > >>>>>>>> but it looks much more invasive than explained there, or it may not
> > >>>>>>>> have anything to do with the merges I seem to be seeing.
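> > >>>>>>>>
> > >>>>>>>> One way I could try to see whether this one-time level-1 conversion
> > >>>>>>>> is what is actually running (a sketch; the data dir path and the
> > >>>>>>>> assumption that the sst_1 directories correspond to "level #1" are
> > >>>>>>>> mine, adjust as needed):
> > >>>>>>>>
> > >>>>>>>> # the conversion should drive these counts below eight
> > >>>>>>>> for d in /var/lib/riak/leveldb/*/sst_1; do
> > >>>>>>>>   echo "$d: $(ls "$d" | wc -l)"
> > >>>>>>>> done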
> > >>>>>>>>
> > >>>>>>>> Is this "normal" behavior or could I do anything about it?
> > >>>>>>>>
> > >>>>>>>> At the moment I'm stuck in the upgrade procedure because this high
> > >>>>>>>> IO load would probably lead to high response times.
> > >>>>>>>>
> > >>>>>>>> Also, we have a lot of data (~950 GB per node).
> > >>>>>>>>
> > >>>>>>>> Cheers
> > >>>>>>>> Simon
> > >>>>>>>>
--
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon: + 49-(0)30-8109 - 7173
Fax: + 49-(0)30-8109 - 7131
Mail: [email protected]
Web: www.mobile.de
Marktplatz 1 | 14532 Europarc Dreilinden | Germany
Managing Director: Malte Krüger
Commercial Register No. (HRB): 18517 P, Amtsgericht Potsdam
Registered office: Kleinmachnow
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com