Oh, and at the moment they are waiting for some handoffs and I see errors in the log files:

2013-12-11 13:41:47.948 UTC [error] <0.7157.24>@riak_core_handoff_sender:start_fold:269 hinted_handoff transfer of riak_kv_vnode from '[email protected]' 468137243207554840987117797979434404733540892672

I remember that somebody else had this as well, and if I recall correctly it disappeared after the full upgrade was done. But at the moment it's hard to think about upgrading everything at once (~12 hours of 100% disk utilization on all 12 nodes would lead to really slow puts/gets).

What can I do?
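The only concrete thing I found in the list archives so far is kicking the handoff retries by hand from "riak attach". This is just an untested sketch on my side, and I don't know whether it is safe while the cluster is still mixed 1.3.1/1.4.2, so please correct me:

  %% untested sketch: run from "riak attach" on a node that is waiting to hand off
  riak_core_vnode_manager:force_handoffs().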
Cheers,
Simon

PS: transfers output:

'[email protected]' waiting to handoff 17 partitions
'[email protected]' waiting to handoff 19 partitions

(these are the 1.4.2 nodes)

On Wed, 11 Dec 2013 14:39:58 +0100 Simon Effenberg <[email protected]> wrote:

> Also some side notes:
>
> "top" looks even better on the new 1.4.2 machines than on the 1.3.1 machines. IO utilization of the disks is mostly the same (around 33%).
>
> But:
>
> 95th percentile of response time for get (avg over all nodes):
> before upgrade: 29ms
> after upgrade: almost the same
>
> 95th percentile of response time for put (avg over all nodes):
> before upgrade: 60ms
> after upgrade: 1548ms
> (but this is only because 2 of the 12 nodes are on 1.4.2 and those are really slow (17000ms))
>
> Cheers,
> Simon
>
> On Wed, 11 Dec 2013 13:45:56 +0100 Simon Effenberg <[email protected]> wrote:
>
> > Sorry, I forgot half of it:
> >
> > seffenberg@kriak46-1:~$ free -m
> >              total       used       free     shared    buffers     cached
> > Mem:         23999      23759        239          0        184      16183
> > -/+ buffers/cache:       7391      16607
> > Swap:            0          0          0
> >
> > We have 12 servers.
> > datadir on the compacted servers (1.4.2): ~765 GB
> >
> > AAE is enabled.
> >
> > I attached app.config and vm.args.
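> >
> > Rough per-node math from my side (probably too naive, and I don't know how 1.4 accounts for the leveldb file cache, so please correct me):
> >
> >   256 partitions / 12 nodes        ~= 21-22 vnodes per node
> >   ~21 vnodes * max_open_files 100  ~= 2100 files leveldb may keep open per node
> >   RAM per node                      = 24 GB (of which ~16 GB is page cache right now)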
> > Cheers
> > Simon
> >
> > On Wed, 11 Dec 2013 07:33:31 -0500 Matthew Von-Maszewski <[email protected]> wrote:
> >
> > > Ok, I am now suspecting that your servers are either using swap space (which is slow) or your leveldb file cache is thrashing (opening and closing multiple files per request).
> > >
> > > How many servers do you have, and do you use Riak's active anti-entropy feature? I am going to plug all of this into a spreadsheet.
> > >
> > > Matthew Von-Maszewski
> > >
> > > On Dec 11, 2013, at 7:09, Simon Effenberg <[email protected]> wrote:
> > >
> > > > Hi Matthew,
> > > >
> > > > Memory: 23999 MB
> > > >
> > > > ring_creation_size, 256
> > > > max_open_files, 100
> > > >
> > > > riak-admin status:
> > > >
> > > > memory_total : 276001360
> > > > memory_processes : 191506322
> > > > memory_processes_used : 191439568
> > > > memory_system : 84495038
> > > > memory_atom : 686993
> > > > memory_atom_used : 686560
> > > > memory_binary : 21965352
> > > > memory_code : 11332732
> > > > memory_ets : 10823528
> > > >
> > > > Thanks for looking!
> > > >
> > > > Cheers
> > > > Simon
> > > >
> > > > On Wed, 11 Dec 2013 06:44:42 -0500 Matthew Von-Maszewski <[email protected]> wrote:
> > > >
> > > >> I need to ask other developers as they arrive for the new day. This does not make sense to me.
> > > >>
> > > >> How many nodes do you have? How much RAM do you have in each node? What are your settings for max_open_files and cache_size in the app.config file? Maybe this is as simple as leveldb using too much RAM in 1.4. The memory accounting for max_open_files changed in 1.4.
> > > >>
> > > >> Matthew Von-Maszewski
> > > >>
> > > >> On Dec 11, 2013, at 6:28, Simon Effenberg <[email protected]> wrote:
> > > >>
> > > >>> Hi Matthew,
> > > >>>
> > > >>> it took around 11 hours for the first node to finish the compaction. The second node has been running for 12 hours already and is still compacting.
> > > >>>
> > > >>> Besides that, I wonder why the put FSM times on the new 1.4.2 host are so much higher (after the compaction) than on an old 1.3.1 host (both are serving in the cluster right now, and another node is doing the compaction/upgrade while in the cluster but not directly reachable because it is out of the load balancer):
> > > >>>
> > > >>> 1.4.2:
> > > >>>
> > > >>> node_put_fsm_time_mean : 2208050
> > > >>> node_put_fsm_time_median : 39231
> > > >>> node_put_fsm_time_95 : 17400382
> > > >>> node_put_fsm_time_99 : 50965752
> > > >>> node_put_fsm_time_100 : 59537762
> > > >>> node_put_fsm_active : 5
> > > >>> node_put_fsm_active_60s : 364
> > > >>> node_put_fsm_in_rate : 5
> > > >>> node_put_fsm_out_rate : 3
> > > >>> node_put_fsm_rejected : 0
> > > >>> node_put_fsm_rejected_60s : 0
> > > >>> node_put_fsm_rejected_total : 0
> > > >>>
> > > >>> 1.3.1:
> > > >>>
> > > >>> node_put_fsm_time_mean : 5036
> > > >>> node_put_fsm_time_median : 1614
> > > >>> node_put_fsm_time_95 : 8789
> > > >>> node_put_fsm_time_99 : 38258
> > > >>> node_put_fsm_time_100 : 384372
> > > >>>
> > > >>> Any clue why this could be?
> > > >>>
> > > >>> Cheers
> > > >>> Simon
> > > >>>
> > > >>> On Tue, 10 Dec 2013 17:21:07 +0100 Simon Effenberg <[email protected]> wrote:
> > > >>>
> > > >>>> Hi Matthew,
> > > >>>>
> > > >>>> Thanks! That answers my questions.
> > > >>>>
> > > >>>> Cheers
> > > >>>> Simon
> > > >>>>
> > > >>>> On Tue, 10 Dec 2013 11:08:32 -0500 Matthew Von-Maszewski <[email protected]> wrote:
> > > >>>>
> > > >>>>> 2i is not my expertise, so I had to discuss your concerns with another Basho developer. He says:
> > > >>>>>
> > > >>>>> Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk format. You must wait for all nodes to update if you desire to use the new 2i query. The 2i data will properly write/update on both 1.3 and 1.4 machines during the migration.
> > > >>>>>
> > > >>>>> Does that answer your question?
> > > >>>>>
> > > >>>>> And yes, you might see available disk space increase during the upgrade compactions if your dataset contains numerous delete "tombstones". The Riak 2.0 code includes a new feature called "aggressive delete" for leveldb. This feature is more proactive in pushing delete tombstones through the levels to free up disk space much more quickly (especially if you perform block deletes every now and then).
> > > >>>>>
> > > >>>>> Matthew
> > > >>>>>
> > > >>>>> On Dec 10, 2013, at 10:44 AM, Simon Effenberg <[email protected]> wrote:
> > > >>>>>
> > > >>>>>> Hi Matthew,
> > > >>>>>>
> > > >>>>>> see inline..
> > > >>>>>>
> > > >>>>>> On Tue, 10 Dec 2013 10:38:03 -0500 Matthew Von-Maszewski <[email protected]> wrote:
> > > >>>>>>
> > > >>>>>>> The sad truth is that you are not the first to see this problem. And yes, it has to do with your 950GB per node dataset. And no, there is nothing to do but sit through it at this time.
> > > >>>>>>>
> > > >>>>>>> While I did extensive testing around upgrade times before shipping 1.4, apparently there are data configurations I did not anticipate. You are likely seeing a cascade where a shift of one file from level-1 to level-2 is causing a shift of another file from level-2 to level-3, which causes a level-3 file to shift to level-4, etc., then the next file shifts from level-1.
> > > >>>>>>>
> > > >>>>>>> The bright side of this pain is that you will end up with better write throughput once all the compaction ends.
> > > >>>>>>
> > > >>>>>> I have to deal with that.. but my problem now is: if I'm doing this node by node, it looks like 2i searches aren't possible while 1.3 and 1.4 nodes exist in the cluster. Is there any problem that leads me to a 2i repair marathon, or could I simply wait a few hours for each node until all merges are done before I upgrade the next one? (2i searches can fail for some time, the app can cope with that, but are new inserts with 2i indices processed successfully, or do I have to do the 2i repair?)
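> > > >>>>>>
> > > >>>>>> If a repair does turn out to be necessary afterwards, I assume it would be the per-partition 2i repair via riak-admin, something like the line below. I have not verified that this command is available in 1.4.2, so treat it as a placeholder:
> > > >>>>>>
> > > >>>>>>   riak-admin repair-2i   # intended to repair the 2i data for the partitions on this node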
> > > >>>>>>
> > > >>>>>> /s
> > > >>>>>>
> > > >>>>>> One other good thing: saving disk space is an advantage ;)
> > > >>>>>>
> > > >>>>>>> Riak 2.0's leveldb has code to prevent/reduce compaction cascades, but that is not going to help you today.
> > > >>>>>>>
> > > >>>>>>> Matthew
> > > >>>>>>>
> > > >>>>>>> On Dec 10, 2013, at 10:26 AM, Simon Effenberg <[email protected]> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi @list,
> > > >>>>>>>>
> > > >>>>>>>> I'm trying to upgrade our Riak cluster from 1.3.1 to 1.4.2. After upgrading the first node (out of 12), this node seems to do many merges. The sst_* directories change in size "rapidly", and the node has a disk utilization of 100% all the time.
> > > >>>>>>>>
> > > >>>>>>>> I know that there is something like this:
> > > >>>>>>>>
> > > >>>>>>>> "The first execution of 1.4.0 leveldb using a 1.3.x or 1.2.x dataset will initiate an automatic conversion that could pause the startup of each node by 3 to 7 minutes. The leveldb data in "level #1" is being adjusted such that "level #1" can operate as an overlapped data level instead of as a sorted data level. The conversion is simply the reduction of the number of files in "level #1" to being less than eight via normal compaction of data from "level #1" into "level #2". This is a one time conversion."
> > > >>>>>>>>
> > > >>>>>>>> But it looks much more invasive than explained there, or it doesn't have anything to do with the merges I'm (probably) seeing.
> > > >>>>>>>>
> > > >>>>>>>> Is this "normal" behavior, or could I do anything about it?
> > > >>>>>>>>
> > > >>>>>>>> At the moment I'm stuck with the upgrade procedure, because this high IO load would probably lead to high response times.
> > > >>>>>>>>
> > > >>>>>>>> Also we have a lot of data (per node ~950 GB).
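> > > >>>>>>>>
> > > >>>>>>>> (In case it is useful: I'm watching the compaction progress with a plain du over the sst_* directories, roughly like the line below. The path assumes the default data_root of /var/lib/riak/leveldb, so adjust it to your setup.)
> > > >>>>>>>>
> > > >>>>>>>>   du -sh /var/lib/riak/leveldb/*/sst_*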
> > > >>>>>>>>
> > > >>>>>>>> Cheers
> > > >>>>>>>> Simon

--
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon: +49-(0)30-8109-7173
Fax: +49-(0)30-8109-7131

Mail: [email protected]
Web: www.mobile.de

Marktplatz 1 | 14532 Europarc Dreilinden | Germany

Geschäftsführer: Malte Krüger
HRB Nr.: 18517 P, Amtsgericht Potsdam
Sitz der Gesellschaft: Kleinmachnow

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
