Hi Matthew, thanks for all your time and work. See inline for answers.
On Wed, 11 Dec 2013 09:17:32 -0500
Matthew Von-Maszewski <[email protected]> wrote:

> The real Riak developers have arrived on-line for the day. They are telling
> me that all of your problems are likely due to the extended upgrade times,
> and yes there is a known issue with handoff between 1.3 and 1.4. They also
> say everything should calm down after all nodes are upgraded.
>
> I will review your system settings now and see if there is something that
> might make the other machines upgrade quicker. So three more questions:
>
> - what is the average size of your keys

Bucket names are between 5 and 15 characters (we only have ~10 buckets), and
key names are normally something like 26iesj:hovh7egz.

> - what is the average size of your value (data stored)

I have to guess, but the mean (as reported by Riak) is 12 KB and the 95th
percentile is at 75 KB. In theory we have a limit of 1 MB (larger values get
split up), but thanks to siblings (we have two buckets with allow_mult) we
sometimes also see up to 7 MB, which gets reduced again (it comes from a new
feature in our app that issues too many parallel writes within 15 ms).

> - in regular use, are your keys accessed randomly across their entire range,
> or do they contain a date component which clusters older, less used keys

Normally we don't search but retrieve objects by key name. We have data up to
6 months old, and we mostly access the new/active/hot data, not the old
entries. Besides this, we have a job doing a 2i query every 5 minutes and
another one doing so maybe once an hour; neither works while the upgrade is
ongoing (2i isn't working in the mixed cluster).
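For reference, the 5-minute job is a plain 2i range query over HTTP, roughly
like the sketch below (bucket, index name, and range are placeholders here,
not our real ones). It only returns the matching keys; the objects are then
fetched by key as usual:

  # hypothetical example of the kind of 2i range query our job runs
  curl 'http://localhost:8098/buckets/mybucket/index/created_at_int/20131201/20131211'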
Cheers,
Simon

> Matthew
>
>
> On Dec 11, 2013, at 8:43 AM, Simon Effenberg <[email protected]>
> wrote:
>
> > Oh and at the moment they are waiting for some handoffs and I see
> > errors in the logfiles:
> >
> > 2013-12-11 13:41:47.948 UTC [error]
> > <0.7157.24>@riak_core_handoff_sender:start_fold:269 hinted_handoff
> > transfer of riak_kv_vnode from '[email protected]'
> > 468137243207554840987117797979434404733540892672
> >
> > but I remember that somebody else had this as well, and if I recall
> > correctly it disappeared after the full upgrade was done.. but at the
> > moment it's hard to think about upgrading everything at once..
> > (~12 hours of 100% disk utilization on all 12 nodes will lead to really
> > slow puts/gets)
> >
> > What can I do?
> >
> > Cheers
> > Simon
> >
> > PS: transfers output:
> >
> > '[email protected]' waiting to handoff 17 partitions
> > '[email protected]' waiting to handoff 19 partitions
> >
> > (these are the 1.4.2 nodes)
> >
> >
> > On Wed, 11 Dec 2013 14:39:58 +0100
> > Simon Effenberg <[email protected]> wrote:
> >
> >> Also some side notes:
> >>
> >> "top" looks even better on the new 1.4.2 than on the 1.3.1 machines..
> >> IO utilization of the disks is mostly the same (roughly 33%)..
> >>
> >> but
> >>
> >> 95th percentile of response time for get (avg over all nodes):
> >>   before upgrade: 29 ms
> >>   after upgrade: almost the same
> >>
> >> 95th percentile of response time for put (avg over all nodes):
> >>   before upgrade: 60 ms
> >>   after upgrade: 1548 ms
> >>   (but this is only because 2 of the 12 nodes are on 1.4.2 and are
> >>   really slow: 17000 ms)
> >>
> >> Cheers,
> >> Simon
> >>
> >> On Wed, 11 Dec 2013 13:45:56 +0100
> >> Simon Effenberg <[email protected]> wrote:
> >>
> >>> Sorry, I forgot half of it..
> >>>
> >>> seffenberg@kriak46-1:~$ free -m
> >>>              total       used       free     shared    buffers     cached
> >>> Mem:         23999      23759        239          0        184      16183
> >>> -/+ buffers/cache:       7391      16607
> >>> Swap:            0          0          0
> >>>
> >>> We have 12 servers.
> >>> datadir on the compacted servers (1.4.2): ~765 GB
> >>>
> >>> AAE is enabled.
> >>>
> >>> I attached app.config and vm.args.
> >>>
> >>> Cheers
> >>> Simon
> >>>
> >>> On Wed, 11 Dec 2013 07:33:31 -0500
> >>> Matthew Von-Maszewski <[email protected]> wrote:
> >>>
> >>>> Ok, I am now suspecting that your servers are either using swap space
> >>>> (which is slow) or your leveldb file cache is thrashing (opening and
> >>>> closing multiple files per request).
> >>>>
> >>>> How many servers do you have, and do you use Riak's active anti-entropy
> >>>> feature? I am going to plug all of this into a spreadsheet.
> >>>>
> >>>> Matthew Von-Maszewski
> >>>>
> >>>> On Dec 11, 2013, at 7:09, Simon Effenberg <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi Matthew
> >>>>>
> >>>>> Memory: 23999 MB
> >>>>>
> >>>>> ring_creation_size, 256
> >>>>> max_open_files, 100
> >>>>>
> >>>>> riak-admin status:
> >>>>>
> >>>>> memory_total : 276001360
> >>>>> memory_processes : 191506322
> >>>>> memory_processes_used : 191439568
> >>>>> memory_system : 84495038
> >>>>> memory_atom : 686993
> >>>>> memory_atom_used : 686560
> >>>>> memory_binary : 21965352
> >>>>> memory_code : 11332732
> >>>>> memory_ets : 10823528
> >>>>>
> >>>>> Thanks for looking!
> >>>>>
> >>>>> Cheers
> >>>>> Simon
> >>>>>
> >>>>> On Wed, 11 Dec 2013 06:44:42 -0500
> >>>>> Matthew Von-Maszewski <[email protected]> wrote:
> >>>>>
> >>>>>> I need to ask other developers as they arrive for the new day. This
> >>>>>> does not make sense to me.
> >>>>>>
> >>>>>> How many nodes do you have? How much RAM do you have in each node?
> >>>>>> What are your settings for max_open_files and cache_size in the
> >>>>>> app.config file? Maybe this is as simple as leveldb using too much
> >>>>>> RAM in 1.4. The memory accounting for max_open_files changed in 1.4.
> >>>>>>
> >>>>>> Matthew Von-Maszewski
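(In case the app.config attachment gets stripped on the list: the relevant
eleveldb section looks roughly like this. The data_root path and the
cache_size value are illustrative placeholders, not copied from our config;
max_open_files really is 100.)

  {eleveldb, [
      %% path is just a typical layout, adjust as needed
      {data_root, "/var/lib/riak/leveldb"},
      {max_open_files, 100},
      %% placeholder value, see the attached app.config for the real one
      {cache_size, 8388608}
  ]}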
> >>>>>>
> >>>>>> On Dec 11, 2013, at 6:28, Simon Effenberg <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Matthew,
> >>>>>>>
> >>>>>>> it took around 11 hours for the first node to finish the compaction.
> >>>>>>> The second node has been running for 12 hours already and is still
> >>>>>>> doing compaction.
> >>>>>>>
> >>>>>>> Besides that, I wonder why the fsm_put time on the new 1.4.2 host is
> >>>>>>> much higher (after the compaction) than on an old 1.3.1 host (both
> >>>>>>> are running in the cluster right now, and another one is doing the
> >>>>>>> compaction/upgrade while it is in the cluster but not directly
> >>>>>>> accessible because it was taken out of the load balancer):
> >>>>>>>
> >>>>>>> 1.4.2:
> >>>>>>>
> >>>>>>> node_put_fsm_time_mean : 2208050
> >>>>>>> node_put_fsm_time_median : 39231
> >>>>>>> node_put_fsm_time_95 : 17400382
> >>>>>>> node_put_fsm_time_99 : 50965752
> >>>>>>> node_put_fsm_time_100 : 59537762
> >>>>>>> node_put_fsm_active : 5
> >>>>>>> node_put_fsm_active_60s : 364
> >>>>>>> node_put_fsm_in_rate : 5
> >>>>>>> node_put_fsm_out_rate : 3
> >>>>>>> node_put_fsm_rejected : 0
> >>>>>>> node_put_fsm_rejected_60s : 0
> >>>>>>> node_put_fsm_rejected_total : 0
> >>>>>>>
> >>>>>>> 1.3.1:
> >>>>>>>
> >>>>>>> node_put_fsm_time_mean : 5036
> >>>>>>> node_put_fsm_time_median : 1614
> >>>>>>> node_put_fsm_time_95 : 8789
> >>>>>>> node_put_fsm_time_99 : 38258
> >>>>>>> node_put_fsm_time_100 : 384372
> >>>>>>>
> >>>>>>> Any clue why this could/should be?
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>> Simon
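(For anyone reproducing the comparison: the node_put_fsm_time_* values are in
microseconds, and they come straight out of riak-admin status on each node,
e.g.:

  riak-admin status | egrep 'node_put_fsm_time|node_put_fsm_active'
)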
> >>>>>>>
> >>>>>>> On Tue, 10 Dec 2013 17:21:07 +0100
> >>>>>>> Simon Effenberg <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi Matthew,
> >>>>>>>>
> >>>>>>>> thanks! That answers my questions!
> >>>>>>>>
> >>>>>>>> Cheers
> >>>>>>>> Simon
> >>>>>>>>
> >>>>>>>> On Tue, 10 Dec 2013 11:08:32 -0500
> >>>>>>>> Matthew Von-Maszewski <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> 2i is not my expertise, so I had to discuss your concerns with
> >>>>>>>>> another Basho developer. He says:
> >>>>>>>>>
> >>>>>>>>> Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk
> >>>>>>>>> format. You must wait for all nodes to update if you desire to use
> >>>>>>>>> the new 2i query. The 2i data will properly write/update on both
> >>>>>>>>> 1.3 and 1.4 machines during the migration.
> >>>>>>>>>
> >>>>>>>>> Does that answer your question?
> >>>>>>>>>
> >>>>>>>>> And yes, you might see available disk space increase during the
> >>>>>>>>> upgrade compactions if your dataset contains numerous delete
> >>>>>>>>> "tombstones". The Riak 2.0 code includes a new feature called
> >>>>>>>>> "aggressive delete" for leveldb. This feature is more proactive in
> >>>>>>>>> pushing delete tombstones through the levels to free up disk space
> >>>>>>>>> much more quickly (especially if you perform block deletes every
> >>>>>>>>> now and then).
> >>>>>>>>>
> >>>>>>>>> Matthew
> >>>>>>>>>
> >>>>>>>>> On Dec 10, 2013, at 10:44 AM, Simon Effenberg
> >>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Matthew,
> >>>>>>>>>>
> >>>>>>>>>> see inline..
> >>>>>>>>>>
> >>>>>>>>>> On Tue, 10 Dec 2013 10:38:03 -0500
> >>>>>>>>>> Matthew Von-Maszewski <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> The sad truth is that you are not the first to see this problem.
> >>>>>>>>>>> And yes, it has to do with your 950 GB per node dataset. And no,
> >>>>>>>>>>> there is nothing to do but sit through it at this time.
> >>>>>>>>>>>
> >>>>>>>>>>> While I did extensive testing around upgrade times before
> >>>>>>>>>>> shipping 1.4, apparently there are data configurations I did not
> >>>>>>>>>>> anticipate. You are likely seeing a cascade where a shift of one
> >>>>>>>>>>> file from level-1 to level-2 is causing a shift of another file
> >>>>>>>>>>> from level-2 to level-3, which causes a level-3 file to shift to
> >>>>>>>>>>> level-4, etc … then the next file shifts from level-1.
> >>>>>>>>>>>
> >>>>>>>>>>> The bright side of this pain is that you will end up with better
> >>>>>>>>>>> write throughput once all the compaction ends.
> >>>>>>>>>>
> >>>>>>>>>> I can deal with that.. but my problem now is: if I'm doing this
> >>>>>>>>>> node by node, it looks like 2i searches aren't possible while 1.3
> >>>>>>>>>> and 1.4 nodes coexist in the cluster. Is there any problem that
> >>>>>>>>>> would lead me to a 2i repair marathon, or can I simply wait some
> >>>>>>>>>> hours for each node until all merges are done before I upgrade
> >>>>>>>>>> the next one? (2i searches can fail for some time, the app can
> >>>>>>>>>> handle that, but are new inserts with 2i indices processed
> >>>>>>>>>> successfully, or do I have to run a 2i repair?)
> >>>>>>>>>>
> >>>>>>>>>> /s
> >>>>>>>>>>
> >>>>>>>>>> One other good thing: saving disk space is an advantage ;)..
> >>>>>>>>>>
> >>>>>>>>>>> Riak 2.0's leveldb has code to prevent/reduce compaction
> >>>>>>>>>>> cascades, but that is not going to help you today.
> >>>>>>>>>>>
> >>>>>>>>>>> Matthew
> >>>>>>>>>>>
> >>>>>>>>>>> On Dec 10, 2013, at 10:26 AM, Simon Effenberg
> >>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi @list,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm trying to upgrade our Riak cluster from 1.3.1 to 1.4.2.
> >>>>>>>>>>>> After upgrading the first node (out of 12), this node seems to
> >>>>>>>>>>>> do many merges. The sst_* directories change in size "rapidly"
> >>>>>>>>>>>> and the node has a disk utilization of 100% all the time.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I know the release notes mention something like this:
> >>>>>>>>>>>>
> >>>>>>>>>>>> "The first execution of 1.4.0 leveldb using a 1.3.x or 1.2.x
> >>>>>>>>>>>> dataset will initiate an automatic conversion that could pause
> >>>>>>>>>>>> the startup of each node by 3 to 7 minutes. The leveldb data in
> >>>>>>>>>>>> "level #1" is being adjusted such that "level #1" can operate
> >>>>>>>>>>>> as an overlapped data level instead of as a sorted data level.
> >>>>>>>>>>>> The conversion is simply the reduction of the number of files
> >>>>>>>>>>>> in "level #1" to being less than eight via normal compaction of
> >>>>>>>>>>>> data from "level #1" into "level #2". This is a one time
> >>>>>>>>>>>> conversion."
> >>>>>>>>>>>>
> >>>>>>>>>>>> but it looks much more invasive than explained there, or it may
> >>>>>>>>>>>> have nothing to do with the merges I'm (probably) seeing.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is this "normal" behavior, or can I do anything about it?
> >>>>>>>>>>>>
> >>>>>>>>>>>> At the moment I'm stuck in the upgrade procedure because this
> >>>>>>>>>>>> high IO load would probably lead to high response times.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Also, we have a lot of data (~950 GB per node).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers
> >>>>>>>>>>>> Simon
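In case it helps anyone following along: I'm watching the compaction progress
with the two commands below. The leveldb path matches our layout, so adjust
it as needed.

  # rough compaction progress: total size of all leveldb level directories
  du -sch /var/lib/riak/leveldb/*/sst_* | tail -n 1

  # disk utilization (the %util column), refreshed every 60 seconds
  iostat -x 60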
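And regarding the pending handoffs: if I recall correctly, 1.4 added a knob
to throttle handoff concurrency, which might at least soften the IO impact
while the mixed cluster settles (command from memory, please double-check
against the 1.4 docs):

  # limit concurrent handoffs on a single 1.4.2 node (example value)
  riak-admin transfer-limit [email protected] 1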
--
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon: + 49-(0)30-8109 - 7173
Fax: + 49-(0)30-8109 - 7131

Mail: [email protected]
Web: www.mobile.de

Marktplatz 1 | 14532 Europarc Dreilinden | Germany

Geschäftsführer: Malte Krüger
HRB Nr.: 18517 P, Amtsgericht Potsdam
Sitz der Gesellschaft: Kleinmachnow

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
