(Aside from QOS, I second the notion to review your "failure groups" if you are using and depending on data replication.)
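(For a quick look at how the NSDs are currently assigned, something like

  mmlsdisk FS

will list each disk's failure group; FS here is just a stand-in for your filesystem name.)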
For QOS, some suggestions:

You might want to define a set of nodes that will do restripes, using `mmcrnodeclass restripers -N ...`

You can initially just enable QOS with `mmchqos FS --enable`, run your restripe with `mmrestripefs FS -b -N restripers` (the -N restricts operations to the restripers nodeclass), and monitor its performance with `mmlsqos FS --seconds 60` (see the doc for other options).

Suppose you see average IOPS rates of several thousand and you decide that is interfering with other work. Then, for example, you could "slow down" or "pace" mmrestripefs to use 999 IOPS within the system pool and 1999 IOPS within the data pool with:

  mmchqos FS --enable -N restripers pool=system,maintenance=999iops pool=data,maintenance=1999iops

and monitor that with mmlsqos. (A consolidated sketch of the whole sequence appears at the end of this message.)

Tip: For a more graphical view of QOS and disk performance, try samples/charts/qosplotfine.pl. You will need to have gnuplot working.

If you are "into" performance tools, you might want to look at the --fine-stats options of mmchqos and mmlsqos and plug that output into your favorite performance viewer/plotter/analyzer tool(s). (Technical note: mmlsqos --fine-stats is written to be used and digested by scripts, not so much for human "eyeballing". The --fine-stats argument of mmchqos is a number of seconds; the --fine-stats argument of mmlsqos is one or two index values. The doc for mmlsqos explains this, and the qosplotfine.pl script is an example of how to use it.)

From: "Luis Bolinches" <[email protected]>
To: "gpfsug main discussion list" <[email protected]>
Date: 08/21/2018 12:56 AM
Subject: Re: [gpfsug-discuss] Rebalancing with mmrestripefs -P
Sent by: [email protected]

Hi

You can enable QoS first with the limits left at "inf" to see the current usage, and set the actual limits later on. The limits are modifiable online, so even if you have quieter periods (not your case, it seems), they can be raised for replication then and lowered again at peak times.

--
Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations

Luis Bolinches
Consultant IT Specialist
Mobile Phone: +358503112585
https://www.youracclaim.com/user/luis-bolinches

"If you always give you will always have" -- Anonymous

> On 21 Aug 2018, at 1.21, [email protected] wrote:
>
> Yes, the arrays are in different buildings. We want to spread the activity over more servers if possible, but recognize the extra load that rebalancing would entail. The system is busy all the time.
>
> I have considered using QOS when we run policy migrations, but haven't yet because I don't know what value to allow for throttling IOPS. We need to do weekly migrations off of the 15k rpm pool onto the 7.2k rpm pool, and previously I've just let it run at native speed. I'd like to know what other folks have used for QOS settings.
>
> I think we may leave things alone for now regarding the original question, rebalancing this pool.
>
> -- ddj
> Dave Johnson
>
>> On Aug 20, 2018, at 6:08 PM, [email protected] wrote:
>>
>> On Mon, 20 Aug 2018 14:02:05 -0400, "Frederick Stock" said:
>>
>>> Note you have two more NSDs in the 33 failure group than you do in
>>> the 23 failure group. You may want to change one of those NSDs in failure
>>> group 33 to be in failure group 23 so you have equal storage space in both
>>> failure groups.
>>
>> Keep in mind that the failure groups should be built up based on single points of failure.
>> In other words, a failure group should consist of disks that will all stay up or all go down on
>> the same failure (controller, network, whatever).
>>
>> Looking at the fact that you have 6 disks named 'dNN_george_33' and 8 named 'dNN_cit_33',
>> it sounds very likely that they are in two different storage arrays, and you should make your
>> failure groups so they don't span a storage array. In other words, taking a 'cit' disk
>> and moving it into a 'george' failure group will Do The Wrong Thing, because if you do
>> data replication, one copy can go onto a 'george' disk, and the other onto a 'cit' disk
>> that's in the same array as the 'george' disk. If 'george' fails, you lose access to both
>> replicas.
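P.S. To pull the QOS pieces above together, here is a rough sketch of the whole sequence. It assumes a filesystem named FS and a nodeclass named restripers (substitute your own names), and the 999/1999 IOPS limits are only illustrative; pick values based on what mmlsqos actually shows on your system:

  # define the set of nodes that will do the restripe (replace ... with your node list)
  mmcrnodeclass restripers -N ...

  # turn QOS on, with no limits set yet, and start the rebalancing restripe
  mmchqos FS --enable
  mmrestripefs FS -b -N restripers

  # from another session, watch the QOS/IOPS activity to get a baseline
  mmlsqos FS --seconds 60

  # if the maintenance traffic is interfering with other work, pace it per storage pool
  mmchqos FS --enable -N restripers pool=system,maintenance=999iops pool=data,maintenance=1999iops

  # keep watching the effect
  mmlsqos FS --seconds 60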
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
