We have two OSSs, each with two quad-core AMD Opterons, 8GB of RAM, and two OSTs (4.4T and 3.5T). Backend storage is a pair of Sun StorageTek 2540s connected via 8Gb fiber.
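
As a rough sanity check on the thread count for this hardware, here is a minimal sketch. It assumes the rule of thumb I recall from the Lustre 1.8 manual (one service thread per 128MB of RAM per CPU core, capped at 512); please verify both figures against the manual for your version. The function name `default_oss_threads` is just a hypothetical helper for illustration.

```shell
#!/bin/sh
# Estimate the default upper bound on OSS service threads.
# ASSUMPTION: 1 thread per 128MB of RAM per CPU core, capped at 512
# (as recalled from the Lustre 1.8 manual; verify for your release).
default_oss_threads() {
    ram_mb=$1   # RAM per OSS, in MB
    cpus=$2     # CPU cores per OSS
    threads=$(( ram_mb / 128 * cpus ))
    if [ "$threads" -gt 512 ]; then
        threads=512
    fi
    echo "$threads"
}

# 8GB of RAM and 2 x quad-core CPUs per OSS, as described above.
default_oss_threads 8192 8
```

If that rule holds, each OSS here would be allowed up to 512 threads, which would fit with threads_started climbing past 300 under load. Lowering the cap on the fly should then be a matter of something like `lctl set_param ost.OSS.ost_io.threads_max=128` on each OSS, though I have not tried that yet.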
What about tweaking max_dirty_mb on the client side?

On Wed, Feb 1, 2012 at 6:33 PM, Carlos Thomaz <[email protected]> wrote:
> David,
>
> The oss service threads is a function of your RAM size and CPUs. It's
> difficult to say what would be a good upper limit without knowing the size
> of your OSS, # clients, storage back-end and workload. But the good thing
> you can give a try on the fly via lctl set_param command.
>
> Assuming you are running lustre 1.8, here is a good explanation on how to
> do it:
> http://wiki.lustre.org/manual/LustreManual18_HTML/LustreProc.html#50651263_87260
>
> Some remarks:
> - reducing the number of OSS threads may impact the performance depending
> on how is your workload.
> - unfortunately I guess you will need to try and see what happens. I would
> go for 128 and analyze the behavior of your OSSs (via log files) and also
> keeping an eye on your workload. Seems to me that 300 is a bit too high
> (but again, I don't know what you have on your storage back-end or OSS
> configuration).
>
> I can't tell you much about the lru_size, but as far as I understand the
> values are dynamic and there's not much to do rather than clear the last
> recently used queue or disable the lru sizing. I can't help much on this
> other than pointing you out the explanation for it (see 31.2.11):
>
> http://wiki.lustre.org/manual/LustreManual20_HTML/LustreProc.html
>
> Regards,
> Carlos
>
> --
> Carlos Thomaz | HPC Systems Architect
> Mobile: +1 (303) 519-0578
> [email protected] | Skype ID: carlosthomaz
> DataDirect Networks, Inc.
> 9960 Federal Dr., Ste 100 Colorado Springs, CO 80921
> ddn.com <http://www.ddn.com/> | Twitter: @ddn_limitless
> <http://twitter.com/ddn_limitless> | 1.800.TERABYTE
>
> On 2/1/12 2:11 PM, "David Noriega" <[email protected]> wrote:
>
>>zone_reclaim_mode is 0 on all clients/servers
>>
>>When changing number of service threads or the lru_size, can these be
>>done on the fly or do they require a reboot of either client or
>>server?
>>For my two OSTs, cat /proc/fs/lustre/ost/OSS/ost_io/threads_started
>>give about 300(300, 359) so I'm thinking try half of that and see how
>>it goes?
>>
>>Also checking lru_size, I get different numbers from the clients. cat
>>/proc/fs/lustre/ldlm/namespaces/*/lru_size
>>
>>Client:            MDT0  OST0    OST1   OST2   OST3   MGC
>>head node:         0     22      22     22     22     400   (only a few users logged in)
>>busy node:         1     501     504    503    505    400   (Fully loaded with jobs)
>>samba/nfs server:  4     440070  44370  44348  26282  1600
>>
>>So my understanding is the lru_size is set to auto by default thus the
>>varying values, but setting it manually is effectively setting a max
>>value? Also what does it mean to have a lower value(especially in the
>>case of the samba/nfs server)?
>>
>>On Wed, Feb 1, 2012 at 1:27 PM, Charles Taylor <[email protected]> wrote:
>>>
>>> You may also want to check and, if necessary, limit the lru_size on
>>>your clients. I believe there are guidelines in the ops manual.
>>>We have ~750 clients and limit ours to 600 per OST. That, combined
>>>with the setting zone_reclaim_mode=0 should make a big difference.
>>>
>>> Regards,
>>>
>>> Charlie Taylor
>>> UF HPC Center
>>>
>>>
>>> On Feb 1, 2012, at 2:04 PM, Carlos Thomaz wrote:
>>>
>>>> Hi David,
>>>>
>>>> You may be facing the same issue discussed on previous threads, which
>>>>is
>>>> the issue regarding the zone_reclaim_mode.
>>>>
>>>> Take a look on the previous thread where myself and Kevin replied to
>>>> Vijesh Ek.
>>>>
>>>> If you don't have access to the previous emails, look at your kernel
>>>> settings for the zone reclaim:
>>>>
>>>> cat /proc/sys/vm/zone_reclaim_mode
>>>>
>>>> It should be set to 0.
>>>>
>>>> Also, look at the number of Lustre OSS service threads. It may be set
>>>>to
>>>> high...
>>>>
>>>> Rgds.
>>>> Carlos.
>>>>
>>>>
>>>> --
>>>> Carlos Thomaz | HPC Systems Architect
>>>> Mobile: +1 (303) 519-0578
>>>> [email protected] | Skype ID: carlosthomaz
>>>> DataDirect Networks, Inc.
>>>> 9960 Federal Dr., Ste 100 Colorado Springs, CO 80921
>>>> ddn.com <http://www.ddn.com/> | Twitter: @ddn_limitless
>>>> <http://twitter.com/ddn_limitless> | 1.800.TERABYTE
>>>>
>>>>
>>>> On 2/1/12 11:57 AM, "David Noriega" <[email protected]> wrote:
>>>>
>>>>> indicates the system was overloaded (too many service threads, or
>>>>>
>>>>
>>>> _______________________________________________
>>>> Lustre-discuss mailing list
>>>> [email protected]
>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>>> Charles A. Taylor, Ph.D.
>>> Associate Director,
>>> UF HPC Center
>>> (352) 392-4036
>>>
>>
>>
>>--
>>David Noriega
>>System Administrator
>>Computational Biology Initiative
>>High Performance Computing Center
>>University of Texas at San Antonio
>>One UTSA Circle
>>San Antonio, TX 78249
>>Office: BSE 3.112
>>Phone: 210-458-7100
>>http://www.cbi.utsa.edu
>>_______________________________________________
>>Lustre-discuss mailing list
>>[email protected]
>>http://lists.lustre.org/mailman/listinfo/lustre-discuss
>

--
David Noriega
System Administrator
Computational Biology Initiative
High Performance Computing Center
University of Texas at San Antonio
One UTSA Circle
San Antonio, TX 78249
Office: BSE 3.112
Phone: 210-458-7100
http://www.cbi.utsa.edu
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
