Yeah, that's what I'm hoping to talk to them about and get them to do. I was just curious to hear what others have done when they ran into this situation, and whether there is any tuning on the system side that helps. I've put a couple of rough sketches below and after the quoted thread. Thanks,

-Steve
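P.S. In case it's useful to anyone else reading this, what I have in mind to show them is something along these lines: one Torque job that loops over a batch of their short runs, so the scheduler and gold only see one job start and one job stop per batch. This is only a rough sketch; the job name, walltime, executable, input/output naming and batch size are made-up placeholders, not their actual code:

#!/bin/sh
#PBS -N short_run_batch
#PBS -l nodes=1:ppn=1
#PBS -l walltime=02:00:00

cd $PBS_O_WORKDIR

# Run 100 of the short executions back to back inside a single job
# instead of submitting 100 separate jobs.
for i in `seq 1 100`; do
    ./homegrown_code input.$i > output.$i
done

Submitted with qsub as usual, their 10,000 single-cpu runs would become 100 jobs of 100 runs each instead of 10,000 individual jobs.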
On May 14, 2009, at 2:24 PM, Si Hammond wrote:

> Hi Roy/Steve,
>
> Could you get the user to put multiple executions inside one submit
> script/job, so that they get the same executions done, but in coarser
> grained blocks rather than as individual jobs?
>
>
> Si Hammond
>
> High Performance Systems Group
> University of Warwick, UK
>
> On 14 May 2009, at 19:11, Steve Young wrote:
>
>> Thanks Roy,
>> Actually, I'm doing something similar... max_queuable = 120 for the
>> execution queue, and everyone gets a MAXIJOB of 4. Like you mention,
>> I'd rather not worry about small jobs either, and traditionally we've
>> had jobs that run for hours or weeks, which works perfectly. Just this
>> one user (who refuses to look at MPI since they've been told it's too
>> hard to learn) seems to come up with strange ways of running their
>> homegrown code. We're using mysql for the db server. I expect that
>> anything I can tweak would be there, similar to what you point out for
>> postgres. Any pointers people have for mysql databases and gold? I
>> hope at the least I will be able to talk them into running multiple
>> commands within one job, so as to save on the overhead of
>> starting/stopping the job on a node. Thanks again for the advice.
>>
>> -Steve
>>
>> On May 14, 2009, at 1:23 PM, Roy Dragseth wrote:
>>
>>> On Thursday 14 May 2009 17:45:22 Steve Young wrote:
>>>> Hi all,
>>>> I have been experiencing a problem with a user submitting thousands
>>>> of jobs. Most of the jobs seem to either finish in a matter of
>>>> seconds or aren't really doing anything. I'm using torque, maui and
>>>> gold, with a routing queue to contain the 10,000 jobs they submit
>>>> (all single-cpu jobs). The routing queue works fine and routes to
>>>> the proper execution queue (able to run 116 at a time). However, as
>>>> the system chews through the jobs and tries to execute them, they
>>>> drop off so fast that the system has a hard time keeping up. The
>>>> mysql server goes to 100%, and there is even a load on goldd. I
>>>> suspect the flurry of jobs starting and stopping so fast means the
>>>> reservations and other record-keeping in maui/gold are creating
>>>> this load.
>>>> I'm hoping to get the user to make some changes to how they submit
>>>> jobs (but they can be difficult at times). I suspect that if the
>>>> jobs ran for even 5 minutes or so, the system could at least keep
>>>> up. So I'm curious to know if any others have run into this type of
>>>> problem and what you did to solve it. Are there some changes in
>>>> torque/maui/gold that I could make to help alleviate this?
>>>>
>>>
>>> I posted the exact same question on the gold list last year (but the
>>> archive at pnl.gov is gone and I could not find the thread on the
>>> clusterresources archive).
>>>
>>> If you do not want to write your own layer between maui and gold,
>>> you're pretty much stuck.
>>>
>>> We ended up limiting the number of idle and running jobs per user.
>>> By default each user is limited to 200 running jobs and 16 idle
>>> jobs:
>>>
>>> USERCFG[DEFAULT] MAXJOB=200 MAXIJOB=16
>>>
>>> Our policy is not to optimize the batch system for lots of small
>>> jobs. By setting the above limits we sort of encourage our users to
>>> adjust their work setup. Even if you bring down the response time
>>> from accounting, the scaling will be limited; Amdahl's law will kick
>>> in eventually...
>>>
>>> If you're using postgres as the backend for gold, you should vacuum
>>> the database regularly. The gold user has this in crontab:
>>>
>>> # su - gold
>>> -bash-3.00$ crontab -l
>>> 00 04 * * * sh /opt/gold/vacuum.sh
>>> -bash-3.00$ cat /opt/gold/vacuum.sh
>>> #!/bin/sh
>>>
>>> # vacuum the database, makes it run faster.
>>>
>>> /usr/bin/psql -c "vacuum; vacuum analyze;"
>>>
>>> Doing this brought the accounting response time down from 6 seconds
>>> to 1. (Our db server is really slow...)
>>>
>>> r.
>>>
>>> --
>>> The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
>>> phone: +47 77 64 41 07, fax: +47 77 64 41 00
>>> Roy Dragseth, Team Leader, High Performance Computing
>>> Direct call: +47 77 64 62 56. email: [email protected]
>
> Si Hammond
>
> Performance Modelling, Analysis and Optimisation Team
> High Performance Systems Group
> Department of Computer Science
> University of Warwick, CV4 7AL, UK
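P.S. On the MySQL side of my question, the closest thing to Roy's vacuum cron job that I can think of would be a nightly ANALYZE/OPTIMIZE pass over the gold database with mysqlcheck. This is only a sketch I haven't tried yet; the script path, schedule and database name are placeholders, and credentials are assumed to come from the gold user's ~/.my.cnf:

# su - gold
-bash-3.00$ crontab -l
00 04 * * * sh /opt/gold/mysql_maintenance.sh
-bash-3.00$ cat /opt/gold/mysql_maintenance.sh
#!/bin/sh

# Refresh index statistics and reclaim space in the gold tables,
# roughly what vacuum/vacuum analyze does for postgres. The database
# name "gold" is a placeholder for whatever the actual gold db is
# called; credentials come from ~/.my.cnf.

/usr/bin/mysqlcheck --analyze gold
/usr/bin/mysqlcheck --optimize gold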

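P.P.S. For the archive, the limits discussed above map onto configuration roughly like this; the queue name is a placeholder for our execution queue, and the USERCFG line uses the same maui.cfg syntax Roy showed, just with our per-user limit of 4 idle jobs:

# Torque: cap how many jobs the execution queue will accept at once
# (queue name "exec" is a placeholder).
qmgr -c "set queue exec max_queuable = 120"

# Maui (maui.cfg): limit every user to 4 idle jobs by default.
USERCFG[DEFAULT] MAXIJOB=4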