Hi,

Jan Ploski wrote:
[EMAIL PROTECTED] schrieb am 05/28/2008 05:43:08 AM:

On May 27, 2008, at 12:20 AM, Yuriy wrote:

We have a 10-node cluster with 2
quad-core processors per node, and when the number of jobs is greater than
160
Why are you treating globus job submissions like an extended batch queue?

...

Anyone want to chime in?

I would say that treating Globus job submissions like an extended batch queue should be among the allowable use cases. Users of a local batch scheduler may view the "Grid" as a drop-in replacement, which they expect to be at least as easy to use and as efficient, just more fault-tolerant and scalable. The fewer gotchas, incompatibilities, weird issues, and technical workarounds they must care about, the better. It's hard enough to convince them to abandon their familiar client software. So if Globus consumes system resources even for idle jobs, that seems to me like a design or implementation flaw in Globus, not a user's misunderstanding. (I believe it is no longer as bad in GT 4 as it used to be in the earlier versions.
GRAM2 was notoriously bad at consuming lots of resources even for idle jobs. GRAM4 is much better at handling many concurrent submissions and at keeping jobs queued in an idle state until processors become available. Part of the problem is that GRAM and the LRMs (local resource managers) are loosely coupled and interact through log files, ssh sessions, etc. If each LRM (e.g. Condor, PBS, SGE) exposed a standard interface (e.g. WS), GRAM could interact with it more efficiently; but they do not, so there is only so much the implementation can do with the interfaces it currently has. Production LRMs also have scalability problems of their own: their performance degrades significantly when their queues grow, or when status information is queried too often. The LRMs are improving, but many production Grids still run older LRM releases that have performance and scalability issues under high load.
Your remarks about the need for client-side submission throttling are of course correct.)
It is important to know the limitations of the resource management infrastructure, and to use throttling to stay within its safe performance margins.
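As a rough illustration of client-side throttling, here is a minimal Python sketch, assuming a hypothetical client where `submit` sends one job description to the gatekeeper and returns a handle, and `status` returns one of "PENDING", "ACTIVE", "DONE", or "FAILED". Neither hook is a real Globus API; they stand in for whatever client interface you actually use. The sketch caps the number of jobs outstanding at the remote site and polls each job with an exponentially growing interval, so the LRM's status interface is not hammered while jobs sit idle in its queue.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

# Hypothetical hooks (assumptions, not a real Globus API):
#   submit(job_description) -> job handle
#   status(job_handle) -> "PENDING" | "ACTIVE" | "DONE" | "FAILED"
SubmitFn = Callable[[str], str]
StatusFn = Callable[[str], str]


def wait_with_backoff(handle: str, status: StatusFn, initial: float = 1.0,
                      factor: float = 2.0, max_interval: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll a job until it reaches a terminal state, multiplying the wait
    between polls by `factor` (capped at `max_interval`) so status is not
    queried too often while the job is idle in the LRM queue."""
    interval = initial
    while True:
        state = status(handle)
        if state in ("DONE", "FAILED"):
            return state
        sleep(interval)
        interval = min(interval * factor, max_interval)


def run_throttled(jobs: Iterable[str], submit: SubmitFn, status: StatusFn,
                  max_in_flight: int = 8,
                  sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Run all jobs while keeping at most `max_in_flight` outstanding.

    Each worker thread submits one job and blocks until that job finishes,
    so the pool size is the cap on concurrent submissions."""
    def one(job: str) -> str:
        handle = submit(job)
        return wait_with_backoff(handle, status, sleep=sleep)

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # pool.map preserves input order in the returned states.
        return list(pool.map(one, jobs))
```

The cap (`max_in_flight`) and the backoff parameters are placeholders; the right values depend on how many queued jobs the particular gatekeeper and LRM can tolerate. Using a thread per outstanding job keeps the sketch short; a client managing thousands of jobs would more likely use a single event loop or batched status queries.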

Cheers,
Ioan
Regards,
Jan Ploski



--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: [EMAIL PROTECTED]
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================

