Hi,
Jan Ploski wrote:
[EMAIL PROTECTED] schrieb am 05/28/2008 05:43:08 AM:
On May 27, 2008, at 12:20 AM, Yuriy wrote:
We have a 10-node cluster with 2 quad-core processors per node, and
when the number of jobs is greater than 160
Why are you treating Globus job submissions like an extended batch
queue?
...
Anyone want to chime in?
I would say that treating Globus job submissions like an extended batch
queue should be among the allowable use cases. Users of a local batch
scheduler may view the "Grid" as a drop-in replacement, which they expect
to be at least as easy to use and efficient, just more fault-tolerant and
scalable. The fewer gotchas, incompatibilities, weird issues, and
technical workarounds they must care about, the better. It's hard enough
to convince them to abandon their familiar client software. So if Globus
consumes system resources even for idle jobs, then it seems to me like a
design or implementation flaw in Globus, not a user's misunderstanding. (I
believe it is no longer as bad in GT 4 as it was in earlier versions.
GRAM2 was notorious for consuming lots of resources even for idle
jobs. GRAM4 is much better at handling many concurrent submissions and
at keeping jobs queued in an idle state until processors become
available. Part of the problem is that GRAM and the LRMs are loosely
coupled and interact through log files, ssh sessions, etc. If each LRM
(e.g. Condor, PBS, SGE) defined standard interfaces (i.e. WS), GRAM
could interact with it more efficiently, but they do not, so there is
only so much the implementation can do with the interfaces it
currently has. Production LRMs also have scalability problems of their
own: their performance degrades significantly when their queues grow
or when status information is queried too often. The LRMs are
improving, but many production Grids are still running older LRM
versions, which have performance and scalability issues under high
load.
Your remarks about the need for client-side submission
throttling are of course correct.)
It is important to know the limitations of the resource management
infrastructure, and to use throttling to ensure that you stay within
safe performance margins.
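
To make the idea concrete, here is a minimal sketch of client-side
throttling in Python. It is not Globus or GRAM code: run_throttled,
submit, is_done, max_in_flight and poll_interval are all hypothetical
names, and the caller would have to wrap whatever client it really
uses. The loop keeps at most max_in_flight jobs outstanding and holds
the rest on the client until a slot frees up. The right limit depends
on the site and has to be found by measurement.

```python
import collections
import time


def run_throttled(jobs, submit, is_done, max_in_flight, poll_interval=5.0):
    """Submit jobs while keeping at most max_in_flight of them outstanding.

    submit(job) must return a handle; is_done(handle) must return True
    once that job has finished. Both are supplied by the caller
    (hypothetical wrappers, not part of any real Globus API).
    Returns (finished_handles, peak_number_in_flight).
    """
    pending = collections.deque(jobs)
    active = []
    finished = []
    peak = 0
    while pending or active:
        # Top up to the limit; everything else waits on the client side.
        while pending and len(active) < max_in_flight:
            active.append(submit(pending.popleft()))
        peak = max(peak, len(active))

        # Poll each outstanding job once and sort it into done / running.
        still_running = []
        completed_now = 0
        for handle in active:
            if is_done(handle):
                finished.append(handle)
                completed_now += 1
            else:
                still_running.append(handle)
        active = still_running

        # Nothing finished: back off so status is not queried too often.
        if completed_now == 0 and active:
            time.sleep(poll_interval)
    return finished, peak
```

Polling once per pass and sleeping when nothing has changed also keeps
the number of status queries low, which matters for the LRM issues
mentioned above.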
Cheers,
Ioan
Regards,
Jan Ploski
--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: [EMAIL PROTECTED]
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================