Here is the pointer to Falkon:
http://dev.globus.org/wiki/Incubator/Falkon

BTW, with Falkon, 10K short-running tasks can be completed in a few seconds (1~10 sec), given a fast enough file system and enough processors! There is of course the overhead of allocating the initial resources, which is usually on the order of 10~60 seconds if the resources are available, but this is a one-time cost.
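
To put the one-time cost in perspective, here is a rough back-of-the-envelope sketch. It only divides the allocation time quoted above by the number of tasks; the function name is made up for illustration and has nothing to do with Falkon's own code.

```python
def amortized_overhead_ms(startup_seconds, num_tasks):
    """Spread a one-time startup cost evenly over all tasks (milliseconds per task)."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return startup_seconds * 1000.0 / num_tasks


if __name__ == "__main__":
    # Allocation times from the range quoted above (10~60 sec), spread over 10K tasks.
    for startup in (10, 60):
        per_task = amortized_overhead_ms(startup, 10_000)
        print(f"{startup} s startup over 10,000 tasks: {per_task:.1f} ms per task")
```

So even at the slow end of that range, the allocation cost works out to a few milliseconds per task once it is spread over 10K tasks.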

Ioan

Charles Bacon wrote:
On Jul 24, 2008, at 3:18 AM, Hans-Martin Adorf wrote:

We once had a "sampling" use case that would have required submitting
10,000 or more short-running jobs to the grid, in bunches of 100 or so
at the same time, but we refrained from even trying to implement it due to
the latency issues discussed in this thread.

This is the kind of use case that Ioan Raicu has been writing about. The standard grid way to deal with this is to run what are called "pilot jobs". Instead of submitting 100 short-running jobs, you submit a small number of jobs and leave them running as longer-lived control nodes that you feed your shorter jobs to. Ioan's particular implementation is Falkon, but people have also done this kind of thing with Condor-G glide-ins and other technologies.

It's a good way of reducing your scheduling startup costs, and is in use by several production grids.
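
For anyone who has not seen the pattern, here is a minimal local sketch of the pilot-job idea, assuming plain Python threads stand in for the long-running grid jobs. It is not how Falkon or Condor-G glide-ins are implemented; the names run_pilot and run_with_pilots are hypothetical and exist only for this illustration.

```python
import queue
import threading


def run_pilot(task_queue, results, lock):
    """One long-running pilot: pull short tasks until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:
            break  # no more work for this pilot
        value = task()
        with lock:
            results.append(value)


def run_with_pilots(tasks, num_pilots=4):
    """Start a few pilots once, then feed them many short tasks."""
    task_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    pilots = [
        threading.Thread(target=run_pilot, args=(task_queue, results, lock))
        for _ in range(num_pilots)
    ]
    for pilot in pilots:
        pilot.start()  # the startup cost is paid once per pilot, not per task
    for task in tasks:
        task_queue.put(task)
    for _ in pilots:
        task_queue.put(None)  # one sentinel per pilot
    for pilot in pilots:
        pilot.join()
    return results


if __name__ == "__main__":
    short_tasks = [lambda i=i: i * i for i in range(100)]
    print(len(run_with_pilots(short_tasks)), "tasks completed")
```

The point of the design is that the scheduler only ever sees num_pilots submissions, while the short tasks travel over a much cheaper channel (here an in-memory queue). Results come back in completion order, not submission order.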


Charles



--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: [EMAIL PROTECTED]
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================

