On Jul 24, 2008, at 3:18 AM, Hans-Martin Adorf wrote:

We once had a "sampling" use case that would have required submitting
10,000 or more short-running jobs to the grid, in bunches of 100 or so
at the same time, but we refrained from even trying to implement it due to
the latency issues discussed in this thread.

This is the kind of use case that Ioan Raicu has been writing about. The standard grid way to deal with it is to run what are called "pilot jobs". Instead of submitting 100 short-running jobs, you submit a small number of jobs and leave them running as longer-lived control nodes that you feed your shorter jobs to. Ioan's particular implementation is Falkon, but people have also done this kind of thing with Condor-G glide-ins and other technologies.

It's a good way of reducing your scheduling startup costs, and is in use by several production grids.
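
To make the pattern concrete, here is a minimal local sketch in Python. It is only an analogy: threads stand in for the pilot jobs and an in-process queue stands in for whatever channel feeds them work. It is not how Falkon or Condor-G glide-ins are implemented, and the names run_pilot, run_with_pilots and sample are made up for illustration.

```python
import queue
import threading


def run_pilot(tasks, results):
    """One long-running 'pilot': pull short tasks until a None sentinel arrives."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: no more work for this pilot
            break
        func, arg = task
        results.put((arg, func(arg)))


def run_with_pilots(func, args, num_pilots=4):
    """Start a few pilots once, then feed them many short tasks."""
    tasks = queue.Queue()
    results = queue.Queue()
    pilots = [
        threading.Thread(target=run_pilot, args=(tasks, results))
        for _ in range(num_pilots)
    ]
    for pilot in pilots:
        pilot.start()  # the startup cost is paid once per pilot, not per task
    for arg in args:
        tasks.put((func, arg))
    for _ in pilots:
        tasks.put(None)  # one sentinel per pilot
    for pilot in pilots:
        pilot.join()
    # All pilots have finished, so the results queue can be drained safely.
    collected = {}
    while not results.empty():
        arg, value = results.get()
        collected[arg] = value
    return collected


def sample(x):
    """Hypothetical short-running job."""
    return x * x
```

Calling run_with_pilots(sample, range(100), num_pilots=4) runs 100 short tasks on 4 long-lived workers, so the startup cost is paid 4 times rather than 100.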


Charles
