Thanks for your input.

After thinking about it and looking over the code a bit more, I'm considering a new 
approach where I adapt the batch system (SLURM on a Cray, in my case) to 
fit in with a lightly modified Mesos executor rather than create a new executor.

1. As needed, start up a Mesos agent on a full batch node (in my case, 32 
cores + 128 GB memory) with a 24-hour wall clock limit. Add the wall-clock end 
time as an attribute on the Mesos agent.
2. Extend BashOperator as BatchOperator (include core count and memory; use 
the execution_timeout as the wall clock). Otherwise use defaults, as the Mesos 
executor currently does.
3. Extend the MesosExecutor to examine the offer's wall-clock attribute (set 
when launching the agent). If the job matches the offer (execution_timeout, plus 
cpu/memory if it's a BatchOperator), accept the offer.
4. (Later) Add and kill agents as needed.
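For step 1, the sbatch script might look something like the sketch below. This is a hypothetical dry-run version (the master address, resource sizes, and the `wallclock_end` attribute name are placeholders I made up, not anything from the actual setup); it prints the mesos-agent invocation instead of exec'ing it:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=24:00:00    # 24-hour wall clock limit

# Compute the wall-clock end time (epoch seconds) so the scheduler can
# see how long this agent will live before SLURM kills it.
WALLCLOCK_SECS=$((24 * 3600))
WALLCLOCK_END=$(($(date +%s) + WALLCLOCK_SECS))

# Advertise the full node's resources plus the end time as an agent
# attribute; mesos-agent supports --resources and --attributes flags.
CMD="mesos-agent --master=mesos-master.example.org:5050 \
  --resources=cpus:32;mem:131072 \
  --attributes=wallclock_end:${WALLCLOCK_END}"

# Dry run: print the command rather than launching the agent.
echo "$CMD"
```

In the real script the `echo` would become an `exec` (or a foreground run) so the agent dies with the batch job.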

So, I'm effectively using the batch system as a mechanism for temporarily 
provisioning Mesos workers/agents. We'll see how this experiment works out.
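The offer-matching logic in step 3 could be sketched roughly as below. This is a minimal standalone sketch, not actual Airflow or Mesos API code; `offer_matches` and the `wallclock_end` attribute name are assumptions of mine, and in practice this check would live inside the executor's resourceOffers callback:

```python
import time

# Hypothetical attribute name, set when launching the agent (step 1).
WALLCLOCK_ATTR = "wallclock_end"

def offer_matches(offer_attributes, offer_cpus, offer_mem,
                  task_timeout_secs, task_cpus=1, task_mem=256, now=None):
    """Return True if the agent behind this offer will live long enough
    to run the task (execution_timeout) and has enough cpu/memory."""
    now = time.time() if now is None else now
    end = float(offer_attributes.get(WALLCLOCK_ATTR, 0))
    # Reject if the task's execution_timeout would overrun the agent's
    # remaining wall-clock time.
    if now + task_timeout_secs > end:
        return False
    # Reject if the offer is too small for a BatchOperator's request.
    return offer_cpus >= task_cpus and offer_mem >= task_mem
```

A task with defaults would then match any sufficiently long-lived offer, while a BatchOperator asking for the whole node would only match a full-node agent.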

Brian


> On Jun 6, 2016, at 2:56 PM, Lance Norskog <[email protected]> wrote:
> 
> We don't do remote control, but we do have custom servers for Java apps. We
> wrote our own web services to wrap these Java processes. As long as you use
> standard cloud-style "the other end is flaky" design principles, this
> should serve you well.
> 
> There is an option in Airflow for different flavors of "celery" executors.
> Each executor talks directly to a central database. I would not try to run
> it across remote sites.
> 
> On Mon, Jun 6, 2016 at 2:42 PM, Van Klaveren, Brian N. <
> [email protected]> wrote:
> 
>> Hi,
>> 
>> I'm interested in integrating some traditional batch systems with Airflow
>> so I can run against any available batch resources. My use case is that I'd
>> like to run a single airflow instance as a multi-tenant service which can
>> dispatch to heterogeneous batch systems across the physical globe. A system
>> I maintain does this, and I know HTCondor+DAGMan can do this by treating
>> the batch systems as "grid resources". I'm trying to understand if this
>> makes sense to even try with Airflow, so I have a few questions.
>> 
>> 1. Has anyone looked into or tried this before? I've searched for several
>> hours and was unable to find much on this.
>> 
>> 2. I have a rough idea how Airflow works, but I haven't dug deep into the
>> code. If I were to implement something like this, should it be done as an
>> operator (i.e. extend BashOperator?) or an executor (Mesos Executor), or maybe
>> both?
>> 
>> 3. I've done this kind of thing in the past, and typically you end up with a
>> daemon/microservice running for each batch system. That microservice may be
>> local to the batch system (works best in the case of LSF/Torque/etc.), or it
>> may be local to the workflow engine but using some sort of exported remote
>> API (e.g. grid-connected resources, often using Globus APIs and x509
>> certs), or there may be another layer of abstraction involved (in the case
>> of DIRAC). Then you have a wrapper/pilot script which will trap a few
>> signals and communicate back to the microservice or to a message queue
>> (usually through HTTP or email, because some batch systems are behind
>> restrictive firewalls) when a job actually starts or finishes.
>> 
>> Thanks,
>> Brian
> 
> 
> 
> 
> -- 
> Lance Norskog
> [email protected]
> Redwood City, CA
