Is it just me, or do HPC clustering and virtualization fall on
opposite ends of the spectrum?

depends on your definitions.  virtualization certainly conflicts with
those aspects of HPC which require bare-metal performance.  even if you
can reduce the overhead of virtualization, the question is why?  look
at the basic sort of HPC environment: compute nodes running a single
distro, controlled by a scheduler.  from the user's or job's
perspective, there are just some nodes - it doesn't matter which ones,
or even how many in total.  the user _should_ be able to assume that
when they land on a node, it behaves as if freshly installed and booted
de novo.  we don't reboot nodes between jobs, of course, or even make
much effort towards preventing a serial job from noticing other serial
jobs on the same node (as containers would, let alone VMs).  but we
could, without tons of effort; it would just lower utilization.

virtualization is about a few things:
        - improve utilization by coalescing low-duty-cycle services.
        - isolate services from each other - either to directly arbitrate
        runtime resource contention, or to disentangle configurations.
        - encapsulate all the state of a server so it can be moved.

I think the first axis is quite non-HPC, since I don't think of HPC jobs
as being like idle services.  (OTOH, many clusters have good utilization
because multiple workloads get interleaved _above_ the processor level.)
the second axis is not often an HPC problem, at least not in my
experience, where J Random Fortran user doesn't really care that much
about the environment (ie, they want f77 and lapack and empty queues).
migration has some HPC appeal, since it permits defragmenting a
cluster, as well as better preemption.

Gavin, not necessarily. You could have a cluster of HPC compute nodes
running a minimal base OS.
Then install specific virtual machines with different OS/software stacks
each time you run a job.

or for each job, just install the provided OS image on the bare metal...
your job's done, have it halt or reboot the node ;)
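a scheduler prolog along those lines could be sketched like this
(Python; the node names, image path, and reboot step are all invented
for illustration - a real site would hang this off the scheduler's
prolog hooks and use PXE/kexec or an imaging tool to reprovision):

```python
import subprocess

def reprovision_cmds(node, image):
    """return the steps a per-job prolog would run to put the job's own
    OS image on bare metal.  purely illustrative: the target path and
    the use of scp/ssh are made up, not a real provisioning system."""
    return [
        ["scp", image, f"{node}:/boot/job-image"],  # ship the image out
        ["ssh", node, "reboot"],                    # boot de novo into it
    ]

def run_prolog(node, image, dry_run=True):
    """execute (or just print) the reprovisioning steps for one node."""
    for cmd in reprovision_cmds(node, image):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

the dry_run flag is there because actually reimaging a node is exactly
the kind of thing you want to rehearse first.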

OK, this is probably more relevant for grid or cloud computing - I first
thought this would be a good idea when seeing that (at the time) the
CERN LHC Grid software would only run with Redhat 7.2.
So you could imagine 'packaging up' a virtual machine which has your
particular OS flavour/libraries/compilers and shipping
it out with the job.

grid and cloud computing are all part of the same game, no?  along with
massively parallel low-latency MPI, old-style vector supercomputing,
GPU-assisted computing, throughput serial farming, etc.

right, that's one of the axes of the problem-space: whether the app
gets its own custom runtime environment (in the sense of kernel, libc,
etc).  another axis is the degree to which the app has to contend for
resources (as in an overcommitted normal cluster, or a VM without
guaranteed resources.)

Another reason could be fault tolerance - you run VMs on the compute
nodes. When you detect a hardware fault is coming along
(eg from ECC errors or disk errors) you perform a live migration from
one node to another - and your job keeps on trucking.
(In theory, checkpointing needed etc. etc.)
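the policy side of that could look something like this sketch (Python;
the ECC threshold and the node table are invented knobs - a real
monitor would read the kernel's EDAC counters, and the migration
itself would go through something like libvirt):

```python
def should_evacuate(corrected_ecc_errors, threshold=10):
    """decide whether a node looks likely to fail soon.  the threshold
    is a made-up policy knob, not a vendor recommendation."""
    return corrected_ecc_errors >= threshold

def pick_target(free_ram_by_node, vm_ram):
    """pick a live-migration target with room for the VM, or None.
    prefers the node with the most free RAM (one plausible policy)."""
    fits = {n: r for n, r in free_ram_by_node.items() if r >= vm_ram}
    return max(fits, key=fits.get) if fits else None
```

e.g. a node reporting 12 corrected ECC errors against a threshold of
10 would get flagged, and its VMs sent to whichever node has the most
free RAM that still fits them.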

I'm pretty skeptical about this - the main issue with checkpointing is
external side-effects.  checkpointing networked apps (including MPI) is
hard because you have state "in flight", so you can only freeze-dry the
state by quiescing (letting the messages land, etc).
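that quiescing step can be shown in miniature (Python; `apply_msg` is
a hypothetical handler and the deque stands in for the network - real
MPI checkpointers have to drain every channel, not one queue):

```python
from collections import deque

def quiesce_and_checkpoint(state, in_flight, apply_msg):
    """stop new sends (assumed already done by the caller), let every
    in-flight message land, then snapshot.  only after the drain is
    the snapshot consistent - that's the freeze-dry step."""
    while in_flight:
        apply_msg(state, in_flight.popleft())   # let the message land
    return dict(state)                          # consistent snapshot
```

with a running counter, for instance, draining messages [1, 2, 3] into
{'count': 0} before snapshotting yields {'count': 6}; snapshotting
before the drain would silently lose those three messages.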

the "live migration" demos I've seen have been apps that are tolerant
to the loss of in-flight transactions (or which retry automatically).

so I don't think virt is any kind of paradigm-changer, just like
manycore merely stretches existing definitions.

-mark
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
