Clint Byrum wrote:
> Excerpts from Chris Friesen's message of 2015-07-20 14:30:53 -0700:
>> On 07/20/2015 02:04 PM, Clint Byrum wrote:
>>> Excerpts from Chris Friesen's message of 2015-07-20 12:17:29 -0700:
>>>> Some questions:
>>>>
>>>> 1) Could you elaborate a bit on how this would work? I don't quite
>>>> understand how you would handle a request for booting an instance
>>>> with a certain set of resources--would you queue up a message for
>>>> each resource?
>>>
>>> Please be concrete about what you mean by resource. I'm suggesting
>>> that if you only have flavors, which have cpu, ram, disk, and rx/tx
>>> ratios, then each flavor is a queue. That's the easiest problem to
>>> solve. Then if you have a single special thing that can only have one
>>> VM per host (let's say, a PCI pass-through device), then that's
>>> another iteration of each flavor. So, assuming 3 flavors:
>>>
>>> 1=tiny    cpu=1,ram=1024m,disk=5gb,rxtx=1
>>> 2=medium  cpu=2,ram=4096m,disk=100gb,rxtx=2
>>> 3=large   cpu=8,ram=16384m,disk=200gb,rxtx=2
>>>
>>> This means you have these queues:
>>>
>>> reserve
>>> release
>>> compute,cpu=1,ram=1024m,disk=5gb,rxtx=1,pci=1
>>> compute,cpu=1,ram=1024m,disk=5gb,rxtx=1
>>> compute,cpu=2,ram=4096m,disk=100gb,rxtx=2,pci=1
>>> compute,cpu=2,ram=4096m,disk=100gb,rxtx=2
>>> compute,cpu=8,ram=16384m,disk=200gb,rxtx=2,pci=1
>>> compute,cpu=8,ram=16384m,disk=200gb,rxtx=2
>>>
>>> <snip>
>>>
>>> Now, I've made this argument in the past, and people have pointed out
>>> that the permutations can get into the tens of thousands very easily
>>> if you start adding lots of dimensions and/or flavors. I suggest that
>>> is no big deal, but maybe I'm biased, because I have done something
>>> like that in Gearman and it was, in fact, no big deal.
>>
>> Yeah, that's what I was worried about. We have things that can be
>> specified per flavor, things that can be specified per image, and
>> things that can be specified per instance, and they all multiply
>> together.
>
> So all that matters is the size of the set of permutations that people
> are using _now_ to request nodes.
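The queue-per-flavor idea above can be sketched in a few lines. This is a minimal illustration, not anything in Nova or Gearman: the names `FLAVORS` and `flavor_queue_name` are hypothetical, and the only requirement is that the queue name be a deterministic function of the requested dimensions, so that a boot request and a subscribing compute node derive the same string.

```python
# Hypothetical sketch of deterministic queue naming per flavor.
# Dimension values mirror Clint's three example flavors.
FLAVORS = {
    "tiny":   {"cpu": 1, "ram": "1024m", "disk": "5gb", "rxtx": 1},
    "medium": {"cpu": 2, "ram": "4096m", "disk": "100gb", "rxtx": 2},
    "large":  {"cpu": 8, "ram": "16384m", "disk": "200gb", "rxtx": 2},
}

def flavor_queue_name(flavor, pci=False):
    """Build a canonical queue name from a flavor's dimensions.

    Keys are sorted so that any party building the name independently
    produces the same string; any fixed ordering would work as well.
    """
    dims = FLAVORS[flavor]
    parts = ["compute"] + ["%s=%s" % (k, dims[k]) for k in sorted(dims)]
    if pci:
        parts.append("pci=1")
    return ",".join(parts)

# Each boolean extra (like PCI pass-through) doubles the queue count,
# which is why the permutations multiply as dimensions are added.
queues = [flavor_queue_name(f, pci) for f in FLAVORS for pci in (False, True)]
```

Note that only queues someone actually requests need to exist, which is the point of the "set of permutations that people are using _now_" argument.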
> It's relatively low-cost to create the queues in a distributed manner
> and just have compute nodes listen to a broadcast for new ones that
> they should try to subscribe to. Even if there are 1 million queues
> possible, it's unlikely there will be 1 million legitimate unique boot
> arguments. This does complicate things quite a bit, though, so part of
> me just wants to suggest "don't do that". ;)
>
>>>> 2) How would it handle stuff like weight functions, where you could
>>>> have multiple compute nodes that *could* satisfy the requirement,
>>>> but some of them would be "better" than others by some arbitrary
>>>> criteria?
>>>
>>> Can you provide a concrete example? It feels like I'm asking for a
>>> straw man to be built. ;)
>>
>> Well, as an example, we have a cluster that is aimed at
>> high-performance network processing, and so, all else being equal, it
>> will choose the compute node with the least network traffic. You might
>> also try to pack instances together for power efficiency (allowing you
>> to turn off unused compute nodes), or choose the compute node that
>> results in the tightest packing (to minimize unused resources).
>
> Least-utilized is hard, since it requires knowledge of all of the
> nodes' state. It also breaks down and gives zero benefit when all the
> nodes are fully bandwidth-utilized. However, "below 20% utilized" is
> extremely easy, and achieves the actual goal the user stated, since
> each node can self-assess whether it is or is not in that group. This
> way the user gets an error, "I don't have any fully available
> networking for you", instead of unknowingly getting a node that is
> oversubscribed.
>
> Packing is kind of interesting. One can achieve it on an empty cluster
> simply by turning on one node at a time, and whenever a queue has fewer
> than "safety_margin" workers, turning on more nodes. However, once
> nodes are full and workloads are being deleted, you want to assess
> which ones would be the least costly to migrate off of and turn off.
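The "below 20% utilized" self-assessment can be sketched concretely. This is a hypothetical illustration of the idea, not an existing component: each node decides from purely local state which queues to serve, so no scheduler needs a global view. The names `threshold_queues` and `NET_CAPACITY_MBPS` are invented for the example.

```python
# Hypothetical sketch: a compute node self-assesses its network
# utilization and subscribes to a threshold queue only while it
# genuinely qualifies. No global cluster state is consulted.

NET_CAPACITY_MBPS = 10_000  # assumed per-node link capacity

def threshold_queues(current_mbps, capacity_mbps=NET_CAPACITY_MBPS):
    """Return the set of queues this node should subscribe to."""
    utilization = current_mbps / capacity_mbps
    queues = {"compute"}  # always serve generic boot requests
    if utilization < 0.20:
        # Node advertises itself as "below 20% utilized".
        queues.add("compute,net=below_20pct")
    return queues
```

If no node is subscribed to the threshold queue, a request on it simply finds no worker, which is how the user gets "I don't have any fully available networking for you" instead of an oversubscribed node.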
> I'm inclined to say I would do this from something outside the
> scheduler, as part of a power-reclaimer, though perhaps a centralized
> scheduler that always knows the global state would do a better job
> here. It would need to do so efficiently enough to outweigh the benefit
> of not needing global state awareness at all. An external reclaimer can
> work in an eventually consistent manner, so I would still lean toward
> that over a realtime scheduler, but this needs some experimentation to
> confirm.
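The external reclaimer only needs a periodic, eventually consistent pass over the powered-on nodes. The following is a minimal sketch under stated assumptions: the function name `pick_nodes_to_drain` is hypothetical, and the number of resident VMs stands in as a crude proxy for migration cost.

```python
# Hypothetical sketch of one pass of an external power-reclaimer.
# It ranks powered-on nodes by how cheap they are to evacuate
# (here approximated by resident VM count) and picks the cheapest.

def pick_nodes_to_drain(nodes, reclaim_count):
    """Choose which nodes to evacuate and power off this pass.

    `nodes` maps node name -> number of VMs still running on it;
    `reclaim_count` is how many nodes we want to reclaim. Because
    the pass runs periodically, stale counts only delay a decision
    rather than break it -- eventual consistency is enough.
    """
    candidates = sorted(nodes, key=lambda name: nodes[name])
    return candidates[:reclaim_count]
```

An already-empty node sorts first and can be powered off with no migrations at all.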
From what I've heard (I don't know how widely this is done in the industry), actually turning off nodes can cause more problems than it solves in terms of power costs, cooling, and hardware failures (disk, CPU, and so on), so turning nodes off may not be the best idea. This is all second-hand, though, so it may not reflect what others actually do.
>>>> 3) The biggest improvement I'd like to see is in group scheduling.
>>>> Suppose I want to schedule multiple instances, each with their own
>>>> resource requirements, but also with interdependencies between them
>>>> (these ones on the same node, these ones not on the same node, these
>>>> ones with this provider network, etc.). The scheduler could then
>>>> look at the whole request at once and optimize it, rather than
>>>> looking at each piece separately. That could also allow relocating
>>>> multiple instances that want to be co-located onto the same compute
>>>> node.
>>>
>>> So, if the grouping is arbitrary, then there's no way to
>>> pre-calculate the group size, I agree. I am reluctant to pursue
>>> something like this, though, as I don't really think this is the kind
>>> of optimization that cloud workloads should be built on top of. If
>>> you need two processes to have low latency, why not just boot a
>>> bigger machine and do it all in one VM? There are a few reasons I can
>>> think of, but I wonder how many apply in the general case?
>>
>> It's a fair question. :) I honestly don't know... I was just thinking
>> that we allow the expression of affinity/anti-affinity policies via
>> server groups, but the scheduler doesn't really do a good job of
>> actually scheduling those groups.
>
> For anti-affinity I'm still inclined to say "that's what availability
> zones and regions are for". But I know that for smaller clouds these
> are too heavyweight in their current form to be useful. Perhaps we
> could look at making them less so.

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
