Thanks for the explanation!

I'm going to claim that the thread revolves around two main areas of 
disagreement.  Then I'm going
to propose a way through:

a) Manual Node Assignment

I think that everyone agrees that automated node assignment through
nova-scheduler is by far the ideal case; there's no disagreement there.

The disagreement comes from whether we need manual node assignment or not.  I 
would argue that we
need to step back and take a look at the real use case: heterogeneous nodes.  
If there are literally
no characteristics that differentiate nodes A and B, then why do we care which 
gets used for what?  Why
do we need to manually assign one?

This is a better way of verbalizing my concerns. I suspect there are going to be quite a few heterogeneous environments built from legacy pieces in the near term and fewer built from the ground up with all new matching hotness.

On the other side of it, instead of handling legacy hardware I was worried about the new hotness (not sure why I keep using that term) specialized for a purpose. This is exactly what Robert described in his GPU example. I think his explanation of how to use the scheduler to accommodate that makes a lot of sense, so I'm much less attached to the idea of strict manual assignment than I previously was.

If we can agree on that, then I think it would be sufficient to say that we 
want a mechanism to allow
UI users to deal with heterogeneous nodes, and that mechanism must use 
nova-scheduler.  In my mind,
that's what resource classes and node profiles are intended for.

One possible objection might be: nova-scheduler doesn't have the
appropriate filter we need to separate out two nodes.  In that case, I
would say that needs to be taken up with nova developers.
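
(For what it's worth, writing a new filter isn't a huge ask.  Below is a
rough, untested sketch of what one could look like, assuming the usual
scheduler filter API - subclass BaseHostFilter, implement host_passes() -
with the 'node_profile' key and the host_state attribute names invented
purely for illustration, not an existing nova feature:)

    # Hypothetical sketch only; module path and method signature follow
    # the nova scheduler filter API, while 'node_profile' and the
    # host_state attribute names are assumptions made for illustration.
    from nova.scheduler import filters


    class NodeProfileFilter(filters.BaseHostFilter):
        """Pass only hosts whose reported profile matches the flavor's."""

        def host_passes(self, host_state, filter_properties):
            instance_type = filter_properties.get('instance_type') or {}
            wanted = instance_type.get('extra_specs', {}).get('node_profile')
            if not wanted:
                # The flavor doesn't ask for anything special; any node will do.
                return True
            # Assumes the compute/Ironic driver reports a 'node_profile'
            # entry in its stats.
            return host_state.stats.get('node_profile') == wanted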


b) Terminology

It feels a bit like some of the disagreement comes from people using
different words for the same thing.  For example, the wireframes already
detail a UI where Robert's roles come first, but I think that message
was confused because I mentioned "node types" in the requirements.

So could we come to some agreement on the most precise terminology?  I've
listed some examples below, but I'm sure there are more.

node type | role
management node | ?
resource node | ?
unallocated | available | undeployed
create a node distribution | size the deployment
resource classes | ?
node profiles | ?

Mainn

----- Original Message -----
On 10 December 2013 09:55, Tzu-Mainn Chen <tzuma...@redhat.com> wrote:
        * created as part of undercloud install process

By that note I meant that Nodes are not resources; Resource instances
run on Nodes.  Nodes are the generic pool of hardware we can deploy
things onto.

I don't think "resource nodes" is intended to imply that nodes are
resources; rather, it's meant to indicate a node where a resource
instance runs, and to distinguish it from "management node" and
"unallocated node".

So the question is are we looking at /nodes/ that have a /current
role/, or are we looking at /roles/ that have some /current nodes/.

My contention is that the role is the interesting thing, and the nodes
are the incidental thing.  That is, as a sysadmin, my hierarchy of
concerns is something like:
  A: are all services running
  B: are any of them in a degraded state where I need to take prompt
action to prevent a service outage [might mean many things: software
update, disk space criticals, a machine failed and we need to scale the
cluster back up, too much load]
  C: are there any planned changes I need to make [new software deploy,
feature request from user, replacing a faulty machine]
  D: are there long term issues sneaking up on me [capacity planning,
machine obsolescence]

If we take /nodes/ as the interesting thing, and what they are doing
right now as the incidental thing, it's much harder to map that onto
the sysadmin concerns.  If we start with /roles/ then we can answer:
  A: by showing the list of roles and the summary stats (how many
machines, service status aggregate), role-level alerts (e.g. nova-api
is not responding)
  B: by showing the list of roles and more detailed stats (overall
load, response times of services, tickets against services), and a list
of in-trouble instances in each role - instances with alerts against
them (low disk, overload, failed service, early-detection alerts from
hardware)
  C: probably out of our remit for now in the general case, but we need
to enable some things here like replacing faulty machines
  D: by looking at trend graphs for roles (not machines), but also by
looking at the hardware in aggregate - breakdown by age of machines,
summary data for tickets filed against instances that were deployed to
a particular machine

C: and D: are (F) category work, but for all but the very last thing,
it seems clear how to approach this from a roles perspective.

I've tried to approach this using /nodes/ as the starting point, and
after two terrible drafts I've deleted the section.  I'd love it if
someone could show me how it would work :)

     * Unallocated nodes

This implies an 'allocation' step that we don't have - how about
'Idle nodes' or something?

It can be auto-allocation.  I don't see a problem with the 'unallocated' term.

Ok, it's not a biggy. I do think it will frame things poorly and lead
to an expectation about how TripleO works that doesn't match how it
does, but we can change it later if I'm right, and if I'm wrong, well
it won't be the first time :).


I'm interested in the distinction you're making here.  I'd rather get
things defined correctly the first time, and it's very possible that I'm
missing a fundamental definition here.

So we have:
  - node - a physical, general-purpose machine capable of running in
many roles.  Some nodes may have a hardware layout that is particularly
useful for a given role.
  - role - a specific workload we want to map onto one or more nodes.
Examples include 'undercloud control plane', 'overcloud control
plane', 'overcloud storage', 'overcloud compute', etc.
  - instance - a role deployed on a node - this is where work actually
happens.
  - scheduling - the process of deciding which role is deployed on
which node.
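
(A throwaway Python sketch, just to pin down how those four terms relate
to each other - none of the names below are real TripleO code:)

    # Toy model only: real TripleO doesn't do this in Python; this is
    # just the vocabulary above written down as data.
    from collections import namedtuple

    Node = namedtuple('Node', ['id', 'hw_profile'])        # physical machine
    Role = namedtuple('Role', ['name', 'image', 'count'])  # desired workload
    Instance = namedtuple('Instance', ['role', 'node'])    # role deployed on a node


    def schedule(roles, nodes):
        """Scheduling: decide which role is deployed on which node."""
        free = list(nodes)
        instances = []
        for role in roles:
            for _ in range(role.count):
                instances.append(Instance(role, free.pop()))
        return instances, free  # 'free' is whatever was never scheduled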

The way TripleO works is that we define a Heat template that lays out
policy: '5 instances of overcloud control plane, please', '20
hypervisors', etc.  Heat passes that to Nova, which pulls the image for
the role out of Glance, picks a node, and deploys the image to the
node.

Note in particular the order: Heat -> Nova -> Scheduler -> Node chosen.

The user action is not 'allocate a Node to the overcloud control
plane'; it is 'size the control plane through Heat'.
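
(In terms of the toy sketch above - counts and names made up - 'sizing'
is just editing a number, and the scheduler does the rest:)

    # 'Sizing the control plane' means changing a count, not picking nodes.
    roles = [Role('overcloud-control', 'control-image', count=5),
             Role('overcloud-compute', 'compute-image', count=20)]
    nodes = [Node(i, 'generic') for i in range(30)]
    instances, leftover = schedule(roles, nodes)
    # 25 instances get scheduled; the 5 leftover nodes are the ones we've
    # been calling 'unallocated' - nothing was ever hand-assigned.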

So when we talk about 'unallocated Nodes', the implication is that
users 'allocate Nodes', but they don't: they size roles, and after
doing all that there may be some Nodes that are - yes - unallocated,
or have nothing scheduled to them. So... I'm not debating that we
should have a list of free hardware - we totally should - I'm debating
how we frame it. 'Available Nodes' or 'Undeployed machines' or
whatever. I just want to get away from talking about something
([manual] allocation) that we don't offer.

-Rob

--
Robert Collins <rbtcoll...@hp.com>
Distinguished Technologist
HP Converged Cloud

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

