On 08/02/2018 04:10 AM, Chris Dent wrote:
When people ask for something like what Chris mentioned:
hosts with enough CPU: <list1>
hosts that also have enough disk: <list2>
hosts that also have enough memory: <list3>
hosts that also meet extra spec host aggregate keys: <list4>
hosts that also meet image properties host aggregate keys: <list5>
hosts that also have requested PCI devices: <list6>
What are the operational questions that people are trying to answer
with those results? Is the idea to be able to have some insight into
the resource usage and reporting on and from the various hosts and
discover that things are being used differently than thought? Is
placement a resource-monitoring tool, or is it simpler and more
focused than that? Or is it that we might have flavors or other
resource requesting constraints that have bad logic and we want to
see at what stage the failure is? I don't know and I haven't really
seen it stated explicitly here, and knowing it would help.
Do people want info like this for requests as they happen, or to be
able to go back later and try the same request again with some flag
on that says: "diagnose what happened"?
Or to put it another way: before we design something that provides
the information above, which is a solution to an undescribed
problem, can we describe the problem more completely first, to make
sure that the solution we get is the right one? The thing above,
that set of information, is context-free.
The reason my organization added additional failure-case logging to the
pre-placement scheduler was that we were enabling complex features (cpu pinning,
hugepages, PCI, SRIOV, CPU model requests, NUMA topology, etc.) and we were
running into scheduling failures, and people were asking the question "why did
this scheduler request fail to find a valid host?".
There are a few reasons we might want to ask this question, including:
1) double-checking that the scheduler is working properly when first enabling additional features
2) weeding out images/flavors with excessive or mutually-contradictory resource requirements
3) determining whether the cluster needs to be reconfigured to meet user demand
I suspect that something like "do the same request again with a debug flag"
would cover many scenarios. I suspect its main weakness would be dealing with
contention between short-lived entities.
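For what it's worth, the per-filter accounting described above can be sketched roughly as follows. This is a minimal illustration only; the `diagnose` helper, the filter names, and the host dictionaries are hypothetical and are not the actual Nova scheduler API:

```python
# Hypothetical sketch: run candidate hosts through each named filter in
# turn, recording how many hosts survive each stage, so that a failed
# request can report which filter eliminated the last candidates.

def diagnose(hosts, request, filters):
    """Return (surviving hosts, per-stage report of survivor counts)."""
    report = []
    remaining = list(hosts)
    for name, predicate in filters:
        remaining = [h for h in remaining if predicate(h, request)]
        report.append((name, len(remaining)))
        if not remaining:
            break  # this filter eliminated the last candidates
    return remaining, report

# Illustrative filters corresponding to the stages listed above.
filters = [
    ("enough CPU", lambda h, r: h["vcpus_free"] >= r["vcpus"]),
    ("enough disk", lambda h, r: h["disk_free"] >= r["disk"]),
    ("enough memory", lambda h, r: h["ram_free"] >= r["ram"]),
]

hosts = [
    {"name": "node1", "vcpus_free": 8, "disk_free": 100, "ram_free": 4096},
    {"name": "node2", "vcpus_free": 2, "disk_free": 500, "ram_free": 16384},
]
request = {"vcpus": 4, "disk": 50, "ram": 8192}

survivors, report = diagnose(hosts, request, filters)
# report: [("enough CPU", 1), ("enough disk", 1), ("enough memory", 0)]
# i.e. the memory filter is what eliminated the final candidate.
```

A "diagnose what happened" flag could replay a request through something like this and attach the per-stage report to the failure, rather than logging it on every request.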
OpenStack Development Mailing List (not for usage questions)