On 1/10/2018 1:15 AM, Alec Hothan (ahothan) wrote:
+1 on the “no valid host found”, this one should be at the very top of the to-be-fixed list.

Very difficult to troubleshoot filters in lab testing (let alone in production) as there can be many of them. This will get worst with more NFV optimization filters with so many combinations it gets really complex to debug when a VM cannot be launched with NFV optimized flavors. With the scheduler filtering engine, there should be a systematic way to log the reason for not finding a valid host - at the very least the error should display which filter failed as an ERROR (and not as DEBUG).

We really need to avoid deploying with DEBUG log level but unfortunately this is the only way to troubleshoot. Too many debug-level logs are for development debug (meaning pretty much useless in any circumstances – developer forgot to remove before commit of the feature), many errors that should be logged as ERROR but have been logged as DEBUG only.


The scheduler logs do print which filter returned 0 hosts for a given request at INFO level.

For example, I was debugging this NoValidHost failure in a CI run:

http://logs.openstack.org/20/531020/3/check/tempest-full/1287dde/job-output.txt.gz#_2018-01-05_17_39_54_336999

And tracing the request ID to the scheduler logs, and filtering on just INFO level logging to simulate what you'd have for the default in production, I found the filter that kicked it out here:

http://logs.openstack.org/20/531020/3/check/tempest-full/1287dde/controller/logs/screen-n-sch.txt.gz?level=INFO#_Jan_05_17_00_30_564582

And there is a summary like this:

Jan 05 17:00:30.565996 ubuntu-xenial-infracloud-chocolate-0001705073 nova-scheduler[8932]: INFO nova.filters [None req-737984ae-3ae8-4506-a5f9-6655a4ebf206 tempest-ServersAdminTestJSON-787960229 tempest-ServersAdminTestJSON-787960229] Filtering removed all hosts for the request with instance ID '8ae8dc23-8f3b-4f0f-8775-2dcc2a5fc75b'. Filter results: ['RetryFilter: (start: 1, end: 1)', 'AvailabilityZoneFilter: (start: 1, end: 1)', 'ComputeFilter: (start: 1, end: 1)', 'ComputeCapabilitiesFilter: (start: 1, end: 1)', 'ImagePropertiesFilter: (start: 1, end: 1)', 'ServerGroupAntiAffinityFilter: (start: 1, end: 1)', 'ServerGroupAffinityFilter: (start: 1, end: 1)', 'SameHostFilter: (start: 1, end: 0)']

"end: 0" means that's the filter that rejected the request. Without digging into the actual code, the descriptions for the filters is here:

https://docs.openstack.org/nova/latest/user/filter-scheduler.html

Now just why this request failed requires a bit of understanding of why my environment looks like (this CI run is using the CachingScheduler), so there isn't a simple "ERROR: SameHostFilter rejected request because you're using the CachingScheduler which is racy by design and you created the instances in separate requests". You'd have a ton of false negative ERRORs in the logs because of valid reasons for rejected a request based on the current state of the system, which is going to make debugging real issues that much harder.

--

Thanks,

Matt

_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

Reply via email to