Hi all,

Recently I've been working on two blueprints[1][2], both of which involve recording scheduling information. I would like to hear some comments on several design choices.

Problem Statement
--
* The NoValidHost exception can mask the real reason an instance failed to spin up. Consider the following sequence of events: "run_instance" on host1 fails to spin up an instance because of a port allocation failure in neutron. The request is cast back to the scheduler to pick the next available host. It fails again on host2 with the same port allocation error. After the maximum of 3 retries, the instance is set to the "ERROR" state with a NoValidHost exception, and there is no easy way to find out what really went wrong.

* Scheduling information is currently recorded in several different log items, which makes it difficult to look up when debugging.

Design Proposal
--
1. Blueprint internal-scheduler[1] tries to address problem #1. After the conductor retrieves the selected destination hosts from the scheduler, it creates a "scheduler_records_allocations" item in the database for each instance/host allocation.

   Design choices:
   a) Correlate scheduler_records_allocations with the 'create' instance action, and generate a combined view with the instance-action events.
   b) Add a separate new API to retrieve this information.

   I prefer choice a), because instance action events fit this use case well, and the allocation records would supplement the information shown when viewing the 'create' action events of an instance.

   Thoughts?

NOTE: The following chart is also available at link[3], in case of any format/display issues.

scheduler_records_allocations
+-----------------------------+
|allocation_id: 9001          |
|instance_uuid: inst1_uuid    |                     scheduler_records
|scheduler_record_id: 1210    |                     +------------------------------+
|host: host1                  |                     |scheduler_record_id: 1210     |
|weight: 197.0                |  +---------------+  |user_id: 'u_fakeid'           |
|result: Failed               |  |instance1      |  |project_id: 'p_fakeid'        |
|reason: 'No more IP addresses|  +---------------+  |request_id: 'req-xxx'         |
+-----------------------------+                     |instance_uuids: [             |
+-----------------------------+  +---------------+  |    'inst1_uuid',             |
|allocation_id: 9002          |  |instance2      |  |    'inst2_uuid']             |
|instance_uuid: inst2_uuid    |  +---------------+  |request_spec: {...}           |
|scheduler_record_id: 1210    |                     |filter_properties: {...}      |
|host: host2                  |                     |scheduler_records_allocations:|
|weight: 128.0                |                     |    [9001, 9002]              |
|result: Success              |                     |start_time: ...               |
|reason:                      |                     |finish_time: ...              |
+-----------------------------+                     +------------------------------+
+-----------------------------+
|allocation_id: 9003          |
|instance_uuid: inst1_uuid    |
|scheduler_record_id: 1210    |
|host: host2                  |
|weight: 64.0                 |
|result: Failed               |
|reason: 'No more IP addresses|
+-----------------------------+

2. Blueprint record-scheduler-information[2] tries to solve problem #2 by generating structured information for each scheduler run.

   Design choices:
   a) Record 'scheduler_records' info in the database. This is easy to query, but introduces a significant burden in terms of performance, extra database space usage, clean-up/archiving policy, security-related issues[4], etc.
   b) Record 'scheduler_records' in a separate log file in JSON format, one line per scheduler run, and then add a new API extension to retrieve the last n (given as a query parameter) scheduler records. This approach avoids the database issues, plays well with external tooling, and provides a central place to view the log. As a compromise, though, we won't be able to query the logs for a specific request_id.

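To make choice 2 b) a bit more concrete, here is a minimal sketch of what "one JSON line per scheduler run, retrieve the last n" could look like. It is only an illustration, not code from either blueprint: the function names append_record and last_records, the log file name, and the record fields used in the demo are all assumptions made for this example.

```python
import json
import os
import tempfile
from collections import deque


def append_record(path, record):
    """Append one scheduler record to the log as a single JSON line.

    Hypothetical helper for illustration only.
    """
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def last_records(path, n):
    """Return up to the last n scheduler records, oldest first.

    Hypothetical helper; reads the file sequentially and keeps only
    the final n lines in memory.
    """
    with open(path, encoding="utf-8") as f:
        tail = deque(f, maxlen=n)
    return [json.loads(line) for line in tail if line.strip()]


if __name__ == "__main__":
    # Assumed file name, written to a throwaway temp directory.
    log_path = os.path.join(tempfile.mkdtemp(), "scheduler_records.log")
    for i in range(5):
        append_record(log_path, {
            "scheduler_record_id": 1210 + i,
            "request_id": "req-%d" % i,
            "instance_uuids": ["inst1_uuid", "inst2_uuid"],
        })
    for rec in last_records(log_path, 2):
        print(rec["scheduler_record_id"], rec["request_id"])
```

With this layout, looking up a specific request_id would mean scanning the whole file, which is the compromise mentioned in 2 b).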
So the question is: is a database storage solution still desirable? Or should we implement backend drivers that the deployer can choose between? In that case, however, the API would have to be the minimum set that both backends can support.

Any comments or thoughts are highly appreciated.

[1] https://blueprints.launchpad.net/nova/+spec/internal-scheduler
[2] https://blueprints.launchpad.net/nova/+spec/record-scheduler-information
[3] https://docs.google.com/document/d/1EsSNeq_tD-3NiX4IphCrQj4ii0_dO-8-Jn7NWHRJPNg/edit?usp=sharing
[4] https://bugs.launchpad.net/nova/+bug/1175193

Thanks,
--
Qiu Yu

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev