On 11/09/13 05:51, Adrian Otto wrote:
> I have a different point of view. First I will offer some assertions:
It's not clear to me what you actually have an issue with. (Top-posting
is not helping in this respect.)
> A-1) We need to keep it simple.
> A-1.1) Systems that are hard to comprehend are hard to debug, and
> that's bad.
Absolutely, and systems with higher entropy are harder to comprehend.
> A-1.2) Complex systems tend to be much more brittle than simple ones.
"The Zen of Python" has it right here:
Simple is better than complex.
Complex is better than complicated.
Complicated systems have a lot of entropy. Complex systems (that is to
say, systems composed of multiple simpler systems) are actually a tool
for _reducing_ entropy.
> A-2) Scale-up operations need to be as-fast-as-possible.
> A-2.1) Auto-Scaling only works right if your new capacity is added
> quickly when your controller detects that you need more. If you spend
> a bunch of time goofing around before actually adding a new resource
> to a pool when it's under strain, the capacity arrives too late to help.
> A-2.2) The fewer network round trips between "add-more-resources-now"
> and "resources-added" the better. Fewer = less brittle.
I submit that the difference between a packet round-trip time within a
single datacenter and the time to boot a Nova server is at least 3
orders of magnitude.
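To put rough numbers on that (ballpark assumptions, not measurements):

```python
import math

# Assumed ballpark figures, not measurements:
rtt_s = 0.001   # ~1 ms packet round trip within a single datacenter
boot_s = 60.0   # ~60 s to boot a typical Nova server

ratio = boot_s / rtt_s
print(f"boot/RTT ratio: {ratio:.0f} (~{math.log10(ratio):.1f} orders of magnitude)")
```

Even with a generous 1 ms round trip and a fast 60 s boot, the boot dominates by nearly five orders of magnitude, so saving a round trip or two is in the noise.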
> A-3) The control logic for scaling different applications varies.
> A-3.1) What metrics are watched may differ between various use cases.
> A-3.2) The data types that represent sensor data may vary.
> A-3.3) The policy that's applied to the metrics (such as max, min, and
> cooldown period) varies between applications. Not only do the values
> vary, but so does the logic itself.
> A-3.4) A scaling policy may not just be a handful of simple parameters.
> Ideally it allows configurable logic that the end-user can control to
> some extent.
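A policy with min/max bounds and a cooldown period, as in A-3.3, might be sketched like this (illustrative names only; this is not Heat's or anyone else's actual API):

```python
class ScalingPolicy:
    """Minimal sketch of a scaling policy with min/max group sizes and a
    cooldown period. Hypothetical interface, for illustration only."""

    def __init__(self, min_size, max_size, cooldown):
        self.min_size = min_size
        self.max_size = max_size
        self.cooldown = cooldown       # seconds between scaling actions
        self._last_action = None       # timestamp of the last action taken

    def decide(self, current_size, adjustment, now):
        """Return the new group size, or None if the policy vetoes the change."""
        if self._last_action is not None and now - self._last_action < self.cooldown:
            return None  # still inside the cooldown window
        new_size = max(self.min_size, min(self.max_size, current_size + adjustment))
        if new_size == current_size:
            return None  # clamped to a bound; nothing to do
        self._last_action = now
        return new_size
```

Even this toy version shows the point of A-3.4: the interesting part is the logic (cooldown, clamping), not the parameter values.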
> A-4) Auto-scale operations are usually not orchestrations. They are
> usually simple linear workflows.
Well, one of the things Chris wants to do with this is to scale whole
templates instead of just Nova servers.
> A-4.1) The Taskflow project[1] offers a simple way to do workflows and
> stable state management that can be integrated directly into Autoscale.
> A-4.2) A task flow (workflow) can trigger a Heat orchestration if
> needed.
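For concreteness, a linear workflow of the kind A-4 describes is just a sequence of steps with state threaded through them. (This is a toy stand-in to illustrate the shape, not the real TaskFlow API.)

```python
class LinearFlow:
    """Toy stand-in for a TaskFlow-style linear workflow: run each step
    in order, threading a context dict through so state is recorded at
    every stage. Illustrative only; TaskFlow's real patterns live in
    taskflow.patterns."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def add(self, func):
        self.steps.append(func)
        return self

    def run(self, context):
        for step in self.steps:
            context = step(context)
        return context

# A scale-up expressed as a simple linear workflow:
flow = (LinearFlow("scale-up")
        .add(lambda ctx: {**ctx, "server": "booted"})       # boot instance
        .add(lambda ctx: {**ctx, "pool": "member added"}))  # add to LB pool
result = flow.run({"group": "web"})
```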
If you're re-proposing Chris's original thought of having two different
ways to do autoscaling depending on whether it's for individual
instances or whole templates, then I fail to see how that is in any
sense simpler than having only one way that handles everything.
> Now a mental tool to think about control policies:
> Auto-scaling is like steering a car. The control policy says that you
> want to drive equally between the two lane lines, and that if you
> drift off center, you gradually correct back toward center again. If
> the road bends, you try to remain in your lane as the lane lines
> curve. You try not to weave around in your lane, and you try not to
> drift out of the lane.
OK, in the sense that both are proportional control systems, sure.
(Though in autoscaling, unlike the car, both the feedback loop and the
response have significant non-linearities.)
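In server-count terms, one tick of proportional control might be sketched like this (purely illustrative, and it shows one of those non-linearities directly: the correction has to be rounded to whole servers):

```python
def proportional_step(current, desired, gain=0.5):
    """One tick of a proportional controller: correct a fraction of the
    error, like gently steering back toward the lane centre. Rounding
    to whole servers is one source of non-linearity; clamping at zero
    is another. Hypothetical helper, for illustration only."""
    error = desired - current
    return max(0, current + round(gain * error))
```

For example, `proportional_step(4, 10)` returns 7: half of the shortfall is corrected per tick.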
> If your controller notices that you are about to drift out of your
> lane because the road is starting to bend, and you are distracted, or
> your hands slip off the wheel, you might drift out of your lane into
> nearby traffic. That's why you don't want a Rube Goldberg Machine[2]
> between you and the steering wheel. See assertions A-1 and A-2.
But you probably do want a power steering device between the wheel and
the steering rack. I think this metaphor is ready for the scrapheap ;)
There was (IMHO) a Rube Goldberg-like device proposed in this thread,
but not by me :D
> If you are driving an 18-wheel tractor/trailer truck, steering is
> different than if you are driving a Fiat. You need to wait longer and
> steer toward the outside of curves so your trailer does not lag behind
> on the inside of the curve behind you as you correct for a bend in the
> road. When you are driving the Fiat, you may want to aim for the
> middle of the lane at all times, possibly even apexing bends to reduce
> your driving distance, which is actually the opposite of what truck
> drivers need to do. Control policies apply to other parts of driving
> too. I want a different policy for braking than I use for steering. On
> some vehicles I go through a gear shifting workflow, and on others I
> don't. See assertion A-3.
Right, PID control systems are more general.
The idea of allowing the user to substitute their own scaling policy
engine has always been on the road map since you and others raised it at
Summit, though, and it's orthogonal to the parts of the design you're
questioning below. So I'm not really sure what you're, uh, driving at
(no pun intended).
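For reference, the textbook discrete PID update (a generic control-theory sketch, not any proposed OpenStack interface) looks like:

```python
class PID:
    """Generic discrete PID controller: the output combines the current
    error (P), its accumulated history (I), and its rate of change (D).
    A pure-P policy is the special case ki = kd = 0."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt=1.0):
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

A pluggable policy engine would let users choose the gains, or swap in entirely different logic, per application.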
> So, I don't intend to argue the technical minutia of each design
> point, but I challenge you to make sure that we (1) arrive at a simple
> system that any OpenStack user can comprehend, (2) responds quickly to
> alarm stimulus, (3) is unlikely to fail, (4) can be easily customized
> with user-supplied logic that controls how the scaling happens, and
> under what conditions.
I disagree with (3); systems should be designed to cope gracefully in
the event of their _inevitable_ failure.
> It would be better if we could explain Autoscale like this:
> Heat -> Autoscale -> Nova, etc.
> -or-
> User -> Autoscale -> Nova, etc.
Let's explain it like that then. The use of Heat by the autoscaling
back-end is entirely an implementation detail, and the user should never
need to know about it. It was mentioned only because this was a thread
about implementation details.
> This approach allows use cases where (for whatever reason) the end
> user does not want to use Heat at all, but still wants something
> simple to be auto-scaled for them. Nobody would be scratching their
> heads wondering why things are going in circles.
It's irrelevant to the user whether the cloud operator implements
autoscaling with Heat or not.
> From an implementation perspective, that means the auto-scale service
> needs at least a simple linear workflow capability in it that may
> trigger a Heat orchestration if there is a good reason for it. This
> way, the typical use cases don't have anything resembling circular
> dependencies. The source of truth for how many members are currently
> in an Autoscaling group should be the Autoscale service, not in the
> Heat database. If you want to expose that in list-stack-resources
> output, then cause Heat to call out to the Autoscale service to fetch
> that figure as needed. It is irrelevant to orchestration. Code does
> not need to be duplicated. Both Autoscale and Heat can use the same
> exact source code files for the code that launches/terminates
> instances of resources.
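Concretely, the Heat side of that call-out could be as small as parsing the Autoscale service's response (the endpoint and field names here are hypothetical, purely to illustrate the division of responsibility):

```python
import json

def parse_group_size(body: str) -> int:
    """Extract the current member count from a (hypothetical) Autoscale
    API response. Heat would surface this in list-stack-resources
    output instead of keeping its own copy in the Heat database."""
    return json.loads(body)["current_size"]

# e.g. the body of GET /v1/groups/<id> on the Autoscale service
# (a made-up route for illustration):
sample = '{"group_id": "web", "current_size": 4}'
print(parse_group_size(sample))  # -> 4
```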
So, it sounds like you want to incorporate the Heat code in Autoscaling
by loading it as a library instead of using it as a service?
I guess that's pretty much what we do now, but going down this path
means that the code will be forever stuck in the same project (i.e.
repository), and we would lose the option to split Autoscaling out as a
separate project within the Orchestration program.
Secondly, interacting with systems only via defined and tested APIs
reduces the entropy of the resulting system compared with direct access
to the internals. It's the difference between complex systems and
complicated ones. So IMO this idea fails the tests that you set for it,
for a gain of... 30ms of latency?
cheers,
Zane.
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev