On 11/09/13 05:51, Adrian Otto wrote:
> I have a different point of view. First I will offer some assertions:

It's not clear to me what you actually take issue with. (Top-posting is not helping in this respect.)

> A-1) We need to keep it simple.
>         A-1.1) Systems that are hard to comprehend are hard to debug, and that's bad.

Absolutely, and systems with higher entropy are harder to comprehend.

>         A-1.2) Complex systems tend to be much more brittle than simple ones.

"The Zen of Python" has it right here:

    Simple is better than complex.
    Complex is better than complicated.

Complicated systems have a lot of entropy. Complex systems (that is to say, systems composed of multiple simpler systems) are actually a tool for _reducing_ entropy.

> A-2) Scale-up operations need to be as-fast-as-possible.
>         A-2.1) Auto-Scaling only works right if your new capacity is added quickly when your controller detects that you need more. You can't spend a bunch of time goofing around before actually adding a new resource to a pool when it's under strain.
>         A-2.2) The fewer network round trips between "add-more-resources-now" and "resources-added" the better. Fewer = less brittle.

I submit that the difference between a packet round-trip time within a single datacenter and the time to boot a Nova server is at least 3 orders of magnitude (sub-millisecond versus tens of seconds).

> A-3) The control logic for scaling different applications varies.
>         A-3.1) What metrics are watched may differ between various use cases.
>         A-3.2) The data types that represent sensor data may vary.
>         A-3.3) The policy that's applied to the metrics (such as max, min, and cooldown period) varies between applications. Not only do the values vary, but also the logic itself.
>         A-3.4) A scaling policy may not just be a handful of simple parameters. Ideally it allows configurable logic that the end-user can control to some extent.

> A-4) Auto-scale operations are usually not orchestrations. They are usually simple linear workflows.

Well, one of the things Chris wants to do with this is to scale whole templates instead of just Nova servers.

>         A-4.1) The Taskflow project[1] offers a simple way to do workflows and stable state management that can be integrated directly into Autoscale.
>         A-4.2) A task flow (workflow) can trigger a Heat orchestration if needed.

If you're re-proposing Chris's original thought of having two different ways to do autoscaling, depending on whether it's for individual instances or whole templates, then I fail to see how that is in any sense simpler than having only one way that handles everything.
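
For reference, a linear Taskflow flow really is simple to express. A minimal sketch (the task and every name in it are invented for illustration; nothing here exists in Heat or Autoscale):

    import taskflow.engines
    from taskflow import task
    from taskflow.patterns import linear_flow

    class AddServerToPool(task.Task):
        """Hypothetical scale-up step: boot one server, return its ID."""
        default_provides = 'server_id'

        def execute(self, group_name):
            # A real task would call Nova here; this one just pretends.
            print('scaling up group %s' % group_name)
            return 'server-id-placeholder'

    flow = linear_flow.Flow('scale-up').add(AddServerToPool())
    results = taskflow.engines.run(flow, store={'group_name': 'web'})
    print(results['server_id'])

But that was never in question. The question is whether expressing it twice, in two different services, makes anything simpler.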

> Now a mental tool to think about control policies:

> Auto-scaling is like steering a car. The control policy says that you want to drive equally between the two lane lines, and that if you drift off center, you gradually correct back toward center again. If the road bends, you try to remain in your lane as the lane lines curve. You try not to weave around in your lane, and you try not to drift out of the lane.

OK, in the sense that both are proportional control systems, sure. (Though in autoscaling, unlike the car, both the feedback loop and the response have significant non-linearities.)

> If your controller notices that you are about to drift out of your lane because the road is starting to bend, and you are distracted, or your hands slip off the wheel, you might drift out of your lane into nearby traffic. That's why you don't want a Rube Goldberg Machine[2] between you and the steering wheel. See assertions A-1 and A-2.

But you probably do want a power steering device between the wheel and the steering rack. I think this metaphor is ready for the scrapheap ;)

There was (IMHO) a Rube Goldberg-like device proposed in this thread, but not by me :D

> If you are driving an 18-wheel tractor/trailer truck, steering is different from driving a Fiat. You need to wait longer and steer toward the outside of curves so your trailer does not lag on the inside of the curve as you correct for a bend in the road. When you are driving the Fiat, you may want to aim for the middle of the lane at all times, possibly even apexing bends to reduce your driving distance, which is actually the opposite of what truck drivers need to do. Control policies apply to other parts of driving too. I want a different policy for braking than I use for steering. On some vehicles I go through a gear-shifting workflow, and on others I don't. See assertion A-3.

Right, PID control systems are more general.
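
A textbook discrete PID loop is only a dozen lines, after all. A sketch with made-up gains, purely for illustration:

    class PIDController(object):
        """Classic proportional-integral-derivative controller."""

        def __init__(self, kp, ki, kd, setpoint):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.setpoint = setpoint
            self.integral = 0.0
            self.prev_error = None

        def update(self, measurement, dt):
            # Sign convention: running hot (measurement above setpoint)
            # yields a positive output, i.e. "add capacity".
            error = measurement - self.setpoint
            self.integral += error * dt
            if self.prev_error is None:
                derivative = 0.0
            else:
                derivative = (error - self.prev_error) / dt
            self.prev_error = error
            return (self.kp * error + self.ki * self.integral +
                    self.kd * derivative)

    # e.g. hold average CPU across the group at 60%:
    controller = PIDController(kp=0.1, ki=0.01, kd=0.05, setpoint=60.0)
    print(controller.update(measurement=85.0, dt=60.0))

The loop isn't the hard part, though; as I said, in autoscaling both the measurement and the response are quantized and laggy.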

The idea of allowing the user to substitute their own scaling policy engine has been on the road map ever since you and others raised it at Summit, though, and it's orthogonal to the parts of the design you're questioning below. So I'm not really sure what you're, uh, driving at (no pun intended).

> So, I don't intend to argue the technical minutiae of each design point, but I challenge you to make sure that we arrive at a system that (1) is simple enough for any OpenStack user to comprehend, (2) responds quickly to alarm stimuli, (3) is unlikely to fail, and (4) can be easily customized with user-supplied logic that controls how the scaling happens, and under what conditions.

I disagree with (3); systems should be designed to cope gracefully in the event of their _inevitable_ failure.

> It would be better if we could explain Autoscale like this:
>
> Heat -> Autoscale -> Nova, etc.
> -or-
> User -> Autoscale -> Nova, etc.

Let's explain it like that then. The use of Heat by the autoscaling back-end is entirely an implementation detail, and the user should never need to know about it. It was mentioned only because this was a thread about implementation details.

> This approach allows use cases where (for whatever reason) the end user does not want to use Heat at all, but still wants something simple to be auto-scaled for them. Nobody would be scratching their heads wondering why things are going in circles.

It's irrelevant to the user whether the cloud operator implements autoscaling with Heat or not.

> From an implementation perspective, that means the auto-scale service needs at least a simple linear workflow capability in it that may trigger a Heat orchestration if there is a good reason for it. This way, the typical use cases don't have anything resembling circular dependencies. The source of truth for how many members are currently in an Autoscaling group should be the Autoscale service, not the Heat database. If you want to expose that in list-stack-resources output, then have Heat call out to the Autoscale service to fetch that figure as needed. It is irrelevant to orchestration. Code does not need to be duplicated. Both Autoscale and Heat can use the exact same source code files for the code that launches/terminates instances of resources.
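
If I follow, the first half of that would look roughly like this. Every name below is hypothetical; no such client or resource exists today, it's just a sketch of the proposed division of labour:

    class AutoscaleGroupResource(object):
        """Imagined Heat resource that delegates membership to Autoscale."""

        def __init__(self, autoscale_client, group_id):
            # 'autoscale_client' stands in for a client of the proposed
            # Autoscale API, which would be the single source of truth.
            self.client = autoscale_client
            self.group_id = group_id

        def list_members(self):
            # Heat stores nothing about membership; it fetches the
            # current list from the Autoscale service on demand.
            return self.client.get_group(self.group_id)['members']

That half seems straightforward enough. It's the shared source files that I want to unpack: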

So, it sounds like you want to incorporate the Heat code in Autoscaling by loading it as a library instead of using it as a service?

I guess that's pretty much what we do now, but going down this path means that the code will be forever stuck in the same project (i.e. repository), and we would lose the option to split Autoscaling out as a separate project within the Orchestration program.

Secondly, interacting with systems only via defined and tested APIs reduces the entropy of the resulting system compared with direct access to the internals. It's the difference between complex systems and complicated ones. So IMO this idea fails the tests that you set for it, for a gain of... 30ms of latency?

cheers,
Zane.
