Thanks for starting this, Joe.

I think that we need to address the operator and user experience by
improving the consistency and stability of OpenStack overall.  Here are
five ways of doing that:

1) Improve log correlation and utility

If we're going to improve the stability of OpenStack, we have to be
able to understand what's going on when it breaks.  That's true both
for developers trying to diagnose a failure in an integration test
and for operators who are all too often diagnosing the same failure
in a real deployment.  Consistency in logging across projects as well
as a cross-project request token would go a long way toward this.
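
To make the request-token idea concrete, here is a rough sketch using
only the Python standard library.  It is not how any OpenStack project
does this today; the names (request_id_var, RequestIdFilter,
make_logger, handle_request) and the "req-<uuid>" token format are
assumptions made up for illustration.  The point is just that every
service stamps the same token on every log line, in the same format:

```python
import contextvars
import io
import logging
import uuid

# Hypothetical holder for the token of the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request token to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def make_logger(name, stream):
    """Build a logger that uses one shared line format (an assumed format)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(name)s [%(request_id)s] %(levelname)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(loggers, request_id=None):
    """Simulate one request passing through several services."""
    # Reuse the caller's token if one was passed in; otherwise mint one.
    token = request_id_var.set(request_id or "req-" + str(uuid.uuid4()))
    try:
        for logger in loggers:
            logger.info("handling request")
        return request_id_var.get()
    finally:
        request_id_var.reset(token)


if __name__ == "__main__":
    out = io.StringIO()
    services = [make_logger("service_a", out), make_logger("service_b", out)]
    handle_request(services)
    print(out.getvalue(), end="")
```

With something like this in place, finding everything that happened to
one request across projects becomes a single grep for its token.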

2) Improve API consistency

As projects are becoming more integrated (which is happening at least
partially as we move functionality _out_ of previously monolithic
projects), the APIs between them become more important.  We keep
generating APIs with different expectations that behave in very
different ways across projects.  We need to standardize on API
behavior and expectations, for the sake of developers of OpenStack who
are increasingly using them internally, but even more so for our users
who expect a single API and are bewildered when they get dozens.

3) A real SDK

OpenStack is so nearly impossible to use that we have a substantial
amount of code in the infrastructure program to do things that,
frankly, we are a bit surprised the client libraries don't do.
Just getting an instance with an IP address is an enormous challenge,
and something that took us years to get right.  We still have problems
deleting instances.  We need client libraries (an SDK if you will) and
command line clients that are easy for users to understand and work
with, and hide the gory details of how the sausage is made.

In OpenStack, we have chosen to let a thousand flowers bloom and
deployers have a wide array of implementation options available.
However, it's unreasonable to expect all of our users to understand
all of the implications of all of those choices.  Our SDK must help
users deal with that complexity.

4) Reliability

Parts of OpenStack break all the time.  In general, we accept that the
environment a cloud operates in can be unreliable (we design for
failure).  However, failure should be the exception, not the norm.  Our
current failure modes and rates are hurting everyone -- developers
merging changes in the gate, operators in continual fire-fighting
mode, and users who have to handle and recover from every kind of
internal error that OpenStack externalizes.  We need to focus on
making OpenStack itself operate reliably.

5) Functional testing

We've hit the limit of what we can reasonably accomplish by putting
all of our testing efforts into cross-project integration testing.
Instead, we need to functionally test individual projects much more
strongly, so that we can reserve integration testing (which is much
more complicated) for catching real "integration" bugs rather than
expecting it to catch all functional bugs.  To that end, we should help
projects focus on robust functional testing in the Kilo cycle.


OpenStack-dev mailing list