Thanks Sean for writing up this report, greatly appreciated.
Comments inline.

On 11/03/2015 13:59, Sean Dague wrote:
The last couple of days I was at the Operators Meetup acting as Nova
rep for the meeting. All the sessions were quite nicely recorded to
etherpads here -

There was both a specific Nova session - as well as a
bunch of relevant pieces of information in other sessions.

This is an attempt at a summary. Anyone else who was in
attendance, please feel free to correct me if I'm interpreting something
incorrectly. There was a lot of content there, so this is in no way a
comprehensive list, just the highlights that I think make the most
sense for the Nova team.

  Nova Network -> Neutron

This remains listed as the #1 issue from the Operator Community on
their burning issues list
( L18). During
the tags conversation we straw polled the audience
( L45) and about 75% of
attendees were over on neutron already. However those on Nova Network
were disproportionately the largest clusters and longest standing
OpenStack users.

Of those on nova-network about 1/2 had no interest in being on
Neutron (
L24). Some of the primary reasons were the following:

- Complexity concerns - neutron has a lot more moving parts
- Performance concerns - nova multihost means there is very little
   between guests and the fabric, which is really important for the HPC
   workload use case for OpenStack.
- Don't want OVS - ovs adds additional complexity, and performance
   concerns. Many large sites are moving off ovs back to linux bridge
   with neutron because they are hitting OVS scaling limits (especially
   if on UDP) - ( L142)

The biggest disconnect in the model seems to be that Neutron assumes
you want self service networking. Most of these deploys don't. Or even
more importantly, they live in an organization where that is never
going to be an option.

Neutron provider networks are close, except they don't provide for
floating IP / NAT.

Going forward: I think the gap analysis probably needs to be revisited
with some of the vocal large deployers. I think we assumed the
functional parity gap was closed with DVR, but it's not clear that in
its current format it actually meets the n-net multihost users' needs.

  EC2 going forward

Having a sustainable EC2 is of high interest to the operator
community. Many large deploys have some users that were using AWS
prior to using OpenStack, or currently are using both. They have
preexisting tooling for that.

There didn't seem to be any objection to the approach of an external
proxy service for this function -
( L111). Mostly
the question is timing, and the fact that no one has validated the
stackforge project. The fact that we landed everything people need to
run this in Kilo is good, as these production deploys will be able to
test it for their users when they upgrade.

  Burning Nova Features/Bugs

Hierarchical Projects Quotas

Hugely desired feature by the operator community
( L116). Missed
Kilo. This made everyone sad.

Action: we should queue this up as early Liberty priority item.

Out of sync Quotas
------------------ L63

The quotas code is quite racy (this is kind of a known issue if you
look at the bug tracker). It was actually marked as a top soft spot
during last fall's bug triage -

There is an operator proposed spec for an approach here -

Action: we should make a solution here a top priority for enhanced
testing and fixing in Liberty. Addressing this would remove a lot of
pain from ops.
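
For readers unfamiliar with this class of bug, here is a minimal,
generic sketch of a check-then-update race on a usage counter. It is
not Nova's quota code: `QuotaCounter`, `reserve_racy` and
`reserve_atomic` are hypothetical names, and an in-process lock stands
in for whatever serialization a real fix would use.

```python
import threading


class QuotaCounter:
    """Toy usage counter (hypothetical, not Nova's quota code)."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0
        self._lock = threading.Lock()

    def reserve_racy(self, amount):
        # The check and the update are separate steps: two callers can
        # both pass the check before either one records its usage, so
        # `used` can drift past `limit`.
        if self.used + amount <= self.limit:
            self.used += amount
            return True
        return False

    def reserve_atomic(self, amount):
        # Holding the lock makes the check and the update one step.
        with self._lock:
            if self.used + amount <= self.limit:
                self.used += amount
                return True
            return False
```

With `reserve_atomic`, concurrent callers can never push `used` past
`limit`; with `reserve_racy` that guarantee is lost under contention.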

Reporting on Scheduler Fails

Apparently, some time recently, we stopped logging scheduler fails
above DEBUG, and that behavior also snuck back into Juno as well
( L78). This
has made tracking down root cause of failures far more difficult.

Action: this should hopefully be a quick fix we can get in for Kilo
and backport.
It's unfortunate that failed scheduling attempts only produce an INFO
log. A quick fix could be to at least turn the level up to WARN so that
failures are noticed more easily (including the whole filter stack
with its results).

That said, I'm pretty much against any proposal that would expose those
specific details (i.e. the number of hosts passing each filter) in an
API endpoint, because it would also expose the underlying
infrastructure capacity and make it easier to discover DoS
opportunities.

A workaround could be to include in the ERROR message only the name of
the filter that rejected the request, so operators could very easily
match what the user is reporting with what they see in the scheduler
logs.

Does that work for people? I can provide changes for both.
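
To make the idea concrete, here is a rough sketch of the logging
behavior I have in mind. It is not the actual Nova scheduler code:
`run_filters`, the (name, predicate) filter pairs and the request dict
are hypothetical and only there for illustration. Per-filter host
counts stay at DEBUG in the operator's log, and only the name of the
filter that rejected every remaining host is surfaced at WARNING.

```python
import logging

LOG = logging.getLogger("scheduler_sketch")


def run_filters(filters, hosts, request):
    """Apply (name, predicate) filters in order.

    Returns (surviving_hosts, failed_filter_name). The second element
    is None when at least one host passes every filter.
    """
    remaining = list(hosts)
    for name, predicate in filters:
        survivors = [h for h in remaining if predicate(h, request)]
        # Host counts are capacity details: keep them in the log at
        # DEBUG and never hand them back to the API caller.
        LOG.debug("Filter %s: %d -> %d hosts",
                  name, len(remaining), len(survivors))
        if not survivors:
            # Only the filter name is reported at the higher level.
            LOG.warning("Scheduling failed for request %s: "
                        "filter %s rejected all hosts",
                        request.get("id"), name)
            return [], name
        remaining = survivors
    return remaining, None
```

An operator could then grep the scheduler log for the request id and
see which filter failed, without any capacity numbers leaving the log.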


  Additional Interesting Bits


There was a whole session on Rabbit -

Rabbit is a top operational concern for most large sites. Almost all
sites have a "restart everything that talks to rabbit" script because
during Rabbit HA operations queues tend to blackhole.

All other queue systems OpenStack supports are worse than Rabbit (from
experience in that room).

oslo.messaging < 1.6.0 was a significant regression in dependability
from the incubator code. It now seems to be getting better, but there
are still a lot of issues. (L112)

Operators *really* want the concept in landed. (I asked them to
provide such feedback in gerrit).

Nova Rolling Upgrades

Most people really like the concept, but I couldn't find anyone who had
used it yet because Neutron doesn't support it, so they had to do big
bang upgrades anyway.

Galera Upstream Testing

The majority of deploys run with Galera MySQL. There was a question
about whether or not we could get that into the upstream testing
pipeline, as that's the common case.

