On 09/14/2015 02:41 PM, Flavio Percoco wrote:
On 14/09/15 08:10 -0400, Doug Hellmann wrote:

After having some conversations with folks at the Ops Midcycle a
few weeks ago, and observing some of the more recent email threads
related to glance, glance-store, the client, and the API, I spent
last week contacting a few of you individually to learn more about
some of the issues confronting the Glance team. I had some very
frank, but I think constructive, conversations with all of you about
the issues as you see them. As promised, this is the public email
thread to discuss what I found, and to see if we can agree on what
the Glance team should be focusing on going into the Mitaka summit
and development cycle and how the rest of the community can support
you in those efforts.

I apologize for the length of this email, but there's a lot to go
over. I've identified 2 high priority items that I think are critical
for the team to be focusing on starting right away in order to use
the upcoming summit time effectively. I will also describe several
other issues that need to be addressed but that are less immediately
critical. First the high priority items:

1. Resolve the situation preventing the DefCore committee from
  including image upload capabilities in the tests used for trademark
  and interoperability validation.

2. Follow through on the original commitment of the project to
  provide an image API by completing the integration work with
  nova and cinder to ensure V2 API adoption.

Hi Doug,

First and foremost, I'd like to thank you for taking the time to dig
into these issues, and for reaching out to the community seeking
information and a better understanding of what the real issues are. I
can imagine how much time you had to dedicate to this and I'm glad you
did.

Ditto. Thanks so much for the work Doug!

Now, to your email, I very much agree with the priorities you
mentioned above and I'd like whoever wins Glance's PTL election to
bring the focus back to them.

Please, find some comments in-line for each point:



I. DefCore

The primary issue that attracted my attention was the fact that
DefCore cannot currently include an image upload API in its
interoperability test suite, and therefore we do not have a way to
ensure interoperability between clouds for users or for trademark
use. The DefCore process has been long, and at times confusing,
even to those of us following it sort of closely. It's not entirely
surprising that some projects haven't been following the whole time,
or aren't aware of exactly what the whole thing means. I have
proposed a cross-project summit session for the Mitaka summit to
address this need for communication more broadly, but I'll try to
summarize a bit here.

+1

I think it's quite sad that some projects, especially those considered
to be part of the `starter-kit:compute`[0], don't follow closely
what's going on in DefCore. I personally consider this a task PTLs
should incorporate in their role duties. I'm glad you proposed such a
session; I hope it'll help raise awareness of this effort and help
move things forward on that front.



DefCore is using automated tests, combined with business policies,
to build a set of criteria for allowing trademark use. One of the
goals of that process is to ensure that all OpenStack deployments
are interoperable, so that users who write programs that talk to
one cloud can use the same program with another cloud easily. This
is a *REST API* level of compatibility. We cannot insert cloud-specific
behavior into our client libraries, because not all cloud consumers
will use those libraries to talk to the services. Similarly, we
can't put the logic in the test suite, because that defeats the
entire purpose of making the APIs interoperable. For this level of
compatibility to work, we need well-defined APIs, with a long support
period, that work the same no matter how the cloud is deployed. We
need the entire community to support this effort. From what I can
tell, that is going to require some changes to the current Glance
API to meet the requirements. I'll list those requirements, and I
hope we can discuss them to a degree that ensures everyone understands
them. I don't want this email thread to get bogged down in
implementation details or API designs, though, so let's try to keep
the discussion at a somewhat high level, and leave the details for
specs and summit discussions. I do hope you will correct any
misunderstandings or misconceptions, because unwinding this as an
outside observer has been quite a challenge and it's likely I have
some details wrong.

As I understand it, there are basically two ways to upload an image
to glance using the V2 API today. The "POST" API pushes the image's
bits through the Glance API server, and the "task" API instructs
Glance to download the image separately in the background. At one
point apparently there was a bug that caused the results of the two
different paths to be incompatible, but I believe that is now fixed.
However, the two separate APIs each have different issues that make
them unsuitable for DefCore.
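
For illustration, here is a rough sketch of the request shapes behind the
two paths, as I understand the V2 API reference. Nothing is sent over the
network. The host name is hypothetical, and the image properties are example
values only. Note that in V2 the direct path is really two calls: a POST that
creates the image record, followed by a PUT that carries the bits.

```python
import json

# Hypothetical endpoint, used only for illustration.
GLANCE = "https://glance.example.com"

# Example values only; a real caller would choose its own.
EXAMPLE_PROPERTIES = {"disk_format": "qcow2", "container_format": "bare"}


def direct_upload_requests(name, data):
    """Path 1: the image bits travel through the Glance API server."""
    create = {
        "method": "POST",
        "url": GLANCE + "/v2/images",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(dict(EXAMPLE_PROPERTIES, name=name)),
    }
    upload = {
        "method": "PUT",
        # {image_id} is a placeholder for the id returned by the create call.
        "url": GLANCE + "/v2/images/{image_id}/file",
        "headers": {"Content-Type": "application/octet-stream"},
        "body": data,
    }
    return [create, upload]


def task_import_request(name, source_url):
    """Path 2: ask Glance to fetch the image itself, in the background."""
    task = {
        "type": "import",
        "input": {
            "import_from": source_url,
            "import_from_format": "qcow2",
            "image_properties": dict(EXAMPLE_PROPERTIES, name=name),
        },
    }
    return {
        "method": "POST",
        "url": GLANCE + "/v2/tasks",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(task),
    }
```

The direct path has a fixed, documented shape. In the task path, everything
inside "input" is a free-form blob that the caller has to know how to build.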

The DefCore process relies on several factors when designating APIs
for compliance. One factor is the technical direction, as communicated
by the contributor community -- that's where we tell them things
like "we plan to deprecate the Glance V1 API". In addition to the
technical direction, DefCore looks at the deployment history of an
API. They do not want to require deploying an API if it is not seen
as widely usable, and they look for some level of existing adoption
by cloud providers and distributors as an indication that the
API is desired and can be successfully used. Because we have multiple
upload APIs, the message we're sending on technical direction is
weak right now, and so they have focused on deployment considerations
to resolve the question.

The task upload process you're referring to is the one that uses the
`import` task, which allows you to download an image from an external
source, asynchronously, and import it in Glance. This is the old
`copy-from` behavior that was moved into a task.

The "fun" thing about this - and I'm sure other folks in the Glance
community will disagree - is that I don't consider tasks to be a
public API. That is to say, I would expect tasks to be an internal API
used by cloud admins to perform some actions (based on its current
implementation). Eventually, some of these tasks could be triggered
from the external API but as background operations that are triggered
by the well-known public ones and not through the task API.

Ultimately, I believe end-users of the cloud simply shouldn't care
about what tasks are or aren't and more importantly, as you mentioned
later in the email, tasks make clouds not interoperable. I'd be pissed
if my public image service would ask me to learn about tasks to be
able to use the service.

Long story short, I believe the only upload API that should be
considered is the one that uses HTTP and, eventually, to bring
compatibility with v1 as far as the copy-from behavior goes, Glance
could bring back that behavior on top of the task (just dropping this
here for the sake of discussion and interoperability).

Yes. 1000x yes.

The POST API is enabled in many public clouds, but not consistently.
In some clouds like HP, a tenant requires special permission to use
the API. At least one provider, Rackspace, has disabled the API
entirely. This is apparently due to what seems like a fair argument
that uploading the bits directly to the API service presents a
possible denial of service vector. Without arguing the technical
merits of that decision, the fact remains that without a strong
consensus from deployers that the POST API should be publicly and
consistently available, it does not meet the requirements to be
used for DefCore testing.

This is definitely unfortunate. I believe a good step forward for this
discussion would be to create a list of issues related to uploading
images and see how those issues can be addressed. The result from that
work might be that it's not recommended to make that endpoint public
but again, without going through the issues, it'll be hard to
understand how we can improve this situation. I expect most of these
issues to have a security impact.


The task API is also not widely deployed, so its adoption for DefCore
is problematic. If we provide a clear technical direction that this
API is preferred, that may overcome the lack of adoption, but the
current task API seems to have technical issues that make it
fundamentally unsuitable for DefCore consideration. While the task
API addresses the problem of a denial of service, and includes
useful features such as processing of the image during import, it
is not strongly enough defined in its current form to be interoperable.
Because it's a generic API, the caller must know how to fully
construct each task, and know what task types are supported in the
first place. There is only one "import" task type supported in the
Glance code repository right now, but it is not clear that "import"
always uses the same arguments, or interprets them in the same way.
For example, the upstream documentation [1] describes a task that
appears to use a URL as source, while the Rackspace documentation [2]
describes a task that appears to take a swift storage location.
I wasn't able to find JSONSchema validation for the "input" blob
portion of the task in the code [3], though that may happen down
inside the task implementation itself somewhere.
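
To make the concern concrete, here is a minimal sketch of what a strict,
deployment-independent definition of the "import" input could look like.
The key names follow the example in the upstream API reference [1]. The
rules themselves (required keys, http(s) sources only) and the function
name are assumptions made for illustration, not anything Glance does today.

```python
# Hypothetical strict definition of the "import" task input.
REQUIRED_KEYS = {"import_from", "import_from_format", "image_properties"}
ALLOWED_SCHEMES = ("http://", "https://")


def validate_import_input(task_input):
    """Return a list of problems; an empty list means the input is accepted."""
    if not isinstance(task_input, dict):
        return ["input must be a JSON object"]
    problems = []
    for key in sorted(REQUIRED_KEYS - set(task_input)):
        problems.append("missing key: " + key)
    for key in sorted(set(task_input) - REQUIRED_KEYS):
        problems.append("unknown key: " + key)
    source = task_input.get("import_from")
    if isinstance(source, str) and not source.startswith(ALLOWED_SCHEMES):
        problems.append("import_from must be an http(s) URL")
    return problems


# Shaped like the upstream documentation example (URL source).
upstream_style = {
    "import_from": "http://example.com/cirros.qcow2",
    "import_from_format": "qcow2",
    "image_properties": {"name": "cirros", "disk_format": "qcow2",
                         "container_format": "bare"},
}

# Shaped like a container/object location instead of a URL.
swift_style = {
    "import_from": "exampleContainer/cirros.vhd",
    "image_properties": {"name": "cirros"},
}

print(validate_import_input(upstream_style))  # []
print(validate_import_input(swift_style))
```

Under these assumed rules the second input is rejected for a missing
import_from_format and a non-URL source. A caller written against one
deployment's idea of "import" cannot be expected to work against another's.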


The above sounds pretty accurate as there's currently just 1 flow that
can be triggered (the import flow) and that accepts an input, which is
a JSON document. As I mentioned above, I don't believe tasks should be
part of the public API and this is yet another reason why I think so.
The tasks API is not well defined as there's currently no good way to
define the expected input in a backwards compatible way and to provide
all the required validation.

I like having tasks in Glance, despite my comments above - but I like
them for cloud usage and not public usage.

I like them much more if they're not public facing. They're not BAD - they just don't have an end-user semantic.

As far as Rackspace's docs/endpoint goes, I'd assume this is an error
in their documentation since Glance currently doesn't allow[0] for
swift URLs to be imported (not even in juno[1]).

[0]
http://git.openstack.org/cgit/openstack/glance/tree/glance/common/scripts/utils.py#n84

[1]
http://git.openstack.org/cgit/openstack/glance/tree/glance/common/scripts/utils.py?h=stable/juno#n83

Nope. You MUST upload the image to swift and then provide a swift location. (Infra does this in production, I promise it's the only thing that works)

Tasks also come from plugins, which may be installed differently
based on the deployment. This is an interesting approach to creating
API extensions, but isn't discoverable enough to write interoperable
tools against. Most of the other projects are starting to move away
from supporting API extensions at all because of interoperability
concerns they introduce. Deployers should be able to configure their
clouds to perform well, but not to behave in fundamentally different
ways. Extensions are just that, extensions. We can't rely on them
for interoperability testing.

This is, indeed, an interesting interpretation of what tasks are for.
I'd probably just blame us (Glance team) for not communicating
properly what tasks are meant to be. I don't believe tasks are a way
to extend the *public* API and I'd be curious to know if others see it
that way. I fully agree that just breaks interoperability and as I've
mentioned a couple of times in this reply already, I don't even think
tasks should be part of the public API.

But again, we did a very poor job of communicating that[0]. Nonetheless,
for the sake of providing enough information about tasks and sources to
read from, I'd also like to point out the original blueprint[1], some
discussions during the Havana summit[2], the wiki page for tasks[3]
and a patch I just reviewed today (thanks Brian) that introduces docs
for tasks[4]. These links already show some differences in what tasks
are.

[0]
http://git.openstack.org/cgit/openstack/glance/tree/etc/policy.json?h=stable/juno#n28

[1] https://blueprints.launchpad.net/glance/+spec/async-glance-workers
[2] https://etherpad.openstack.org/p/havana-glance-requirements
[3] https://wiki.openstack.org/wiki/Glance-tasks-api
[4] https://review.openstack.org/#/c/220166/


There is a lot of fuzziness around exactly what is supported for
image upload, both in the documentation and in the minds of the
developers I've spoken to this week, so I'd like to take a step
back and try to work through some clear requirements, and then we
can have folks familiar with the code help figure out if we have a
real issue, if a minor tweak is needed, or if things are good as
they stand today and it's all a misunderstanding.

1. We need a strongly defined and well documented API, with arguments
  that do not change based on deployment choices. The behind-the-scenes
  behaviors can change, but the arguments provided by the caller
  must be the same and the responses must look the same. The
  implementation can run as a background task rather than receiving
  the full image directly, but the current task API is too vaguely
  defined to meet this requirement, and IMO we need an entry point
  focused just on uploading or importing an image.

2. Glance cannot require having a Swift deployment. It's not clear
  whether this is actually required now, so if it's not then we're
  in a good state.

This is definitely not the case. Glance doesn't require any specific
store to be deployed. It does require at least one other than the http
one (because it doesn't support write operations).

Awesome.

It's fine to provide an optional way to take
  advantage of Swift if it is present, but it cannot be a required
  component. There are three separate trademark "programs", with
  separate policies attached to them. There is an umbrella "Platform"
  program that is intended to include all of the TC approved release
  projects, such as nova, glance, and swift. However, there is
  also a separate "Compute" program that is intended to include
  Nova, Glance, and some others but *not* Swift. This is an important
  distinction, because there are many use cases both for distributors
  and public cloud providers that do not incorporate Swift for a
  variety of reasons. So, we can't have Glance's primary configuration
  require Swift and we need to provide tests for the DefCore team
  that run without Swift. Duplicate tests that do use Swift are
  fine, and might be used for "Platform" compliance tests.

3. We need an integration test suite in tempest that fully exercises
  the public image API by talking directly to Glance. This applies
  to the entire API, not just image uploads. It's fine to have
  duplicate tests using the proxy in Nova if the Nova team wants
  those, but DefCore should be using tests that talk directly to
  the service that owns each feature, without relying on any
  proxying. We've already missed the chance to deal with this in
  the current DefCore definition, which uses image-related tests
  that talk to the Nova proxy [4][5], so we'll have to maintain
  the proxy for the required deprecation period. But we won't be
  able to consider removing that proxy until we provide alternate
  tests for those features that speak directly to Glance. We may
  have some coverage already, but I wasn't able to find a task-based
  image upload test and there is no "image create" mentioned in
  the current draft of capabilities being reviewed [6]. There may
  be others missing, so someone more familiar with the feature set
  of Glance should do an audit and document what tests are needed
  so the work can be split up.


+1 This should become one of the top priorities for Mitaka (as you
mentioned at the beginning of this email).

++

4. Once identified and incorporated into the DefCore capabilities
  set, the selected API needs to remain stable for an extended
  period of time and follow the deprecation timelines defined by
  DefCore.  That has implications for the V3 API currently in
  development to turn Glance into a more generic artifacts service.
  There are a lot of ways to handle those implications, and no
  choice needs to be made today, so I only mention it to make sure
  it's clear that (a) we must get V2 into shape for DefCore and
  (b) when that happens, we will need to maintain V2 even if V3
  is finished. We won't be able to deprecate V2 quickly.

Now, it's entirely possible that we can meet all of those requirements
today, and that would be great. If that's the case, then the problem
is just one of clear communication and documentation. I think there's
probably more work to be done than that, though.


There's clearly a communication problem. The fact that this very email
has been sent out is a sign of that. However, I'd like to say, in a
very optimistic way, that Glance is not so far away from the expected
status. There are things to fix, other things to clarify, tons to
discuss but, IMHO, besides the tempest tests and DefCore, the most
critical one is the one you mentioned in the following section.


[1] http://developer.openstack.org/api-ref-image-v2.html#os-tasks-v2
[2]
http://docs.rackspace.com/images/api/v2/ci-devguide/content/POST_importImage_tasks_Image_Task_Calls.html#d6e4193

[3]
http://git.openstack.org/cgit/openstack/glance/tree/glance/api/v2/tasks.py

[4] http://git.openstack.org/cgit/openstack/defcore/tree/2015.05.json#n70
[5]
http://git.openstack.org/cgit/openstack/defcore/tree/doc/source/guidelines/2015.07.rst

[6] https://review.openstack.org/#/c/213353/

II. Complete Cinder and Nova V2 Adoption

The Glance team originally committed to providing an Image Service
API. Besides our end users, both Cinder and Nova consume that API.
The shift from V1 to V2 has been a long road. We're far enough
along, and the V1 API has enough issues preventing us from using
it for DefCore, that we should push ahead and complete the V2
adoption. That will let us properly deprecate and drop V1 support,
and concentrate on maintaining V2 for the necessary amount of time.

There are a few specs for the work needed in Nova, but that work
didn't land in Liberty for a variety of reasons. We need resources
from both the Glance and Nova teams to work together to get this
done as early as possible in Mitaka to ensure that it actually lands
this time. We should be able to schedule a joint session at the
summit to have the conversation, and we need to take advantage of
that opportunity to ensure the details are fully resolved so that
everyone understands the plan.

Super important point. I'd like people replying to this email to focus
on what we can do next and not why this hasn't been done. The latter
will take us down a path that won't be useful at all, as it'll just
waste everyone's time.

++

That said, I fully agree with the above. Last time we talked, John
Garbutt and Jay Pipes, from the nova team, raised their hands to help
out with this effort. From Glance's side, Fei Long Wang and I were
working on the implementation. To help move this forward and to follow
through on the latest plan, which allows this migration to be smoother
than our original plan, we need folks from Glance to raise their hands.

If I'm not elected PTL, I'm more than happy to help out here but we
need someone that can commit to the above right now and we'll likely
need a team of at least 2 people to help move this forward in early
Mitaka.


The work in Cinder is more complete, but may need to be reviewed
to ensure that it is using the API correctly, safely, and efficiently.
Again, this is a joint effort between the Glance and Cinder teams
to identify any issues and work out a resolution.

Part of this work will also be to audit the Glance API documentation,
to ensure it accurately reflects what the APIs expect to receive
and return. There are reportedly at least a few cases where things
are out of sync right now. This will require some coordination with
the Documentation team.


Those are the two big priorities I see, based on things the rest
of the community needs from the team and existing commitments that
have been made. There are some other things that should also be
addressed.


III. Security audits & bug fixes

Five of 18 recent security reports were related to Glance [7]. It's
not surprising, given recent resource constraints, that addressing
these has been a challenge. Still, these should be given high
priority.

[7]
https://security.openstack.org/search.html?q=glance&check_keywords=yes&area=default



+1 FWIW, we're in the process of growing Glance's security team. But
it's clear from the above that there needs to be quicker replies to
security issues.

IV. Sorting out the glance-store question

This was perhaps the most confusing thing I learned about this week.
The perception outside of the Glance team is that the library is
meant to be used by Nova and Cinder to communicate directly with
the image store, bypassing the REST API, to improve performance in
several cases. I know the Cinder team is especially interested in
some sort of interface for manipulating images inside the storage
system without having to download them to make copies (for RBD and
other systems that support CoW natively).

Correct, the above was one of the triggers for this effort and I
like to think it's still one of the main drivers. There are other
fancier things that could be done in the future assuming the
library's API is refactored in a way that such features can be
implemented.[0]

[0] https://review.openstack.org/#/c/188050/

That doesn't seem to be
what the library is actually good for, though, since most of the
Glance core folks I talked to thought it was really a caching layer.
This discrepancy in what folks wanted vs. what they got may explain
some of the heated discussions in other email threads.

It's strange that some folks think of it as a caching layer. I believe
one of the reasons there's such a discrepancy is that not enough
effort has been put into the refactor this library requires. The reason
this library requires such a refactor is that it came out of the old
`glance/store` code, which was very specific to Glance's internal use.

The mistake here could be that the library should've been refactored
*before* adopting it in Glance.


Frankly, given the importance of the other issues, I recommend
leaving glance-store standalone this cycle. Unless the work for
dealing with priorities I and II is made *significantly* easier by
not having a library, the time and energy it will take to re-integrate
it with the Glance service seems like a waste of limited resources.
The time to even discuss it may be better spent on the planning
work needed. That said, if the library doesn't provide the features
its users were expecting, it may be better to fold it back in and
create a different library with a better understanding of the
requirements at some point. The path to take is up to the Glance
team, of course, but we're already down far enough on the priority
list that I think we'll be lucky to finish the preceding items this
cycle.


I don't think merging glance-store back into Glance will help with any
of the priorities mentioned in this thread. If anything, refactoring
the API might help with future work that could come after the v1 -> v2
migration is complete.



Those are the development priorities I was able to identify in my
interviews this week, and there is one last thing the team needs
to do this cycle: Recruit more contributors.

Almost every current core contributor I spoke with this week indicated
that their time was split between another project and Glance. Often
higher priority had to be given, understandably, to internal product
work. That's the reality we work in, and everyone feels the same
pressures to some degree. One way to address that pressure is to
bring in help. So, we need a recruiting drive to find folks willing
to contribute code and reviews to the project to keep the team
healthy. I listed this item last because if you've made it this far
you should see just how much work the team has ahead. We're a big
community, and I'm confident that we'll be able to find help for
the Glance team, but it will require mentoring and education to
bring people up to speed to make them productive.

Fully agree here as well. However, I also believe that the fact that
some efforts have gone to the wrong tasks has taken Glance to the
situation it is in today. More help is welcome and required but a good
strategy is more important right now.

FWIW, I agree that our focus has gone to different things and this has
taken us to the status you mentioned above. More importantly, it's
postponed some important tasks. However, I don't believe Glance is
completely broken - I know you are not saying this but I'd like to
mention it - and I certainly believe we can bring it back to a good
state faster than expected, but I'm known for being a bit optimistic
sometimes.

In this reply I was hard on us (Glance team), because I tend to be
hard on myself and to dig deep into the things that are not working
well. Many times I do this based on the feedback provided by others,
which I personally value **a lot**. Unfortunately, I have to say that
there hasn't been enough feedback about these issues until now. There
was Mike's email[0] where I explicitly asked the community to speak
up. This is to say that I very much appreciate the time you've taken
to dig into this, and that I'd like to encourage folks to *always*
speak up and reach out through every *public* medium possible.

No one can fix rumors; we can fix issues, though.

Thanks again and let's all work together to improve this situation,
Flavio

[0]
http://lists.openstack.org/pipermail/openstack-dev/2015-August/071971.html



__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


