On 09/11/2014 01:44 PM, Sean Dague wrote:
> On 09/10/2014 08:46 PM, Jamie Lennox wrote:
>>
>> ----- Original Message -----
>>> From: "Steven Hardy" <sha...@redhat.com>
>>> To: "OpenStack Development Mailing List (not for usage questions)"
>>> <openstack-dev@lists.openstack.org>
>>> Sent: Thursday, September 11, 2014 1:55:49 AM
>>> Subject: Re: [openstack-dev] [all] [clients] [keystone] lack of
>>> retrying tokens leads to overall OpenStack fragility
>>>
>>> On Wed, Sep 10, 2014 at 10:14:32AM -0400, Sean Dague wrote:
>>>> Going through the untriaged Nova bugs, and there are a few on a
>>>> similar pattern:
>>>>
>>>> Nova operation in progress.... takes a while
>>>> Crosses keystone token expiration time
>>>> Timeout thrown
>>>> Operation fails
>>>> Terrible 500 error sent back to user
>>>
>>> We actually have this exact problem in Heat, which I'm currently
>>> trying to solve:
>>>
>>> https://bugs.launchpad.net/heat/+bug/1306294
>>>
>>> Can you clarify, is the issue either:
>>>
>>> 1. Create novaclient object with username/password
>>> 2. Do a series of operations via the client object which eventually
>>>    fail after $n operations due to token expiry
>>>
>>> or:
>>>
>>> 1. Create novaclient object with username/password
>>> 2. Run some really long operation, such that the token expires while
>>>    the service is handling the request, blowing up and 500-ing
>>>
>>> If the former, then it does sound like a client, or usage-of-client,
>>> bug, although note that if you pass a *token* vs username/password
>>> (as is currently done for glance and heat in tempest, because we
>>> lack the code to get the token outside of the shell.py code..),
>>> there's nothing the client can do, because you can't request a new
>>> token with a longer expiry using only a token...
>>>
>>> However, if the latter, then it seems like it's not really a client
>>> problem to solve, as it's hard to know what action to take if a
>>> request failed part-way through and thus things are in an unknown
>>> state.
>>>
>>> The latter is a hard problem, which can possibly be solved by
>>> switching to a trust-scoped token (the service impersonates the
>>> user), but then you're effectively bypassing token expiry via
>>> delegation, which sits uncomfortably with me (despite the fact that
>>> we may have to do this in heat to solve the aforementioned bug).
>>>
>>>> It seems like we should have a standard pattern where, on token
>>>> expiration, the underlying code at least gives one retry to try to
>>>> establish a new token to complete the flow; however, as far as I
>>>> can tell *no* clients do this.
>>>
>>> As has been mentioned, using sessions may be one solution to this,
>>> and AFAIK session support (where it doesn't already exist) is
>>> getting into various clients via the work being carried out to add
>>> support for v3 keystone by David Hu:
>>>
>>> https://review.openstack.org/#/q/owner:david.hu%2540hp.com,n,z
>>>
>>> I see patches for Heat (currently gating), Nova and Ironic.
>>>
>>>> I know we had to add that into Tempest because tempest runs can
>>>> exceed 1 hr, and we want to avoid random fails just because we
>>>> cross a token expiration boundary.
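To make the "at least one retry" pattern Sean is asking for concrete,
here is a minimal sketch. This is purely illustrative: no client ships
a helper like this today, the helper and callback names are invented,
and each client raises its own Unauthorized type (keystoneclient's is
used here as a stand-in):

from keystoneclient import exceptions as ks_exc

def retry_once_on_expiry(operation, reauthenticate):
    # 'operation' is any no-argument client call; 'reauthenticate' is
    # whatever refreshes the token on the client object. Both names
    # are assumptions for illustration.
    try:
        return operation()
    except ks_exc.Unauthorized:
        # The token expired mid-flow: fetch a fresh one and retry
        # exactly once instead of failing the whole action.
        reauthenticate()
        return operation()

# Usage (names hypothetical):
# servers = retry_once_on_expiry(lambda: nova.servers.list(), relogin)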
>>>
>>> I can't claim great experience with sessions yet, but AIUI you
>>> could do something like:
>>>
>>> from keystoneclient.auth.identity import v3
>>> from keystoneclient import session
>>> from keystoneclient.v3 import client
>>>
>>> auth = v3.Password(auth_url=OS_AUTH_URL,
>>>                    username=USERNAME,
>>>                    password=PASSWORD,
>>>                    project_id=PROJECT,
>>>                    user_domain_name='default')
>>> sess = session.Session(auth=auth)
>>> ks = client.Client(session=sess)
>>>
>>> And if you can pass the same session into the various clients
>>> tempest creates, then the Password auth-plugin code takes care of
>>> reauthenticating if the token cached in the auth plugin object is
>>> expired, or nearly expired:
>>>
>>> https://github.com/openstack/python-keystoneclient/blob/master/keystoneclient/auth/identity/base.py#L120
>>>
>>> So in the tempest case, it seems like it may be a case of migrating
>>> the code creating the clients to use sessions instead of passing a
>>> token or username/password into the client object?
>>>
>>> That's my understanding of it atm anyway; hopefully jamielennox
>>> will be along soon with more details :)
>>>
>>> Steve
>>
>>
>> By clients here, are you referring to the CLIs or the python
>> libraries? Implementation is at different points with each.
>>
>> Sessions will handle automatically reauthenticating and retrying a
>> request; however, this relies on the service throwing a 401
>> Unauthorized error. If a service is returning a 500 (or a timeout?)
>> then there isn't much that a client can or should do, because we
>> can't assume that trying again with a new token will solve anything.
>>
>> At the moment we have keystoneclient, novaclient, cinderclient,
>> neutronclient and then a number of the smaller projects with support
>> for sessions. That obviously doesn't mean that existing users of
>> that code have transitioned to the newer way, though. David Hu has
>> been working on using this code within the existing CLIs. I have
>> prototypes for at least nova talking to neutron and cinder, which
>> I'm waiting for Kilo to push. From there it should be easier to do
>> this for other services.
>>
>> For service-to-service communication there are two types:
>> 1) Using the user's token, as nova->cinder does. If this token
>> expires, there is really nothing that nova can do except raise 401
>> and make the client do it again.
>
> In this case it would be really good to do at least 1 retry, because
> it's completely silly for us to fail an action based on a token
> timeout. The solution ops are using is changing their token
> expiration back to some really large number.
>
>> 2) Using a service user, as nova->neutron does. This should allow
>> automatic reauthentication and will be fixed/standardized by
>> sessions.
>
> OK, glanceclient should be a high target here, because that's often
> involved in long-running things (snapshot manipulation is slow).
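To make the session side of this concrete end to end: once the
session patches land in the individual clients, something like the
sketch below should work. The placeholders are the same as in Steve's
example above, and novaclient accepting session= is based on the
in-flight session work, so treat that part as an assumption rather
than a released API:

from keystoneclient.auth.identity import v3
from keystoneclient import session
from novaclient import client as nova_client

auth = v3.Password(auth_url=OS_AUTH_URL,
                   username=USERNAME,
                   password=PASSWORD,
                   project_id=PROJECT,
                   user_domain_name='default')
sess = session.Session(auth=auth)

# One session shared by every client: the Password plugin owns the
# token, re-authenticates when a service answers 401, and the request
# is retried, so a long run that crosses the expiry boundary simply
# picks up a fresh token and carries on.
nova = nova_client.Client('2', session=sess)
servers = nova.servers.list()

The same session object would then be handed to glanceclient,
cinderclient and friends as they grow support, which is also what
makes this attractive for the service-user (nova->neutron) case.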
Agreed that glanceclient is a good target. I started looking at this
a couple of weeks ago, but I'm still not sure what the best way to do
it is. The failure is common when uploading huge images, and I also
agree that at least 1 retry should be attempted.

Flavio

--
@flaper87
Flavio Percoco

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev