Yes, 50-100 networks received by the DHCP agent on startup can delay the
second state report well past its scheduled time. In my tests, if I recall
correctly, it was ~70 networks, and the delay between the first and second
state reports was around 25 seconds (while a 5-second interval was
configured).
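
For illustration, here is a tiny self-contained eventlet sketch (toy code,
nothing from the actual agent) of how a periodic report greenthread drifts
when it competes with many greenthreads that rarely yield. With 70 simulated
networks at ~0.3s of non-yielding work each, the report scheduled for t=5s
does not fire until roughly t=21s:

    import time
    import eventlet
    eventlet.monkey_patch()

    REPORT_INTERVAL = 5
    START = time.time()

    def report_state():
        while True:
            eventlet.sleep(REPORT_INTERVAL)
            print("state report at t=%.1fs" % (time.time() - START))

    def sync_network(net_id):
        # stand-in for per-network processing that hits no yield point
        deadline = time.time() + 0.3
        while time.time() < deadline:
            pass

    eventlet.spawn(report_state)
    pool = eventlet.GreenPool()
    for net in range(70):            # ~70 networks, as in the test above
        pool.spawn_n(sync_network, net)
    pool.waitall()
    eventlet.sleep(0.1)              # let the overdue report finally fire

All 70 zero-deadline worker timers are ahead of the reporter's t=5s timer in
the hub, so the report is only sent once the queued work drains, which is
essentially what I observed.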
Eugene.

On Sun, Jun 7, 2015 at 11:11 PM, Kevin Benton <blak...@gmail.com> wrote:

> Well, a greenthread will only yield when it makes a blocking call, like
> writing to a network socket or file. So once the report_state greenthread
> starts executing, it won't yield until it makes a call like that.
>
> I looked through the report_state code for the DHCP agent, and the only
> blocking call it seems to make is the AMQP report_state call/cast itself.
> So even with a bunch of other workers, the report_state thread should get
> execution fairly quickly, since most of our workers should yield very
> frequently when they make process calls, etc. That's why I assumed that
> there must be something actually stopping it from sending the message.
>
> Do you have a way to reproduce the issue with the DHCP agent?
>
> On Sun, Jun 7, 2015 at 9:21 PM, Eugene Nikanorov <enikano...@mirantis.com>
> wrote:
>
>> No, I think the greenthread itself doesn't do anything special; it's
>> just that when there are too many threads, the state_report thread can't
>> get control for too long, since there is no prioritization of
>> greenthreads.
>>
>> Eugene.
>>
>> On Sun, Jun 7, 2015 at 8:24 PM, Kevin Benton <blak...@gmail.com> wrote:
>>
>>> I understand now. So the issue is that the report_state greenthread is
>>> just blocking and yielding whenever it tries to actually send a message?
>>>
>>> On Sun, Jun 7, 2015 at 8:10 PM, Eugene Nikanorov <
>>> enikano...@mirantis.com> wrote:
>>>
>>>> Salvatore,
>>>>
>>>> By 'fairness' I meant the chances for the state report greenthread to
>>>> get control. In the DHCP case, each network is processed by a separate
>>>> greenthread, so the more greenthreads the agent has, the lower the
>>>> chance that the report state greenthread will be able to report in
>>>> time.
>>>>
>>>> Thanks,
>>>> Eugene.
>>>>
>>>> On Sun, Jun 7, 2015 at 4:15 AM, Salvatore Orlando <sorla...@nicira.com>
>>>> wrote:
>>>>
>>>>> On 5 June 2015 at 01:29, Itsuro ODA <o...@valinux.co.jp> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> > After trying to reproduce this, I'm suspecting that the issue is
>>>>>> > actually on the server side from failing to drain the agent
>>>>>> > report state queue in time.
>>>>>>
>>>>>> I have seen this before.
>>>>>> My thinking on the scenario at that time was as follows:
>>>>>> * a lot of create/update resource API calls are issued
>>>>>> * the "rpc_conn_pool_size" pool is exhausted by notifications being
>>>>>>   sent, which blocks everything else on the sending side of RPC
>>>>>> * the "rpc_thread_pool_size" pool is exhausted by handlers waiting
>>>>>>   on the "rpc_conn_pool_size" pool to send their RPC replies
>>>>>> * receiving state_report is blocked because the
>>>>>>   "rpc_thread_pool_size" pool is exhausted
>>>>>>
>>>>> I think this could be a good explanation, couldn't it?
>>>>> Kevin proved that the periodic tasks are not mutually exclusive and
>>>>> that long process times for sync_routers are not an issue.
>>>>> However, he correctly suspected server-side involvement, which could
>>>>> actually be a lot of requests saturating the RPC pool.
>>>>>
>>>>> On the other hand, how could we use this theory to explain why this
>>>>> issue tends to occur when the agent is restarted?
>>>>> Also, Eugene, what do you mean by stating that the issue could be in
>>>>> the agent's "fairness"?
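>>>>>
>>>>> (To make the saturation scenario above concrete, here is a toy
>>>>> eventlet sketch, not the real oslo RPC code: once every slot of a
>>>>> fixed-size pool is held by a blocked worker, even a trivial new task,
>>>>> such as handling a state_report, has to wait for a slot.)
>>>>>
>>>>>     import time
>>>>>     import eventlet
>>>>>     eventlet.monkey_patch()
>>>>>
>>>>>     pool = eventlet.GreenPool(size=4)  # stand-in for rpc_thread_pool_size
>>>>>
>>>>>     def stuck_worker(i):
>>>>>         # stand-in for a handler waiting on an exhausted conn pool
>>>>>         eventlet.sleep(10)
>>>>>
>>>>>     for i in range(4):
>>>>>         pool.spawn_n(stuck_worker, i)
>>>>>
>>>>>     start = time.time()
>>>>>     pool.spawn_n(lambda: None)   # stand-in for a state_report handler
>>>>>     print("state_report admitted after %.1fs" % (time.time() - start))
>>>>>
>>>>> (The last spawn_n only returns after ~10s, when the first worker frees
>>>>> a slot.)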
>>>>>
>>>>> Salvatore
>>>>>
>>>>>> Thanks
>>>>>> Itsuro Oda
>>>>>>
>>>>>> On Thu, 4 Jun 2015 14:20:33 -0700
>>>>>> Kevin Benton <blak...@gmail.com> wrote:
>>>>>>
>>>>>> > After trying to reproduce this, I'm suspecting that the issue is
>>>>>> > actually on the server side from failing to drain the agent report
>>>>>> > state queue in time.
>>>>>> >
>>>>>> > I set the report_interval to 1 second on the agent and added a
>>>>>> > logging statement, and I see a report every 1 second even when
>>>>>> > sync_routers is taking a really long time.
>>>>>> >
>>>>>> > On Thu, Jun 4, 2015 at 11:52 AM, Carl Baldwin <c...@ecbaldwin.net>
>>>>>> > wrote:
>>>>>> >
>>>>>> > > Ann,
>>>>>> > >
>>>>>> > > Thanks for bringing this up. It has been on the shelf for a
>>>>>> > > while now.
>>>>>> > >
>>>>>> > > Carl
>>>>>> > >
>>>>>> > > On Thu, Jun 4, 2015 at 8:54 AM, Salvatore Orlando <
>>>>>> > > sorla...@nicira.com> wrote:
>>>>>> > > > One reason for not sending the heartbeat from a separate
>>>>>> > > > greenthread could be that the agent is already doing it [1].
>>>>>> > > > The current proposed patch addresses the issue blindly - that
>>>>>> > > > is to say, before declaring an agent dead, let's wait some
>>>>>> > > > more time because it could be stuck doing stuff. In that case
>>>>>> > > > I would probably make the multiplier (currently 2x)
>>>>>> > > > configurable.
>>>>>> > > >
>>>>>> > > > The reason the state report does not occur is probably that
>>>>>> > > > both it and the resync procedure are periodic tasks. If I got
>>>>>> > > > it right, they're both executed as eventlet greenthreads, but
>>>>>> > > > one at a time. Perhaps adding an initial delay to the full
>>>>>> > > > sync task would ensure that the first thing an agent does
>>>>>> > > > when it comes up is send a heartbeat to the server?
>>>>>> > > >
>>>>>> > > > On the other hand, while doing the initial full resync, is
>>>>>> > > > the agent able to process updates? If not, perhaps it makes
>>>>>> > > > sense to consider it down until it finishes synchronisation.
>>>>>> > >
>>>>>> > > Yes, it can! The agent prioritizes updates from RPC over full
>>>>>> > > resync activities.
>>>>>> > >
>>>>>> > > I wonder if the agent should check how long it has been since
>>>>>> > > its last state report each time it finishes processing an update
>>>>>> > > for a router, as in the sketch below. It normally doesn't take
>>>>>> > > very long (relatively) to process an update to a single router.
>>>>>> > >
>>>>>> > > I still would like to know why the thread to report state is
>>>>>> > > being starved. Anyone have any insight on this? I thought that
>>>>>> > > with all the system calls, the greenthreads would yield often.
>>>>>> > > There must be something I don't understand about it.
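>>>>>> > >
>>>>>> > > A rough sketch of that check (names are hypothetical, this is
>>>>>> > > not the actual agent code):
>>>>>> > >
>>>>>> > >     import time
>>>>>> > >
>>>>>> > >     class RouterUpdateHandler(object):
>>>>>> > >         # hypothetical helper, for illustration only
>>>>>> > >         def _process_router_update(self, update):
>>>>>> > >             self._apply_update(update)  # fast for one router
>>>>>> > >             # piggyback a state report if the periodic one is
>>>>>> > >             # overdue, instead of waiting for its greenthread
>>>>>> > >             overdue = (time.time() - self._last_report
>>>>>> > >                        >= self.conf.report_interval)
>>>>>> > >             if overdue:
>>>>>> > >                 self._report_state()
>>>>>> > >                 self._last_report = time.time()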
>>>>>> > >
>>>>>> > > Carl
>>>>>> >
>>>>>> > --
>>>>>> > Kevin Benton
>>>>>>
>>>>>> --
>>>>>> Itsuro ODA <o...@valinux.co.jp>
>>>
>>> --
>>> Kevin Benton
>
> --
> Kevin Benton