On 8/24/22 18:32, Ilya Maximets wrote:
> On 8/24/22 17:57, Dumitru Ceara wrote:
>> On 8/24/22 17:49, Ilya Maximets wrote:
>>> On 8/24/22 17:43, Ilya Maximets wrote:
>>>> On 8/23/22 16:42, Dumitru Ceara wrote:
>>>>> After removing the possibility of disabling logical datapath groups the
>>>>> above mentioned state should only be relevant when ovn-northd is started
>>>>> with more than one lflow processing thread (--n-threads > 1), to avoid
>>>>> a very inefficient first lflow processing run due to a suboptimally
>>>>> sized hmap.
>>>>>
>>>>> There's no need to ever go back to this state now that we don't allow
>>>>> users to disable logical datapath groups. Moreover, the previous code
>>>>> was wrongfully setting the state to STATE_INIT_HASH_SIZES at every
>>>>> iteration, essentially disabling parallelization.
>>>>>
>>>>> Fixes: 9dea0c090e91 ("nb: Remove possibility of disabling logical
>>>>> datapath groups.")
>>>>> Reported-by: Ilya Maximets <[email protected]>
>>>>> Signed-off-by: Dumitru Ceara <[email protected]>
>>>>> ---
>>>>
>>>> Hi, Dumitru. Thanks for the patch!
>>
>> Hi Ilya,
>>
>>>> I've been using it in multiple scale tests since it was posted,
>>>> and it does re-enable parallel lflow build.
>>>>
>>
>> Thanks for trying it out!
>>
>>>> One issue, though. I'm not sure how to fix it or what the desired
>>>> behavior is, but here is what I noticed:
>>>>
>>>> 1. A cluster with 3 northd instances starts up.
>>>> 2. After a while, all of them have been active at least once, due
>>>> to leadership transfers or other things.
>>>> 3. If one instance stays active for long, a lot of database changes
>>>> can be accumulated.
>>>> 4. Once the leadership flips, the new active northd is not prepared
>>>> for the number of logical flows it suddenly needs to manage. So,
>>>> it starts parallel processing with an hmap significantly smaller
>>>> than it actually needs. That iteration takes noticeably more time
>>>> than a usual one. After this first iteration after becoming
>>>> active, it can process lflows at normal speed.
>>>>
>>>> So, I wonder if we can fall into a catastrophic scenario where
>>>> the current hash map size is very small and the number of flows
>>>> after re-connection is very high, so the threads just fight over
>>>> locks all the time.
>>>>
>>>> Do you have some thoughts on this?
>>>>
>>
>> This seems like a possible scenario (although I wonder how often we'd
>> hit it in practice).
>>
>> But this was a possible scenario before 9dea0c090e91 ("nb: Remove
>> possibility of disabling logical datapath groups.") too, right?
>
> Probably, yes. There is no need to block this particular patch as
> it fixes a separate bug. So,
>
> Acked-by: Ilya Maximets <[email protected]>
>
Thanks!
>>
>>>> Should we disable parallelism when northd goes into stand-up state?
>>>
>>> * I meant 'standby', of course. :/
>>>
>>
>> We could, but that means every time northd becomes active it will do
>> one inefficient first run. Can we instead store the max_seen_lflow_size
>> value in the SB database and use the one we read from the DB?
>
> It's an interesting idea, if it can be implemented efficiently. I'm not
> sure how much extra traffic or how many northd wakeups that would
> generate. It might make more sense to store it in the NB, just to avoid
> spamming all the controllers that use monitor-all. Also, the NB DB
> generally has more room performance-wise.
>
I'll try to work on a patch to store it in the NB database; we already
write to the NB for other fields like Logical_Switch_Port.up, so it
should be acceptable. I'll run some benchmarks to see if there's any
measurable negative impact and report back.
Regards,
Dumitru
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev