Re: [PATCH] drm/scheduler: fix race condition in load balancer

Nirmoy Wed, 15 Jan 2020 05:23:12 -0800

I think I know why it happens. At init, every entity's rq gets assigned to sched_list[0]. I put in some prints to check what we compare in drm_sched_entity_get_free_sched.
It turns out that most of the time it compares zero values (num_jobs(0) < min_jobs(0)), so most of the time the 1st rq (sdma0, comp_1.0.0) was picked by drm_sched_entity_get_free_sched.
Well, that is expected because the unit tests always do submission, wait, submission, wait, submission, wait.... So the number of jobs in the scheduler becomes zero in between.
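
To illustrate the point above, here is a minimal userspace sketch of a "fewest jobs wins" pick with a strict less-than comparison. It is not the kernel code: struct model_sched, pick_least_loaded() and count_picks_of_first() are hypothetical names made up for this example, and the real drm_sched_entity_get_free_sched may differ in detail. It only shows that when every counter is zero at pick time, the first scheduler in the list always wins the tie.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduler; only the job counter matters here. */
struct model_sched {
    unsigned int num_jobs;
};

/*
 * Pick the scheduler with the fewest jobs. Because the comparison is a
 * strict '<', a later scheduler only wins if it is strictly less loaded,
 * so every tie goes to the lowest index.
 */
static size_t pick_least_loaded(const struct model_sched *list, size_t n)
{
    size_t best = 0;
    unsigned int min_jobs;

    assert(n > 0);
    min_jobs = list[0].num_jobs;
    for (size_t i = 1; i < n; i++) {
        if (list[i].num_jobs < min_jobs) {
            min_jobs = list[i].num_jobs;
            best = i;
        }
    }
    return best;
}

/*
 * Model the unit-test pattern: submission, wait, submission, wait...
 * Each job completes before the next pick, so all counters are zero again.
 * Returns how many of the picks landed on scheduler 0.
 */
static size_t count_picks_of_first(struct model_sched *list, size_t n, int rounds)
{
    size_t first = 0;

    for (int r = 0; r < rounds; r++) {
        size_t idx = pick_least_loaded(list, n);

        list[idx].num_jobs++;   /* submission */
        if (idx == 0)
            first++;
        list[idx].num_jobs--;   /* wait: job finishes before the next submission */
    }
    return first;
}
```

With three idle schedulers and ten submit/wait rounds, every pick in this model lands on index 0, which matches what the prints showed.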

Even with multiple parallel instances of amdgpu_test, I haven't seen any improvement in the load balance.

This patch was not correct; it had an extra atomic_inc(num_jobs) in drm_sched_job_init. This probably added a bit of randomness, I think, which helped in better job distribution.
Mhm, that might not be a bad idea after all. We could rename num_jobs into something like score and do a +1 in drm_sched_rq_add_entity() and a -1 in drm_sched_rq_remove_entity().
That should have pretty much the effect we want to have.
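
A rough userspace sketch of the score idea, under stated assumptions: struct score_sched, model_rq_add_entity(), model_rq_remove_entity() and pick_lowest_score() are hypothetical names for this example only and are not the kernel functions. The counter is bumped when an entity is attached to a run queue and dropped when it is removed, so it stays non-zero while jobs come and go.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduler carrying the proposed score. */
struct score_sched {
    int score;
};

/* Models the "+1" proposed for drm_sched_rq_add_entity(). */
static void model_rq_add_entity(struct score_sched *s)
{
    s->score++;
}

/* Models the "-1" proposed for drm_sched_rq_remove_entity(). */
static void model_rq_remove_entity(struct score_sched *s)
{
    assert(s->score > 0);
    s->score--;
}

/* Pick the scheduler with the lowest score; ties go to the lowest index. */
static size_t pick_lowest_score(const struct score_sched *list, size_t n)
{
    size_t best = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        if (list[i].score < list[best].score)
            best = i;
    }
    return best;
}
```

Because the score reflects attached entities rather than in-flight jobs, it does not fall back to zero between a submission and the following wait, so in this model successive entities spread across the schedulers instead of all landing on the first one.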

That sounds good as well. I will create a patch.

I've updated my previous RFC patch, which uses the time consumed by each sched for load balancing, with a twist of ignoring the previously scheduled sched/rq. Let me know what you think.
I didn't have time yet to wrap my head around that in detail, but at least offhand Luben is right that the locking looks really awkward.

I was unable to find a better way to do the locking part. My mail client might've missed Luben's review, can't find it :/



Regards,

Nirmoy

_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
