Re: amdgpu problem after kexec

2021-02-03 Thread Eric W. Biederman
Alex Deucher  writes:

> On Wed, Feb 3, 2021 at 3:36 AM Dave Young  wrote:
>>
>> Hi Baoquan,
>>
>> Thanks for ccing.
>> On 01/28/21 at 01:29pm, Baoquan He wrote:
>> > On 01/11/21 at 01:17pm, Alexander E. Patrakov wrote:
>> > > Hello,
>> > >
>> > > I was trying out kexec on my new laptop, which is a HP EliteBook 735
>> > > G6. The problem is, amdgpu does not have hardware acceleration after
>> > > kexec. Also, strangely, the lines about BlueTooth are missing from
>> > > dmesg after kexec, but I have not tried to use BlueTooth on this
>> > > laptop yet. I don't know how to debug this, the relevant amdgpu lines
>> > > in dmesg are:
>> > >
>> > > amdgpu :04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB
>> > > test failed on gfx (-110).
>> > > [drm:process_one_work] *ERROR* ib ring test failed (-110).
>> > >
>> > > The good and bad dmesg files are attached. Is it a kexec problem (and
>> > > amdgpu is only a victim), or should I take it to amdgpu lists? Do I
>> > > need to provide some extra kernel arguments for debugging?

The best debugging step I can think of is this: can you arrange to have the
amdgpu modules removed before the final kexec -e?

That would tell us if the code to shut down the GPU exists in the rmmod
path (aka the .remove method) and is simply missing in the kexec path
(aka the .shutdown method).
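
To make the two paths concrete, here is a minimal sketch with made-up
names (this is not the actual amdgpu code): the rmmod path ends in the
driver's .remove callback and the kexec path in its .shutdown callback,
and both can share one teardown helper.

#include <linux/pci.h>

/* Sketch only: hypothetical names, not the real amdgpu callbacks. */
static void example_quiesce(struct pci_dev *pdev)
{
        /* Stop rings and interrupts and, if needed, reset the ASIC so the
         * next kernel (after kexec) finds the engines in a clean state. */
}

static void example_remove(struct pci_dev *pdev)        /* rmmod path */
{
        example_quiesce(pdev);
}

static void example_shutdown(struct pci_dev *pdev)      /* kexec/reboot path */
{
        example_quiesce(pdev);
}

static struct pci_driver example_driver = {
        .name     = "example",
        .remove   = example_remove,
        .shutdown = example_shutdown,
};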


>> > I am not familiar with graphical component. Add Dave to CC to see if
>> > he has some comments. It would be great if amdgpu expert can have a look.
>>
>> It needs amdgpu driver people to help.  Since kexec bypasses
>> BIOS/UEFI initialization, we require drivers to implement a .shutdown
>> method and test it so that the 2nd kernel works correctly.
>
> kexec is tricky to make work properly on our GPUs.  The problem is
> that there are some engines on the GPU that cannot be re-initialized
> once they have been initialized without an intervening device reset.
> APUs are even trickier because they share a lot of hardware state with
> the CPU.  Doing lots of extra resets adds latency.  The driver has
> code to try and detect if certain engines are running at driver load
> time and do a reset before initialization to make this work, but it
> apparently is not working properly on your system.

There are two cases that I think sometimes get mixed up.

There is kexec-on-panic, in which case all of the work needs to happen in
the driver initialization.

There is also a simple kexec, in which case some of the work can happen
in the kernel that is being shut down, and sometimes that is easier.

Does it make sense to reset your device unconditionally on driver removal?
Would it make sense to reset your device unconditionally on driver add?

How can someone debug the smart logic of reset on driver load?

Eric


Re: [PATCH RFC PKS/PMEM 51/58] kernel: Utilize new kmap_thread()

2020-10-10 Thread Eric W. Biederman
ira.we...@intel.com writes:

> From: Ira Weiny 
>
> This kmap() call is localized to a single thread.  To avoid the over
> head of global PKRS updates use the new kmap_thread() call.

Acked-by: "Eric W. Biederman" 

>
> Cc: Eric Biederman 
> Signed-off-by: Ira Weiny 
> ---
>  kernel/kexec_core.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index c19c0dad1ebe..272a9920c0d6 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -815,7 +815,7 @@ static int kimage_load_normal_segment(struct kimage 
> *image,
>   if (result < 0)
>   goto out;
>  
> - ptr = kmap(page);
> + ptr = kmap_thread(page);
>   /* Start with a clear page */
>   clear_page(ptr);
>   ptr += maddr & ~PAGE_MASK;
> @@ -828,7 +828,7 @@ static int kimage_load_normal_segment(struct kimage 
> *image,
>   memcpy(ptr, kbuf, uchunk);
>   else
>   result = copy_from_user(ptr, buf, uchunk);
> - kunmap(page);
> + kunmap_thread(page);
>   if (result) {
>   result = -EFAULT;
>   goto out;
> @@ -879,7 +879,7 @@ static int kimage_load_crash_segment(struct kimage *image,
>   goto out;
>   }
>   arch_kexec_post_alloc_pages(page_address(page), 1, 0);
> - ptr = kmap(page);
> + ptr = kmap_thread(page);
>   ptr += maddr & ~PAGE_MASK;
>   mchunk = min_t(size_t, mbytes,
>   PAGE_SIZE - (maddr & ~PAGE_MASK));
> @@ -895,7 +895,7 @@ static int kimage_load_crash_segment(struct kimage *image,
>   else
>   result = copy_from_user(ptr, buf, uchunk);
>   kexec_flush_icache_page(page);
> - kunmap(page);
> + kunmap_thread(page);
>   arch_kexec_pre_free_pages(page_address(page), 1);
>   if (result) {
>   result = -EFAULT;


Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process.

2018-05-01 Thread Eric W. Biederman
Christian König <ckoenig.leichtzumer...@gmail.com> writes:

> Hi Eric,
>
> sorry for the late response, was on vacation last week.
>
> On 26.04.2018 at 02:01, Eric W. Biederman wrote:
>> Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:
>>
>>> On 04/25/2018 01:17 PM, Oleg Nesterov wrote:
>>>> On 04/25, Andrey Grodzovsky wrote:
>>>>> here (drm_sched_entity_fini) is also a bad idea, but we still want to be
>>>>> able to exit immediately
>>>>> and not wait for GPU jobs completion when the reason for reaching this 
>>>>> code
>>>>> is because of KILL
>>>>> signal to the user process who opened the device file.
>>>> Can you hook f_op->flush method?
>
> THANKS! That sounds like a really good idea to me and we haven't investigated
> into that direction yet.

For the backwards compatibility concerns you cite below, the flush method
seems a much better place to introduce the wait.  You will at least really
be in process context there.  You might still be in exit, but at least
you will legitimately be in a process.

>>> But this one is called for each task releasing a reference to the the file, 
>>> so
>>> not sure I see how this solves the problem.
>> The big question is why do you need to wait during the final closing of a
>> file?
>
> As always it's because of historical reasons. Initially user space pushed
> commands directly to a hardware queue and when a processes finished we didn't
> need to wait for anything.
>
> Then the GPU scheduler was introduced which delayed pushing the jobs to the
> hardware queue to a later point in time.
>
> This wait was then added to maintain backward compability and not break
> userspace (but see below).

That makes sense.

>> The wait can be terminated so the wait does not appear to be simply a
>> matter of correctness.
>
> Well when the process is killed we don't care about correctness any more, we
> just want to get rid of it as quickly as possible (OOM situation etc...).
>
> But it is perfectly possible that a process submits some render commands and
> then calls exit() or terminates because of a SIGTERM, SIGINT etc.. In this 
> case
> we need to wait here to make sure that all rendering is pushed to the hardware
> because the scheduler might need resources/settings from the file
> descriptor.
>
> For example if you just remove that wait you could close firefox and get 
> garbage
> on the screen for a millisecond because the remaining rendering commands where
> not executed.
>
> So what we essentially need is to distinct between a SIGKILL (which means stop
> processing as soon as possible) and any other reason because then we don't 
> want
> to annoy the user with garbage on the screen (even if it's just for a few
> milliseconds).

I see a couple of issues.

- Running the code in release rather than in flush.

Using flush will catch every close so it should be more backwards
compatible.  f_op->flush always runs in process context so looking at
current makes sense.

- Distinguishing between death by SIGKILL and other process exit deaths.

In f_op->flush the code can test "((tsk->flags & PF_EXITING) &&
(tsk->exit_code == SIGKILL))" to see if it was SIGKILL that terminated
the process.

- Dealing with stuck queues (where this patchset came in).

For stuck queues you are going to need a timeout instead of the current
indefinite wait after PF_EXITING is set.  From what you have described a
few milliseconds should be enough.  If PF_EXITING is not set you can
still just make the wait killable and skip the timeout if that will give
a better backwards compatible user experience.

What can't be done is to try to catch SIGKILL after a process has called
do_exit.  A dead process is a dead process.
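
Putting those three points together, a rough sketch of an f_op->flush
style wait could look like the following.  The entity->sched back
pointer and the drm_sched_entity_is_idle() predicate are assumed names
used only for illustration; this is not the actual scheduler code.

#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/sched.h>
#include <linux/wait.h>
#include <drm/gpu_scheduler.h>

static int example_entity_flush(struct drm_sched_entity *entity)
{
        struct drm_gpu_scheduler *sched = entity->sched;  /* assumed field */
        struct task_struct *tsk = current;  /* flush runs in process context */

        /* Death by SIGKILL: stop processing as soon as possible, no wait. */
        if ((tsk->flags & PF_EXITING) && tsk->exit_code == SIGKILL)
                return -ERESTARTSYS;

        /* Any other exit: bounded wait so a stuck queue cannot hang exit. */
        if (tsk->flags & PF_EXITING) {
                if (!wait_event_timeout(sched->job_scheduled,
                                        drm_sched_entity_is_idle(entity),
                                        msecs_to_jiffies(50)))
                        return -ETIMEDOUT;
                return 0;
        }

        /* Ordinary close(): keep the killable, indefinite wait. */
        return wait_event_killable(sched->job_scheduled,
                                   drm_sched_entity_is_idle(entity));
}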

Eric


Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process.

2018-04-27 Thread Eric W. Biederman
Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:

> On 04/26/2018 08:34 AM, Andrey Grodzovsky wrote:
>>
>>
>> On 04/25/2018 08:01 PM, Eric W. Biederman wrote:
>>> Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:
>>>
>>>> On 04/25/2018 01:17 PM, Oleg Nesterov wrote:
>>>>> On 04/25, Andrey Grodzovsky wrote:
>>>>>> here (drm_sched_entity_fini) is also a bad idea, but we still want to be
>>>>>> able to exit immediately
>>>>>> and not wait for GPU jobs completion when the reason for reaching this
>>>>>> code
>>>>>> is because of KILL
>>>>>> signal to the user process who opened the device file.
>>>>> Can you hook f_op->flush method?
>>>> But this one is called for each task releasing a reference to the the file,
>>>> so
>>>> not sure I see how this solves the problem.
>>> The big question is why do you need to wait during the final closing of a
>>> file?
>>>
>>> The wait can be terminated so the wait does not appear to be simply a
>>> matter of correctness.
>>
>> Well, as I understand it, it just means that you don't want to abruptly
>> terminate GPU work in progress without a good
>> reason (such as KILL signal). When we exit we are going to release various
>> resources GPU is still using so we either
>> wait for it to complete or terminate the remaining jobs.

At the point of do_exit you might as well have received a KILL signal,
however you got there.

> Looked more into code, some correction, drm_sched_entity_fini means the SW job
> queue itself is about to die, so we must
> either wait for completion or terminate any outstanding jobs that are still in
> the SW queue. Anything which already in flight in HW
> will still complete.

It sounds like we don't care whether we block the process that had the file
descriptor open; this is just bookkeeping.  That allows having a piece
of code that cleans up resources when the GPU is done with the queue but
does not make userspace wait (option 1).

For it to make sense that we let the process run, there has to be
something that cares about the results being completed.  If all of the
file descriptors are closed and the process is killed, I can't see who
will care whether the software queue continues to be processed.  So it
may be reasonable to simply kill the queue (option 2).

If userspace really needs the wait it is probably better done in
f_op->flush so that every close of the file descriptor blocks
until the queue is flushed (option 3).

Do you know if userspace cares about the gpu operations completing?

My skim of the code suggests that nothing actually cares about those
operations, but I really don't know the gpu well.

Eric


Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process.

2018-04-25 Thread Eric W. Biederman
Andrey Grodzovsky  writes:

> On 04/25/2018 01:17 PM, Oleg Nesterov wrote:
>> On 04/25, Andrey Grodzovsky wrote:
>>> here (drm_sched_entity_fini) is also a bad idea, but we still want to be
>>> able to exit immediately
>>> and not wait for GPU jobs completion when the reason for reaching this code
>>> is because of KILL
>>> signal to the user process who opened the device file.
>> Can you hook f_op->flush method?
>
> But this one is called for each task releasing a reference to the the file, so
> not sure I see how this solves the problem.

The big question is why do you need to wait during the final closing of a
file?

The wait can be terminated so the wait does not appear to be simply a
matter of correctness.

Eric


Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-25 Thread Eric W. Biederman
Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:

> On 04/24/2018 12:30 PM, Eric W. Biederman wrote:
>> "Panariti, David" <david.panar...@amd.com> writes:
>>
>>> Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:
>>>> Kind of dma_fence_wait_killable, except that we don't have such API
>>>> (maybe worth adding ?)
>>> Depends on how many places it would be called, or think it might be called. 
>>>  Can always factor on the 2nd time it's needed.
>>> Factoring, IMO, rarely hurts.  The factored function can easily be visited 
>>> using `M-.' ;->
>>>
>>> Also, if the wait could be very long, would a log message, something like 
>>> "xxx has run for Y seconds."  help?
>>> I personally hate hanging w/no info.
>> Ugh.  This loop appears susceptible to losing wake-ups.  There are
>> races between when a wake-up happens, when we clear the sleeping state,
>> and when we test the state to see if we should stay awake.  So yes,
>> implementing a dma_fence_wait_killable that handles all of that
>> correctly sounds like a very good idea.
>
> I am not clear here - could you be more specific about what races will happen
> here, more bellow
>>
>> Eric
>>
>>
>>>> If the ring is hanging for some reason allow to recover the waiting by 
>>>> sending fatal signal.
>>>>
>>>> Originally-by: David Panariti <david.panar...@amd.com>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>> ---
>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
>>>>1 file changed, 10 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> index eb80edf..37a36af 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx 
>>>> *ctx, unsigned ring_id)
>>>>
>>>>if (other) {
>>>>signed long r;
>>>> - r = dma_fence_wait_timeout(other, false, 
>>>> MAX_SCHEDULE_TIMEOUT);
>>>> - if (r < 0) {
>>>> - DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>>>> - return r;
>>>> +
>>>> + while (true) {
>>>> + if ((r = dma_fence_wait_timeout(other, true,
>>>> + MAX_SCHEDULE_TIMEOUT)) >= 0)
>>>> + return 0;
>>>> +
>
> Do you mean that by the time I reach here some other thread from my group
> already might dequeued SIGKILL since it's a shared signal and hence
> fatal_signal_pending will return false ? Or are you talking about the
> dma_fence_wait_timeout implementation in dma_fence_default_wait with
> schedule_timeout ?

Given Oleg's earlier comment about the scheduler having special cases
for signals I might be wrong.  But in general there is a pattern:

for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);
        if (loop_is_done())
                break;
        schedule();
}
set_current_state(TASK_RUNNING);

If you violate that pattern by testing for a condition without
having first set your task to TASK_UNINTERRUPTIBLE (or whatever your
sleep state is), then it is possible to miss a wake-up that arrives
between testing the condition and going to sleep.

Thus I am quite concerned that there is a subtle corner case where
you can miss a wakeup and not retest fatal_signal_pending().

Given that there is a timeout, the worst case might have you sleep
MAX_SCHEDULE_TIMEOUT instead of indefinitely.

Without a comment explaining why this is safe, or having the
fatal_signal_pending check integrated into dma_fence_wait_timeout, I am
not comfortable with this loop.
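
For illustration, here is a sketch of what a fatal-signal-aware wait
could look like.  This is not an existing dma-buf API; it just follows
the usual callback-plus-sleep pattern, setting the sleep state before
re-testing the conditions so that a wake-up cannot be lost.

#include <linux/dma-fence.h>
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>

struct wait_cb {
        struct dma_fence_cb base;
        struct task_struct *task;
};

static void wake_waiter(struct dma_fence *fence, struct dma_fence_cb *cb)
{
        wake_up_process(container_of(cb, struct wait_cb, base)->task);
}

/* Hypothetical dma_fence_wait_killable(): only fatal signals interrupt. */
static long dma_fence_wait_killable_sketch(struct dma_fence *fence)
{
        struct wait_cb cb = { .task = current };
        long ret = 0;

        if (dma_fence_add_callback(fence, &cb.base, wake_waiter))
                return 0;       /* fence was already signaled */

        for (;;) {
                /* Set the sleep state before re-testing the conditions so a
                 * wake-up between the tests and schedule() is not lost. */
                set_current_state(TASK_KILLABLE);
                if (dma_fence_is_signaled(fence))
                        break;
                if (fatal_signal_pending(current)) {
                        ret = -ERESTARTSYS;
                        break;
                }
                schedule();
        }
        __set_current_state(TASK_RUNNING);
        dma_fence_remove_callback(fence, &cb.base);
        return ret;
}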

Eric


>>>> + if (fatal_signal_pending(current)) {
>>>> + DRM_ERROR("Error (%ld) waiting for 
>>>> fence!\n", r);
>>>> + return r;
>>>> + }
>>>>}
>>>>}
>>>>
>>>> --
>>>> 2.7.4
>>>>
>> Eric


Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process.

2018-04-25 Thread Eric W. Biederman
Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:

> On 04/25/2018 11:29 AM, Eric W. Biederman wrote:
>
>>  Another issue is changing wait_event_killable to wait_event_timeout where I 
>> need
>> to understand
>> what TO value is acceptable for all the drivers using the scheduler, or 
>> maybe it
>> should come as a property
>> of drm_sched_entity.
>>
>> It would not surprise me if you could pick a large value like 1 second
>> and issue a warning if that timeout ever triggers.  It sounds like the
>> condition where we wait indefinitely today is because something went
>> wrong in the driver.
>
> We wait here for all GPU jobs in flight which belong to the dying entity to 
> complete. The driver submits
> the GPU jobs but the content of the job might be is not under driver's 
> control and could take 
> long time to finish or even hang (e.g. graphic or compute shader) , I
> guess that why originally the wait is indefinite.


I am ignorant of what user space expects or what the semantics of the
subsystem are here, so I might be completely off base.  But this
wait-for-a-long-time behavior I would expect much more from an f_op->flush
or f_op->fsync method.

fsync so it could be obtained without closing the file descriptor.
flush so that you could get a return value out to close.

But I honestly don't know semantically what your userspace applications
expect and/or require, so all I can really say is that those are weird
semantics.

Eric


Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process.

2018-04-25 Thread Eric W. Biederman
Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:

> On 04/25/2018 03:14 AM, Daniel Vetter wrote:
>> On Tue, Apr 24, 2018 at 05:37:08PM -0400, Andrey Grodzovsky wrote:
>>>
>>> On 04/24/2018 05:21 PM, Eric W. Biederman wrote:
>>>> Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:
>>>>
>>>>> On 04/24/2018 03:44 PM, Daniel Vetter wrote:
>>>>>> On Tue, Apr 24, 2018 at 05:46:52PM +0200, Michel Dänzer wrote:
>>>>>>> Adding the dri-devel list, since this is driver independent code.
>>>>>>>
>>>>>>>
>>>>>>> On 2018-04-24 05:30 PM, Andrey Grodzovsky wrote:
>>>>>>>> Avoid calling wait_event_killable when you are possibly being called
>>>>>>>> from get_signal routine since in that case you end up in a deadlock
>>>>>>>> where you are alreay blocked in singla processing any trying to wait
>>>>>>> Multiple typos here, "[...] already blocked in signal processing and 
>>>>>>> [...]"?
>>>>>>>
>>>>>>>
>>>>>>>> on a new signal.
>>>>>>>>
>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>>>>> ---
>>>>>>>> drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 +++--
>>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c 
>>>>>>>> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>>> index 088ff2b..09fd258 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>>> @@ -227,9 +227,10 @@ void drm_sched_entity_do_release(struct 
>>>>>>>> drm_gpu_scheduler *sched,
>>>>>>>>return;
>>>>>>>>/**
>>>>>>>> * The client will not queue more IBs during this fini, consume 
>>>>>>>> existing
>>>>>>>> -   * queued IBs or discard them on SIGKILL
>>>>>>>> +   * queued IBs or discard them when in death signal state since
>>>>>>>> +   * wait_event_killable can't receive signals in that state.
>>>>>>>>*/
>>>>>>>> -  if ((current->flags & PF_SIGNALED) && current->exit_code == 
>>>>>>>> SIGKILL)
>>>>>>>> +  if (current->flags & PF_SIGNALED)
>>>>>> You want fatal_signal_pending() here, instead of inventing your own 
>>>>>> broken
>>>>>> version.
>>>>> I rely on current->flags & PF_SIGNALED because this being set from
>>>>> within get_signal,
>>>> It doesn't mean that.  Unless you are called by do_coredump (you
>>>> aren't).
>>> Looking in latest code here
>>> https://elixir.bootlin.com/linux/v4.17-rc2/source/kernel/signal.c#L2449
>>> i see that current->flags |= PF_SIGNALED; is out side of
>>> if (sig_kernel_coredump(signr)) {...} scope
>> Ok I read some more about this, and I guess you go through process exit
>> and then eventually close. But I'm not sure.
>>
>> The code in drm_sched_entity_fini also looks strange: You unpark the
>> scheduler thread before you remove all the IBs. At least from the comment
>> that doesn't sound like what you want to do.
>
> I think it should be safe for the dying scheduler entity since before that (in
> drm_sched_entity_do_release) we set it's runqueue to NULL
> so no new jobs will be dequeued form it by the scheduler thread.
>
>>
>> But in general, PF_SIGNALED is really something deeply internal to the
>> core (used for some book-keeping and accounting). The drm scheduler is the
>> only thing looking at it, so smells like a layering violation. I suspect
>> (but without knowing what you're actually trying to achive here can't be
>> sure) you want to look at something else.
>>
>> E.g. PF_EXITING seems to be used in a lot more places to cancel stuff
>> that's no longer relevant when a task exits, not PF_SIGNALED. There's the
>> TIF_MEMDIE flag if you're hacking around issues with the oom-killer.
>>
>> This here on the other hand looks really fragile, and probably only does
>> what you want to do by accident.
>> -Daniel
>
> Yes , that what Eric also said and in the V2 patches i will try  to change
> PF_EXITING
>
> Another issue is changing wait_event_killable to wait_event_timeout where I 
> need
> to understand
> what TO value is acceptable for all the drivers using the scheduler, or maybe 
> it
> should come as a property
> of drm_sched_entity.

It would not surprise me if you could pick a large value like 1 second
and issue a warning if that timeout ever triggers.  It sounds like the
condition where we wait indefinitely today is because something went
wrong in the driver.
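
Concretely, that could look something like the sketch below, where
drm_sched_entity_is_idle() is an assumed helper and this is not the
actual drm_sched_entity_fini code.

#include <linux/bug.h>
#include <linux/jiffies.h>
#include <linux/wait.h>
#include <drm/gpu_scheduler.h>

/* Bound the drain at one second and complain loudly if it ever expires,
 * since expiry would indicate a driver problem rather than normal use. */
static void example_entity_drain(struct drm_gpu_scheduler *sched,
                                 struct drm_sched_entity *entity)
{
        if (!wait_event_timeout(sched->job_scheduled,
                                drm_sched_entity_is_idle(entity),
                                HZ))
                WARN(1, "entity %p still has queued jobs after 1s\n", entity);
}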

Eric


Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process.

2018-04-24 Thread Eric W. Biederman
Andrey Grodzovsky  writes:

> On 04/24/2018 03:44 PM, Daniel Vetter wrote:
>> On Tue, Apr 24, 2018 at 05:46:52PM +0200, Michel Dänzer wrote:
>>> Adding the dri-devel list, since this is driver independent code.
>>>
>>>
>>> On 2018-04-24 05:30 PM, Andrey Grodzovsky wrote:
 Avoid calling wait_event_killable when you are possibly being called
 from get_signal routine since in that case you end up in a deadlock
 where you are alreay blocked in singla processing any trying to wait
>>> Multiple typos here, "[...] already blocked in signal processing and [...]"?
>>>
>>>
 on a new signal.

 Signed-off-by: Andrey Grodzovsky 
 ---
   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 +++--
   1 file changed, 3 insertions(+), 2 deletions(-)

 diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c 
 b/drivers/gpu/drm/scheduler/gpu_scheduler.c
 index 088ff2b..09fd258 100644
 --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
 +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
 @@ -227,9 +227,10 @@ void drm_sched_entity_do_release(struct 
 drm_gpu_scheduler *sched,
return;
/**
 * The client will not queue more IBs during this fini, consume existing
 -   * queued IBs or discard them on SIGKILL
 +   * queued IBs or discard them when in death signal state since
 +   * wait_event_killable can't receive signals in that state.
*/
 -  if ((current->flags & PF_SIGNALED) && current->exit_code == SIGKILL)
 +  if (current->flags & PF_SIGNALED)
>> You want fatal_signal_pending() here, instead of inventing your own broken
>> version.
>
> I rely on current->flags & PF_SIGNALED because this being set from
> within get_signal,

It doesn't mean that.  Unless you are called by do_coredump (you
aren't).  The closing of files does not happen in do_coredump.
Which means you are being called from do_exit.
In fact you are being called after exit_files which closes
the files.  The actual __fput processing happens in task_work_run.

> meaning I am within signal processing  in which case I want to avoid
> any signal based wait for that task,
> From what i see in the code, task_struct.pending.signal is being set
> for other threads in same
> group (zap_other_threads) or for other scenarios, those task are still
> able to receive signals
> so calling wait_event_killable there will not have problem.

Except that you are being called from do_exit, after exit_files, which
is after exit_signals.  Which means that PF_EXITING has been set.
Which implies that the kernel signal handling machinery has already
started being torn down.

Not as much as I would like to happen at that point as we are still
left with some old CLONE_PTHREAD messes in the code that need to be
cleaned up.

Still, given the fact that you are in task_work_run, it is quite possible
that even release_task has been run on that task before the f_op->release
method is called.  So you simply cannot count on signals working.

Which in practice leaves a timeout for ending your wait.  That code can
legitimately be in a context that is neither interruptible nor killable.

entity->fini_status = -ERESTARTSYS;
else
entity->fini_status = wait_event_killable(sched->job_scheduled,
>> But really this smells like a bug in wait_event_killable, since
>> wait_event_interruptible does not suffer from the same bug. It will return
>> immediately when there's a signal pending.
>
> Even when wait_event_interruptible is called as following - 
> ...->do_signal->get_signal->->wait_event_interruptible ?
> I haven't tried it but wait_event_interruptible is very much alike to
> wait_event_killable so I would assume it will also
> not be interrupted if called like that. (Will give it a try just out
> of curiosity anyway)

As PF_EXITING is set, wants_signal should fail and the signal state of
the task should not be updatable by signals.

Eric




Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process.

2018-04-24 Thread Eric W. Biederman
Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:

> On 04/24/2018 05:21 PM, Eric W. Biederman wrote:
>> Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:
>>
>>> On 04/24/2018 03:44 PM, Daniel Vetter wrote:
>>>> On Tue, Apr 24, 2018 at 05:46:52PM +0200, Michel Dänzer wrote:
>>>>> Adding the dri-devel list, since this is driver independent code.
>>>>>
>>>>>
>>>>> On 2018-04-24 05:30 PM, Andrey Grodzovsky wrote:
>>>>>> Avoid calling wait_event_killable when you are possibly being called
>>>>>> from get_signal routine since in that case you end up in a deadlock
>>>>>> where you are alreay blocked in singla processing any trying to wait
>>>>> Multiple typos here, "[...] already blocked in signal processing and 
>>>>> [...]"?
>>>>>
>>>>>
>>>>>> on a new signal.
>>>>>>
>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>>> ---
>>>>>>drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 +++--
>>>>>>1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c 
>>>>>> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>> index 088ff2b..09fd258 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>> @@ -227,9 +227,10 @@ void drm_sched_entity_do_release(struct 
>>>>>> drm_gpu_scheduler *sched,
>>>>>>  return;
>>>>>>  /**
>>>>>>   * The client will not queue more IBs during this fini, consume 
>>>>>> existing
>>>>>> - * queued IBs or discard them on SIGKILL
>>>>>> + * queued IBs or discard them when in death signal state since
>>>>>> + * wait_event_killable can't receive signals in that state.
>>>>>>  */
>>>>>> -if ((current->flags & PF_SIGNALED) && current->exit_code == 
>>>>>> SIGKILL)
>>>>>> +if (current->flags & PF_SIGNALED)
>>>> You want fatal_signal_pending() here, instead of inventing your own broken
>>>> version.
>>> I rely on current->flags & PF_SIGNALED because this being set from
>>> within get_signal,
>> It doesn't mean that.  Unless you are called by do_coredump (you
>> aren't).
>
> Looking in latest code here
> https://elixir.bootlin.com/linux/v4.17-rc2/source/kernel/signal.c#L2449
> i see that current->flags |= PF_SIGNALED; is out side of
> if (sig_kernel_coredump(signr)) {...} scope

In small words.  You showed me the backtrace and I have read
the code.

PF_SIGNALED means you got killed by a signal.
get_signal
   do_coredump
   do_group_exit
      do_exit
         exit_signals
            sets PF_EXITING
         exit_mm
            calls fput on mmaps
               calls sched_task_work
         exit_files
            calls fput on open files
               calls sched_task_work
         exit_task_work
            task_work_run
               /* you are here */

So while strictly speaking you are inside of get_signal, it is not
meaningful to speak of yourself as being within get_signal.

I am a little surprised to see task_work_run called so early.
I was mostly expecting it to happen when the dead task was
scheduling away, like normally happens.

Testing for PF_SIGNALED does not give you anything at all
that testing for PF_EXITING (the flag that says signal handling
has been shut down) does not get you.

There is no point in distinguishing PF_SIGNALED from any other
path to do_exit.  do_exit never returns.

The task is dead.

Blocking indefinitely while shutting down a task is a bad idea.
Blocking indefinitely while closing a file descriptor is a bad idea.

The task has been killed; it can't get any more dead.  SIGKILL is
meaningless at this point.

So you need a timeout, or not to wait at all.


Eric


Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process.

2018-04-24 Thread Eric W. Biederman
Daniel Vetter  writes:

> On Tue, Apr 24, 2018 at 05:46:52PM +0200, Michel Dänzer wrote:
>> 
>> Adding the dri-devel list, since this is driver independent code.
>> 
>> 
>> On 2018-04-24 05:30 PM, Andrey Grodzovsky wrote:
>> > Avoid calling wait_event_killable when you are possibly being called
>> > from get_signal routine since in that case you end up in a deadlock
>> > where you are alreay blocked in singla processing any trying to wait
>> 
>> Multiple typos here, "[...] already blocked in signal processing and [...]"?
>> 
>> 
>> > on a new signal.
>> > 
>> > Signed-off-by: Andrey Grodzovsky 
>> > ---
>> >  drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 +++--
>> >  1 file changed, 3 insertions(+), 2 deletions(-)
>> > 
>> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c 
>> > b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>> > index 088ff2b..09fd258 100644
>> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>> > @@ -227,9 +227,10 @@ void drm_sched_entity_do_release(struct 
>> > drm_gpu_scheduler *sched,
>> >return;
>> >/**
>> > * The client will not queue more IBs during this fini, consume existing
>> > -   * queued IBs or discard them on SIGKILL
>> > +   * queued IBs or discard them when in death signal state since
>> > +   * wait_event_killable can't receive signals in that state.
>> >*/
>> > -  if ((current->flags & PF_SIGNALED) && current->exit_code == SIGKILL)
>> > +  if (current->flags & PF_SIGNALED)
>
> You want fatal_signal_pending() here, instead of inventing your own broken
> version.
>> >entity->fini_status = -ERESTARTSYS;
>> >else
>> >entity->fini_status = wait_event_killable(sched->job_scheduled,
>
> But really this smells like a bug in wait_event_killable, since
> wait_event_interruptible does not suffer from the same bug. It will return
> immediately when there's a signal pending.
>
> I think this should be fixed in core code, not papered over in some
> subsystem.

PF_SIGNALED does not mean a signal has been sent.  PF_SIGNALED means
the process was killed by a signal.

Neither interruptible nor killable makes sense after the process has
been killed.

Eric


Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process.

2018-04-24 Thread Eric W. Biederman
Andrey Grodzovsky  writes:

> Avoid calling wait_event_killable when you are possibly being called
> from get_signal routine since in that case you end up in a deadlock
> where you are alreay blocked in singla processing any trying to wait
> on a new signal.

I am curious what call path is problematic here.

In general waiting seems wrong when the process has already been
fatally killed as indicated by PF_SIGNALED.

Returning -ERESTARTSYS seems wrong as nothing should make it back even
to the edge of userspace here.

Given that this is the only use of PF_SIGNALED outside of bsd process
accounting I find this code very suspicious.

It looks like the code path that gets called during exit is buggy and
needs to be sorted out.

Eric


> Signed-off-by: Andrey Grodzovsky 
> ---
>  drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c 
> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> index 088ff2b..09fd258 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> @@ -227,9 +227,10 @@ void drm_sched_entity_do_release(struct 
> drm_gpu_scheduler *sched,
>   return;
>   /**
>* The client will not queue more IBs during this fini, consume existing
> -  * queued IBs or discard them on SIGKILL
> +  * queued IBs or discard them when in death signal state since
> +  * wait_event_killable can't receive signals in that state.
>   */
> - if ((current->flags & PF_SIGNALED) && current->exit_code == SIGKILL)
> + if (current->flags & PF_SIGNALED)
>   entity->fini_status = -ERESTARTSYS;
>   else
>   entity->fini_status = wait_event_killable(sched->job_scheduled,


Re: [PATCH 1/3] signals: Allow generation of SIGKILL to exiting task.

2018-04-24 Thread Eric W. Biederman
Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:

> On 04/24/2018 12:42 PM, Eric W. Biederman wrote:
>> Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:
>>
>>> Currently calling wait_event_killable as part of exiting process
>>> will stall forever since SIGKILL generation is suppresed by PF_EXITING.
>>>
>>> In our partilaur case AMDGPU driver wants to flush all GPU jobs in
>>> flight before shutting down. But if some job hangs the pipe we still want to
>>> be able to kill it and avoid a process in D state.
>> I should clarify.  This absolutely can not be done.
>> PF_EXITING is set just before a task starts tearing down its signal
>> handling.
>>
>> So delivering any signal, or otherwise depending on signal handling
>> after PF_EXITING is set can not be done.  That abstraction is gone.
>
> I see, so you suggest it's the driver responsibility to avoid creating
> such code path that ends up
> calling wait_event_killable from exit call stack (PF_EXITING == 1) ?

I don't just suggest.

I am saying clearly that any dependency on receiving SIGKILL after
PF_EXITING is set is a bug.

It looks safe (the bitmap is not freed) to use wait_event_killable on a
dual-use code path, but you can't expect SIGKILL ever to be delivered
during f_op->release, as f_op->release is called from exit after signal
handling has been shut down.

The best generic code could do would be to always have
fatal_signal_pending return true after PF_EXITING is set.
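
In other words, something along these lines.  This is a sketch of the
idea only, not a proposed patch; the real fatal_signal_pending lives in
include/linux/sched/signal.h.

#include <linux/sched.h>
#include <linux/sched/signal.h>

/* Sketch: treat an exiting task as if a fatal signal were pending, so
 * that killable waits cannot hang once signal handling is torn down. */
static inline int fatal_signal_pending_sketch(struct task_struct *p)
{
        return (p->flags & PF_EXITING) ||
               (signal_pending(p) && __fatal_signal_pending(p));
}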

Increasingly I am thinking that drm_sched_entity_fini should have a
wait_event_timeout or no wait at all.  The cleanup code should have
a progress guarantee of its own.

Eric


Re: [PATCH 1/3] signals: Allow generation of SIGKILL to exiting task.

2018-04-24 Thread Eric W. Biederman
Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:

> Currently calling wait_event_killable as part of exiting process
> will stall forever since SIGKILL generation is suppresed by PF_EXITING.
>
> In our partilaur case AMDGPU driver wants to flush all GPU jobs in
> flight before shutting down. But if some job hangs the pipe we still want to
> be able to kill it and avoid a process in D state.

This makes me profoundly uncomfortable.  You are changing the Linux
semantics of what it means for a process to be exiting.  Functionally
this may require all kinds of changes to when we allow processes to stop
processing signals.

So without a really well thought out explanation that takes into account
all of the issues involved in process exiting and POSIX conformance:

Nacked-by: "Eric W. Biederman" <ebied...@xmission.com>

Eric

> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
> ---
>  kernel/signal.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index c6e4c83..c49c706 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -886,10 +886,10 @@ static inline int wants_signal(int sig, struct 
> task_struct *p)
>  {
>   if (sigismember(&p->blocked, sig))
>   return 0;
> - if (p->flags & PF_EXITING)
> - return 0;
>   if (sig == SIGKILL)
>   return 1;
> + if (p->flags & PF_EXITING)
> + return 0;
>   if (task_is_stopped_or_traced(p))
>   return 0;
>   return task_curr(p) || !signal_pending(p);


Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-24 Thread Eric W. Biederman
"Panariti, David"  writes:

> Andrey Grodzovsky  writes:
>> Kind of dma_fence_wait_killable, except that we don't have such API
>> (maybe worth adding ?)
> Depends on how many places it would be called, or think it might be called.  
> Can always factor on the 2nd time it's needed.
> Factoring, IMO, rarely hurts.  The factored function can easily be visited 
> using `M-.' ;->
>
> Also, if the wait could be very long, would a log message, something like 
> "xxx has run for Y seconds."  help?
> I personally hate hanging w/no info.

Ugh.  This loop appears susceptible to losing wake-ups.  There are
races between when a wake-up happens, when we clear the sleeping state,
and when we test the state to see if we should stay awake.  So yes,
implementing a dma_fence_wait_killable that handles all of that
correctly sounds like a very good idea.

Eric


>> If the ring is hanging for some reason allow to recover the waiting by 
>> sending fatal signal.
>>
>> Originally-by: David Panariti 
>> Signed-off-by: Andrey Grodzovsky 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
>>   1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index eb80edf..37a36af 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, 
>> unsigned ring_id)
>>
>>   if (other) {
>>   signed long r;
>> - r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
>> - if (r < 0) {
>> - DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>> - return r;
>> +
>> + while (true) {
>> + if ((r = dma_fence_wait_timeout(other, true,
>> + MAX_SCHEDULE_TIMEOUT)) >= 0)
>> + return 0;
>> +
>> + if (fatal_signal_pending(current)) {
>> + DRM_ERROR("Error (%ld) waiting for fence!\n", 
>> r);
>> + return r;
>> + }
>>   }
>>   }
>>
>> --
>> 2.7.4
>>
Eric


Re: [PATCH 1/3] signals: Allow generation of SIGKILL to exiting task.

2018-04-24 Thread Eric W. Biederman
Andrey Grodzovsky  writes:

> Currently calling wait_event_killable as part of exiting process
> will stall forever since SIGKILL generation is suppresed by PF_EXITING.
>
> In our partilaur case AMDGPU driver wants to flush all GPU jobs in
> flight before shutting down. But if some job hangs the pipe we still want to
> be able to kill it and avoid a process in D state.

I should clarify.  This absolutely can not be done.
PF_EXITING is set just before a task starts tearing down its signal
handling.

So delivering any signal, or otherwise depending on signal handling
after PF_EXITING is set can not be done.  That abstraction is gone.

Eric

> Signed-off-by: Andrey Grodzovsky 
> ---
>  kernel/signal.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index c6e4c83..c49c706 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -886,10 +886,10 @@ static inline int wants_signal(int sig, struct 
> task_struct *p)
>  {
>   if (sigismember(&p->blocked, sig))
>   return 0;
> - if (p->flags & PF_EXITING)
> - return 0;
>   if (sig == SIGKILL)
>   return 1;
> + if (p->flags & PF_EXITING)
> + return 0;
>   if (task_is_stopped_or_traced(p))
>   return 0;
>   return task_curr(p) || !signal_pending(p);


Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process.

2018-04-24 Thread Eric W. Biederman
Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:

> On 04/24/2018 12:23 PM, Eric W. Biederman wrote:
>> Andrey Grodzovsky <andrey.grodzov...@amd.com> writes:
>>
>>> Avoid calling wait_event_killable when you are possibly being called
>>> from get_signal routine since in that case you end up in a deadlock
>>> where you are alreay blocked in singla processing any trying to wait
>>> on a new signal.
>> I am curious what call path is problematic here.
>
> Here is the problematic call stack
>
> [<0>] drm_sched_entity_fini+0x10a/0x3a0 [gpu_sched]
> [<0>] amdgpu_ctx_do_release+0x129/0x170 [amdgpu]
> [<0>] amdgpu_ctx_mgr_fini+0xd5/0xe0 [amdgpu]
> [<0>] amdgpu_driver_postclose_kms+0xcd/0x440 [amdgpu]
> [<0>] drm_release+0x414/0x5b0 [drm]
> [<0>] __fput+0x176/0x350
> [<0>] task_work_run+0xa1/0xc0
> [<0>] do_exit+0x48f/0x1280
> [<0>] do_group_exit+0x89/0x140
> [<0>] get_signal+0x375/0x8f0
> [<0>] do_signal+0x79/0xaa0
> [<0>] exit_to_usermode_loop+0x83/0xd0
> [<0>] do_syscall_64+0x244/0x270
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<0>] 0x
>
> On exit from system call you process all the signals you received and
> encounter a fatal signal which triggers process termination.
>
>>
>> In general waiting seems wrong when the process has already been
>> fatally killed as indicated by PF_SIGNALED.
>
> So indeed this patch avoids wait in this case.
>
>>
>> Returning -ERESTARTSYS seems wrong as nothing should make it back even
>> to the edge of userspace here.
>
> Can you clarify please - what should be returned here instead ?

__fput does not have a return code.  I don't see the return code of
release being used anywhere.  So any return code is going to be lost.
So maybe return something that talks to the drm/kernel layer, but don't
expect your system call to be restarted, which is what -ERESTARTSYS asks for.

Hmm.  When looking at the code that is merged versus whatever your
patch is against it gets even clearer.  The -ERESTARTSYS
return code doesn't even get out of drm_sched_entity_fini.

Caring at all about process state at that point is wrong: except for
being in ``process'' context where you can sleep, nothing there is
connected to a process.

Let me respectfully suggest that the wait_event_killable on that code
path is wrong.  Possibly you want a wait_event_timeout if you are very
nice.  But the code already has the logic necessary to handle what
happens if it can't sleep.

So I think the justification needs to be why you are trying to sleep
there at all.

The progress guarantee needs to come from the GPU layer or the AMD
driver, not from someone getting impatient and sending SIGKILL to
a dead process.


Eric


>>
>> Given that this is the only use of PF_SIGNALED outside of bsd process
>> accounting I find this code very suspicious.
>>
>> It looks like the code path that gets called during exit is buggy and
>> needs to be sorted out.
>>
>> Eric
>>
>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>> ---
>>>   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 +++--
>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c 
>>> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> index 088ff2b..09fd258 100644
>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> @@ -227,9 +227,10 @@ void drm_sched_entity_do_release(struct 
>>> drm_gpu_scheduler *sched,
>>> return;
>>> /**
>>>  * The client will not queue more IBs during this fini, consume existing
>>> -* queued IBs or discard them on SIGKILL
>>> +* queued IBs or discard them when in death signal state since
>>> +* wait_event_killable can't receive signals in that state.
>>> */
>>> -   if ((current->flags & PF_SIGNALED) && current->exit_code == SIGKILL)
>>> +   if (current->flags & PF_SIGNALED)
>>> entity->fini_status = -ERESTARTSYS;
>>> else
>>> entity->fini_status = wait_event_killable(sched->job_scheduled,


Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-24 Thread Eric W. Biederman
Andrey Grodzovsky  writes:

> If the ring is hanging for some reason allow to recover the waiting
> by sending fatal signal.
>
> Originally-by: David Panariti 
> Signed-off-by: Andrey Grodzovsky 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index eb80edf..37a36af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, 
> unsigned ring_id)
>  
>   if (other) {
>   signed long r;
> - r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
> - if (r < 0) {
> - DRM_ERROR("Error (%ld) waiting for fence!\n", r);
> - return r;
> +
> + while (true) {
> + if ((r = dma_fence_wait_timeout(other, true,
> + MAX_SCHEDULE_TIMEOUT)) >= 0)
> + return 0;
> +
> + if (fatal_signal_pending(current)) {
> + DRM_ERROR("Error (%ld) waiting for fence!\n", 
> r);
> + return r;
> + }

It looks like if you make this code say:
        if (fatal_signal_pending(current) ||
            (current->flags & PF_EXITING)) {
                DRM_ERROR("Error (%ld) waiting for fence!\n", r);
                return r;
>   }
>   }

Then you would not need the horrible hack in wants_signal to deliver
signals to processes that have passed exit_signals() and don't expect to
need their signal handling mechanisms anymore.

Eric
