On Thu, Apr 20, 2017 at 2:18 PM, Robert Foss <[email protected]> wrote:
> Hi Björn,
>
> Thanks for the thorough and explicit feedback, it was rather helpful.
>
>
> On 2017-04-20 04:32 AM, Bjørn Mork wrote:
>>
>> Hello Robert,
>>
>> Sorry for being much too late here, but during recent attemts to debug
>> issues caused by my commit 833415a3e781 ("cdc-wdm: fix "out-of-sync" due
>> to missing notifications") I believe I found a couple of issues with
>> commit c1da59dad0eb. At least one of them is serious (potentional
>> GPF_KERNEL allocation in interrupt context).
>>
>> But I don't know if I understand the problem the commit set out to
>> solves, which is why I have no proposed patch here.
>
>
>
> The description of the situation that I received was then one
> from the commit message:
>
> "In case of a read error, userspace may not actually read the data, since
> the
> driver returns an error through wdm_poll. After this, the underlying device
> may
> attempt to send us more data, but the queue is not processed. While
> userspace is
> also blocked, because the read error is never cleared."
>
> So the way I interpret it is that after an error condition the device can
> get prevented from sending data, which in turn can cause userspace to not
> issue a read (which would have cleared the error).
>
> I could certainly be wrong in this interpretation or be misunderstanding
> something.
>
> Prathmesh, Guenter: Do you have any comments?
>
I would not want to claim to fully understand the problem, but from
Bjørn's comments it really looks like the patch should be reverted.
Thanks,
Guenter
>>
>>
>>> @@ -189,7 +192,13 @@ static void wdm_in_callback(struct urb *urb)
>>> }
>>> }
>>>
>>> - desc->rerr = status;
>>> + /*
>>> + * only set a new error if there is no previous error.
>>> + * Errors are only cleared during read/open
>>> + */
>>> + if (desc->rerr == 0)
>>> + desc->rerr = status;
>>> +
>>> if (length + desc->length > desc->wMaxCommand) {
>>> /* The buffer would overflow */
>>> set_bit(WDM_OVERFLOW, &desc->flags);
>>>
>>
>>
>> This is minor, but in my opinion this change is flawed. It changes the
>> meaning of "desc->rerr" from the original "last status code" to the new
>> "first error since last userspace read".
>>
>> wdm_read will only return and clear such errors if desc->length == 0 on
>> entry. This is up to the userspace application to decide. I claim that
>> any error should either be reported immediately or not at all. Caching
>> errors forever like this, and then reporting them to userspace at some
>> arbitrary later event, is pointless.
>>
>> The existing code did the correct thing in the absence of proper event
>> queing. The only real alternative would be to completely stop
>> processing device events until userspace has seen the error. But that
>> would just make things fail completely for no good reason.
>>
>> We should probably consider making a queue/list of the read buffers and
>> keep each status code with the associated list element. Then we could
>> report back the error when userspace reads the element that caused it.
>
>
> I see your point about changing the semantics of desc->rerr, and agree that
> it probably isn't desirable.
>
> So an event queue is starting to look like the way forward.
>
>>
>>> @@ -221,9 +230,19 @@ static void wdm_in_callback(struct urb *urb)
>>> }
>>>
>>> skip_error:
>>> + set_bit(WDM_READ, &desc->flags);
>>> wake_up(&desc->wait);
>>>
>>> - set_bit(WDM_READ, &desc->flags);
>>> + if (desc->rerr) {
>>> + /*
>>> + * Since there was an error, userspace may decide to not
>>> read
>>> + * any data after poll'ing.
>>> + * We should respond to further attempts from the device
>>> to send
>>> + * data, so that we can get unstuck.
>>> + */
>>> + service_outstanding_interrupt(desc);
>>> + }
>>> +
>>> unlock:
>>> spin_unlock(&desc->iuspin);
>>> }
>>
>>
>> We cannot do this. This is an URB callback, and
>> service_outstanding_interrupt()
>> calls usb_submit_urb(...,GFP_KERNEL).
>>
>> This issue could of course be avoided by simply changing the allocation
>> flags. But looking at the comment here, I am pretty sure that this is
>> the wrong solution to the unanticipated userspace behaviour. If
>> userspace decides not to read() if poll() sets POLLERR, then it *should
>> not* expect the error condition to be cleared. Or am I missing something?
>
>
> The allocation flags aside, calling poll() and it returning POLLERR should
> not clear the error, so the behavior of the
>
>>
>> In any case: This will not unstick anything, given the previous problem
>> with desc->rerr not being updated unless userspace reads. You will just
>> flush the device queue, but the error condition is still there. So
>> poll() will still set POLLERR. If the comment is correct, this is a
>> permanent deadlock.
>>
>> userspace can avoid tge issue by reading the descriptor on POLLERR. If
>> that is not the way this is expected to work (I am by no means a POSIX
>> expert), then I believe the correct fix must be to change wdm_poll(),
>> either to clear the error condition or to not set POLLERR in this case.
>
>
> So, in lieu of an actual event queue, maybe this patch should be reverted.
> Thoughts?
>
>
> Rob.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html