On 08/31/2017 11:19 AM, Cornelia Huck wrote:
> On Wed, 30 Aug 2017 18:36:02 +0200
> Halil Pasic <pa...@linux.vnet.ibm.com> wrote:
> 
>> According to the POP a start subchannel instruction (SSCH) returning with
>> cc 1 implies that the subchannel was status pending when SSCH executed.
>>
>> Due to a somewhat confusing error handling, where error codes are mapped
>> to cc value, sane looking error codes result in non AR compliant
>> behavior.
>>
>> Let's fix this! Instead of cc 1 we use cc 3 which means device not
>> operational, and is much closer to the truth in the given cases.
>>
>> Signed-off-by: Halil Pasic <pa...@linux.vnet.ibm.com>
>> Acked-by: Pierre Morel<pmo...@linux.vnet.ibm.com>
>> ---
>>
>> This patch turned out quite controversial. We did not reach a consensus
>> during the internal review.
>>
>> The most of the discussion revolved around the ORB flag which
>> architecturally must be supported, but are currently not supported by
>> vfio-ccw (not yet, or can't be). The idea showing the most promise for
>> consensus was to handle this via device status (along the lines better a
>> strange acting device than a non-conform machine) but since it's a
>> radical change we decided to first discuss upstream and then do whatever
>> needs to be done.
>> ---
>>  hw/s390x/css.c      | 15 ++++++---------
>>  hw/s390x/s390-ccw.c |  2 +-
>>  2 files changed, 7 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/s390x/css.c b/hw/s390x/css.c
>> index a50fb0727e..0822538cde 100644
>> --- a/hw/s390x/css.c
>> +++ b/hw/s390x/css.c
>> @@ -1034,7 +1034,7 @@ static int sch_handle_start_func_passthrough(SubchDev 
>> *sch)
>>       */
>>      if (!(orb->ctrl0 & ORB_CTRL0_MASK_PFCH) ||
>>          !(orb->ctrl0 & ORB_CTRL0_MASK_C64)) {
>> -        return -EINVAL;
>> +        return -ENODEV;
> 
> This feels wrong. If we don't support this yet, doing something like a
> channel-program check or an operand exception feels closer to the
> architecture than indicating a gone device.

I disagree, a channel-program check or an operand exception, or cc 1
(current solution) makes the machine obviously non-conform.

My train of thought was that architecturally you can loose connection
to the device at any time (you can't prohibit admins pulling cables or
smashing equipment with a 10kg hammer).

Also from the guest OS perspective I think saying device not operational
could provoke a proper reaction form the guest OS: that is just give up
on the device. The things you propose would in my opinion put the blame
on the guest OS driver (making non-conform requests) so in that case
it would make sense to give up on the driver (but the same driver could
wonderfully work with let's say a fully emulated device).

As I have stated in the cover letter of this patch, I would find
setting device status even better, but I wanted to discuss first
before going from setting cc to something else.

Setting cc was not my idea in the first place (AFAIK the -EINVAL
here effectively triggers cc 1).

> 
>>      }
>>  
>>      ret = s390_ccw_cmd_request(orb, s, sch->driver_data);
>> @@ -1046,16 +1046,13 @@ static int 
>> sch_handle_start_func_passthrough(SubchDev *sch)
>>          break;
>>      case -ENODEV:
>>          break;
>> +    case -EFAULT:
>> +         break;
>>      case -EACCES:
>>          /* Let's reflect an inaccessible host device by cc 3. */
>> -        ret = -ENODEV;
>> -        break;
>>      default:
>> -       /*
>> -        * All other return codes will trigger a program check,
>> -        * or set cc to 1.
>> -        */
>> -       break;
>> +        /* Let's make all other return codes map to cc 3.  */
>> +        ret = -ENODEV;
> 
> Why? This feels wrong. For those cases where we want to signal an error
> but cc 1 is conceptually wrong, either an operand exception (for very
> few cases) or a channel-program check feels more in line with the
> architecture.

You mean the original code feels wrong, or? I keep the program check for
-EFAULT (that's why it's added) and just change cc 1 to cc 3 for the
not explicitly handled error codes (reason stated in the commit message).

> 
> That's a general problem with doing stuff in the hypervisor: We have
> sets of internal problems that obviously don't show up in the
> architecture, and we can either handle them internally or use what the
> architecture offers for problem signaling. z/VM has probably faced the
> same problems :)

I agree.

> 
>>      };
>>  
>>      return ret;
>> @@ -1115,7 +1112,7 @@ static int do_subchannel_work(SubchDev *sch)
>>      if (sch->do_subchannel_work) {
>>          return sch->do_subchannel_work(sch);
>>      } else {
>> -        return -EINVAL;
>> +        return -ENODEV;
> 
> This rather seems like a job for an assert? If we don't have a function
> for the 'asynchronous' handling of the various functions assigned for a
> subchannel, that looks like an internal error.
> 

IMHO it depends. Aborting qemu is heavy handed, and as an user I would not
be happy about it. But certainly it is an assert situation.  We can look for
an even better solution, but I think this is an improvement. The logic behind
is that the device is broken and can't be talked to properly.

>>      }
>>  }
>>  
>> diff --git a/hw/s390x/s390-ccw.c b/hw/s390x/s390-ccw.c
>> index 8614dda6f8..2b0741741c 100644
>> --- a/hw/s390x/s390-ccw.c
>> +++ b/hw/s390x/s390-ccw.c
>> @@ -25,7 +25,7 @@ int s390_ccw_cmd_request(ORB *orb, SCSW *scsw, void *data)
>>      if (cdc->handle_request) {
>>          return cdc->handle_request(orb, scsw, data);
>>      } else {
>> -        return -ENOSYS;
>> +        return -ENODEV;
> 
> If we get here, it means that we called a request handler (which is
> only done for the passthrough variety) without having assigned a
> request handler beforehand. This also looks like an internal error to
> me...
> 

Certainly. Again I was not the one who wrote or accepted the original
code. My previous comment about whether assert or not applies here as
well.

>>      }
>>  }
>>  
> 
> 


Reply via email to