Re: [PATCH v6] vfio error recovery: kernel support

2017-04-06 Thread Alex Williamson
On Thu, 6 Apr 2017 16:53:44 +0800
Cao jin  wrote:

> On 04/06/2017 06:36 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 05, 2017 at 04:19:10PM -0600, Alex Williamson wrote:  
> >> On Thu, 6 Apr 2017 00:50:22 +0300
> >> "Michael S. Tsirkin"  wrote:
> >>  
> >>> On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:  
>  The previous intention of trying to handle all sorts of AER faults
>  clearly had more value, though even there the implementation and
>  configuration requirements restricted the practicality.  For instance
>  is AER support actually useful to a customer if it requires all ports
>  of a multifunction device assigned to the VM?  This seems more like a
>  feature targeting whole system partitioning rather than general VM
>  device assignment use cases.  Maybe that's ok, but it should be a clear
>  design decision.
> >>>
> >>> Alex, what kind of testing do you expect to be necessary?
> >>> Would you say testing on real hardware and making it trigger
> >>> AER errors is a requirement?  
> >>
> >> Testing various fatal, non-fatal, and corrected errors with aer-inject,
> >> especially in multfunction configurations (where more than one port
> >> is actually usable) would certainly be required.  If we have cases where
> >> the driver for a companion function can escalate a non-fatal error to a
> >> bus reset, that should be tested, even if it requires temporary hacks to
> >> the host driver for the companion function to trigger that case.  AER
> >> handling is not something that the typical user is going to experience,
> >> so it should to be thoroughly tested to make sure it works when needed
> >> or there's little point to doing it at all.  Thanks,
> >>
> >> Alex  
> > 
> > Some things can be tested within a VM. What would you
> > say would be sufficient on a VM and what has to be
> > tested on bare metal?
> >   
> 
> Does the "bare metal" here mean something like XenServer?

No, bare metal means the non-virtualized host OS.  I think Michael was
trying to facilitate testing by proposing to do it in a VM such that we
can create strange and interesting topologies that aren't bound by a
system in a remote lab having only one NIC port connected.  Thanks,

Alex


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-06 Thread Alex Williamson
On Thu, 6 Apr 2017 16:53:44 +0800
Cao jin  wrote:

> On 04/06/2017 06:36 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 05, 2017 at 04:19:10PM -0600, Alex Williamson wrote:  
> >> On Thu, 6 Apr 2017 00:50:22 +0300
> >> "Michael S. Tsirkin"  wrote:
> >>  
> >>> On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:  
>  The previous intention of trying to handle all sorts of AER faults
>  clearly had more value, though even there the implementation and
>  configuration requirements restricted the practicality.  For instance
>  is AER support actually useful to a customer if it requires all ports
>  of a multifunction device assigned to the VM?  This seems more like a
>  feature targeting whole system partitioning rather than general VM
>  device assignment use cases.  Maybe that's ok, but it should be a clear
>  design decision.
> >>>
> >>> Alex, what kind of testing do you expect to be necessary?
> >>> Would you say testing on real hardware and making it trigger
> >>> AER errors is a requirement?  
> >>
> >> Testing various fatal, non-fatal, and corrected errors with aer-inject,
> >> especially in multfunction configurations (where more than one port
> >> is actually usable) would certainly be required.  If we have cases where
> >> the driver for a companion function can escalate a non-fatal error to a
> >> bus reset, that should be tested, even if it requires temporary hacks to
> >> the host driver for the companion function to trigger that case.  AER
> >> handling is not something that the typical user is going to experience,
> >> so it should to be thoroughly tested to make sure it works when needed
> >> or there's little point to doing it at all.  Thanks,
> >>
> >> Alex  
> > 
> > Some things can be tested within a VM. What would you
> > say would be sufficient on a VM and what has to be
> > tested on bare metal?
> >   
> 
> Does the "bare metal" here mean something like XenServer?

No, bare metal means the non-virtualized host OS.  I think Michael was
trying to facilitate testing by proposing to do it in a VM such that we
can create strange and interesting topologies that aren't bound by a
system in a remote lab having only one NIC port connected.  Thanks,

Alex


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-06 Thread Alex Williamson
On Thu, 6 Apr 2017 16:49:35 +0800
Cao jin  wrote:

> On 04/06/2017 05:56 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 05, 2017 at 04:54:33PM +0800, Cao jin wrote:  
> >> Apparently, I don't have experience to induce non-fatal error, device
> >> error is more of a chance related with the environment(temperature,
> >> humidity, etc) as I understand.  
> > 
> > I'm not sure how to interpret this statement. I think what Alex is
> > saying is simply that patches should include some justification. They
> > make changes but what are they improving?
> > For example:
> > 
> > I tested device ABC in conditions DEF. Without a patch VM
> > stops. With the patches applied VM recovers and proceeds to
> > use the device normally.
> > 
> > is one reasonable justification imho.
> >   
> 
> Got it. But unfortunately, until now, I haven't seen a VM stop caused by
> a real device non-fatal error during device assignment(Only saw real
> fatal errors after start VM).
> On one side, AER error could occur theoretically; on the other side,
> seldom people have seen a VM stop caused by AER. Now I am asked that do
> I have a real evidence or scenario to prove that this patchset is really
> useful? I don't, and we all know it is hard to trigger a real hardware
> error, so, seems I am pushed into the corner.  I guess these questions
> also apply for AER driver's author, if the scenario is easy to
> reproduce, there is no need to write aer_inject to fake errors.

Yes, non-fatal errors are rare, so rare in fact that I wonder if
anything actually implements them.  Could we take a survey of Linux
drivers with AER support and see which ones actually do anything useful
on a non-fatal error?  I know fatal errors exist, I've seen them.  The
situation I want to avoid is adding non-fatal AER support that never
gets used and just gets in the way of the next attempt to support fatal
error recovery.  We're extending the user ABI here, which means that
for any feature we add, we need to consider how we'll support it long
term and how we'll deprecate it or further extend it when the next
level of support comes.  I'm reluctant to add code that only partially
solves a problem and only addresses the side of the problem that we're
not even sure occurs.  Thanks,

Alex


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-06 Thread Alex Williamson
On Thu, 6 Apr 2017 16:49:35 +0800
Cao jin  wrote:

> On 04/06/2017 05:56 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 05, 2017 at 04:54:33PM +0800, Cao jin wrote:  
> >> Apparently, I don't have experience to induce non-fatal error, device
> >> error is more of a chance related with the environment(temperature,
> >> humidity, etc) as I understand.  
> > 
> > I'm not sure how to interpret this statement. I think what Alex is
> > saying is simply that patches should include some justification. They
> > make changes but what are they improving?
> > For example:
> > 
> > I tested device ABC in conditions DEF. Without a patch VM
> > stops. With the patches applied VM recovers and proceeds to
> > use the device normally.
> > 
> > is one reasonable justification imho.
> >   
> 
> Got it. But unfortunately, until now, I haven't seen a VM stop caused by
> a real device non-fatal error during device assignment(Only saw real
> fatal errors after start VM).
> On one side, AER error could occur theoretically; on the other side,
> seldom people have seen a VM stop caused by AER. Now I am asked that do
> I have a real evidence or scenario to prove that this patchset is really
> useful? I don't, and we all know it is hard to trigger a real hardware
> error, so, seems I am pushed into the corner.  I guess these questions
> also apply for AER driver's author, if the scenario is easy to
> reproduce, there is no need to write aer_inject to fake errors.

Yes, non-fatal errors are rare, so rare in fact that I wonder if
anything actually implements them.  Could we take a survey of Linux
drivers with AER support and see which ones actually do anything useful
on a non-fatal error?  I know fatal errors exist, I've seen them.  The
situation I want to avoid is adding non-fatal AER support that never
gets used and just gets in the way of the next attempt to support fatal
error recovery.  We're extending the user ABI here, which means that
for any feature we add, we need to consider how we'll support it long
term and how we'll deprecate it or further extend it when the next
level of support comes.  I'm reluctant to add code that only partially
solves a problem and only addresses the side of the problem that we're
not even sure occurs.  Thanks,

Alex


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-06 Thread Cao jin


On 04/06/2017 03:38 AM, Alex Williamson wrote:
> On Wed, 5 Apr 2017 16:54:33 +0800
> Cao jin  wrote:
> 
>> Sorry for late. Distracted by other problem for a while.
>>
>> On 03/31/2017 02:16 AM, Alex Williamson wrote:
>>> On Thu, 30 Mar 2017 21:00:35 +0300  
>>
   
>
> I also asked in my previous comments to provide examples of errors 
> that
> might trigger correctable errors to the user, this comment seems to
> have been missed.  In my experience, AERs generated during device
> assignment are generally hardware faults or induced by bad guest
> drivers.  These are cases where a single fatal error is an appropriate
> and sufficient response.  We've scaled back this support to the point
> where we're only improving the situation of correctable errors and I'm
> not convinced this is worthwhile and we're not simply checking a box 
> on
> an ill-conceived marketing requirements document.

 Sorry. I noticed that question: "what actual errors do we expect
 userspace to see as non-fatal errors?", but I am confused about it.
 Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
 and Uncorrectable Error Severity Register will tell which is fatal, and
 which is non-fatal, this register is configurable, they are device
 specific as I guess. AER core driver distinguish them by
 pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand 
 your
 question. Or

 Or, Do you mean we could list the default non-fatal error of
 Uncorrectable Error Severity Register which is provided by PCIe spec?  
 
>>>
>>> I'm trying to ask why is this patch series useful.  It's clearly
>>> possible for us to signal non-fatal errors for a device to a guest, but
>>> why is it necessarily a good idea to do so?  What additional RAS
>>> feature is gained by this?  Can we give a single example of a real
>>> world scenario where a guest has been shutdown due to a non-fatal error
>>> that the guest driver would have handled?  
>>
>> We've been discussing AER for months if not years.
>> Isn't it a bit too late to ask whether AER recovery
>> by guests it useful at all?
>
>
> Years, but I think that is more indicative of the persistence of the
> developers rather than growing acceptance on my part.  For the majority
> of that we were headed down the path of full AER support with the guest
> able to invoke bus resets.  It was a complicated solution, but it was
> more clear that it had some value.   Of course that's been derailed
> due to various issues and we're now on this partial implementation that
> only covers non-fatal errors that we assume the guest can recover from
> without providing it mechanisms to do bus resets.  Is there actual
> value to this or are we just trying to fill an AER checkbox on
> someone's marketing sheet?  I don't think it's too much to ask for a
> commit log to include evidence or discussion about how a feature is
> actually a benefit to implement.

 Seems rather self evident but ok.  So something like

 With this patch, guest is able to recover from non-fatal correctable
 errors - as opposed to stopping the guest with no ability to
 recover which was the only option previously.

 Would this address your question?  
>>>
>>>
>>> No, that's just restating the theoretical usefulness of this.  Have you
>>> ever seen a non-fatal error?  Does it ever trigger?  If we can't
>>> provide a real world case of this being useful, can we at least discuss
>>> the types of things that might trigger a non-fatal error for which the
>>> guest could recover?  In patch 3/3 Cao Jin claimed we have a 50% chance
>>> of reducing VM stop conditions, but I suspect this is just a misuse of
>>> statistics, just because there are two choices, fatal vs non-fatal,
>>> does not make them equally likely.  Do we have any idea of the
>>> incidence rate of non-fatal errors?  Is it non-zero?  Thanks,
>>>   
>>
>> Apparently, I don't have experience to induce non-fatal error, device
>> error is more of a chance related with the environment(temperature,
>> humidity, etc) as I understand.
>>
>> After reading the discussion, can I construe that the old design with
>> full AER support is preferred than this new one?  The core issue of the
>> old design is that the second host link reset make the subsequent
>> guest's register reading fail, and I think this can be solved by test
>> the device's accessibility(read device' register, all F's means
>> unaccessible. IIRC, EEH of Power also use this way to test device's
>> accessiblity) and delay guest's reading if device is temporarily
>> unaccessible.
> 
> What is the actual AER handling 

Re: [PATCH v6] vfio error recovery: kernel support

2017-04-06 Thread Cao jin


On 04/06/2017 03:38 AM, Alex Williamson wrote:
> On Wed, 5 Apr 2017 16:54:33 +0800
> Cao jin  wrote:
> 
>> Sorry for late. Distracted by other problem for a while.
>>
>> On 03/31/2017 02:16 AM, Alex Williamson wrote:
>>> On Thu, 30 Mar 2017 21:00:35 +0300  
>>
   
>
> I also asked in my previous comments to provide examples of errors 
> that
> might trigger correctable errors to the user, this comment seems to
> have been missed.  In my experience, AERs generated during device
> assignment are generally hardware faults or induced by bad guest
> drivers.  These are cases where a single fatal error is an appropriate
> and sufficient response.  We've scaled back this support to the point
> where we're only improving the situation of correctable errors and I'm
> not convinced this is worthwhile and we're not simply checking a box 
> on
> an ill-conceived marketing requirements document.

 Sorry. I noticed that question: "what actual errors do we expect
 userspace to see as non-fatal errors?", but I am confused about it.
 Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
 and Uncorrectable Error Severity Register will tell which is fatal, and
 which is non-fatal, this register is configurable, they are device
 specific as I guess. AER core driver distinguish them by
 pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand 
 your
 question. Or

 Or, Do you mean we could list the default non-fatal error of
 Uncorrectable Error Severity Register which is provided by PCIe spec?  
 
>>>
>>> I'm trying to ask why is this patch series useful.  It's clearly
>>> possible for us to signal non-fatal errors for a device to a guest, but
>>> why is it necessarily a good idea to do so?  What additional RAS
>>> feature is gained by this?  Can we give a single example of a real
>>> world scenario where a guest has been shutdown due to a non-fatal error
>>> that the guest driver would have handled?  
>>
>> We've been discussing AER for months if not years.
>> Isn't it a bit too late to ask whether AER recovery
>> by guests it useful at all?
>
>
> Years, but I think that is more indicative of the persistence of the
> developers rather than growing acceptance on my part.  For the majority
> of that we were headed down the path of full AER support with the guest
> able to invoke bus resets.  It was a complicated solution, but it was
> more clear that it had some value.   Of course that's been derailed
> due to various issues and we're now on this partial implementation that
> only covers non-fatal errors that we assume the guest can recover from
> without providing it mechanisms to do bus resets.  Is there actual
> value to this or are we just trying to fill an AER checkbox on
> someone's marketing sheet?  I don't think it's too much to ask for a
> commit log to include evidence or discussion about how a feature is
> actually a benefit to implement.

 Seems rather self evident but ok.  So something like

 With this patch, guest is able to recover from non-fatal correctable
 errors - as opposed to stopping the guest with no ability to
 recover which was the only option previously.

 Would this address your question?  
>>>
>>>
>>> No, that's just restating the theoretical usefulness of this.  Have you
>>> ever seen a non-fatal error?  Does it ever trigger?  If we can't
>>> provide a real world case of this being useful, can we at least discuss
>>> the types of things that might trigger a non-fatal error for which the
>>> guest could recover?  In patch 3/3 Cao Jin claimed we have a 50% chance
>>> of reducing VM stop conditions, but I suspect this is just a misuse of
>>> statistics, just because there are two choices, fatal vs non-fatal,
>>> does not make them equally likely.  Do we have any idea of the
>>> incidence rate of non-fatal errors?  Is it non-zero?  Thanks,
>>>   
>>
>> Apparently, I don't have experience to induce non-fatal error, device
>> error is more of a chance related with the environment(temperature,
>> humidity, etc) as I understand.
>>
>> After reading the discussion, can I construe that the old design with
>> full AER support is preferred than this new one?  The core issue of the
>> old design is that the second host link reset make the subsequent
>> guest's register reading fail, and I think this can be solved by test
>> the device's accessibility(read device' register, all F's means
>> unaccessible. IIRC, EEH of Power also use this way to test device's
>> accessiblity) and delay guest's reading if device is temporarily
>> unaccessible.
> 
> What is the actual AER handling requirement you're trying to solve?  

Re: [PATCH v6] vfio error recovery: kernel support

2017-04-06 Thread Cao jin


On 04/06/2017 05:56 AM, Michael S. Tsirkin wrote:
> On Wed, Apr 05, 2017 at 04:54:33PM +0800, Cao jin wrote:
>> Apparently, I don't have experience to induce non-fatal error, device
>> error is more of a chance related with the environment(temperature,
>> humidity, etc) as I understand.
> 
> I'm not sure how to interpret this statement. I think what Alex is
> saying is simply that patches should include some justification. They
> make changes but what are they improving?
> For example:
> 
>   I tested device ABC in conditions DEF. Without a patch VM
>   stops. With the patches applied VM recovers and proceeds to
>   use the device normally.
> 
> is one reasonable justification imho.
> 

Got it. But unfortunately, until now, I haven't seen a VM stop caused by
a real device non-fatal error during device assignment(Only saw real
fatal errors after start VM).
On one side, AER error could occur theoretically; on the other side,
seldom people have seen a VM stop caused by AER. Now I am asked that do
I have a real evidence or scenario to prove that this patchset is really
useful? I don't, and we all know it is hard to trigger a real hardware
error, so, seems I am pushed into the corner.  I guess these questions
also apply for AER driver's author, if the scenario is easy to
reproduce, there is no need to write aer_inject to fake errors.

-- 
Sincerely,
Cao jin




Re: [PATCH v6] vfio error recovery: kernel support

2017-04-06 Thread Cao jin


On 04/06/2017 05:56 AM, Michael S. Tsirkin wrote:
> On Wed, Apr 05, 2017 at 04:54:33PM +0800, Cao jin wrote:
>> Apparently, I don't have experience to induce non-fatal error, device
>> error is more of a chance related with the environment(temperature,
>> humidity, etc) as I understand.
> 
> I'm not sure how to interpret this statement. I think what Alex is
> saying is simply that patches should include some justification. They
> make changes but what are they improving?
> For example:
> 
>   I tested device ABC in conditions DEF. Without a patch VM
>   stops. With the patches applied VM recovers and proceeds to
>   use the device normally.
> 
> is one reasonable justification imho.
> 

Got it. But unfortunately, until now, I haven't seen a VM stop caused by
a real device non-fatal error during device assignment(Only saw real
fatal errors after start VM).
On one side, AER error could occur theoretically; on the other side,
seldom people have seen a VM stop caused by AER. Now I am asked that do
I have a real evidence or scenario to prove that this patchset is really
useful? I don't, and we all know it is hard to trigger a real hardware
error, so, seems I am pushed into the corner.  I guess these questions
also apply for AER driver's author, if the scenario is easy to
reproduce, there is no need to write aer_inject to fake errors.

-- 
Sincerely,
Cao jin




Re: [PATCH v6] vfio error recovery: kernel support

2017-04-06 Thread Cao jin


On 04/06/2017 06:36 AM, Michael S. Tsirkin wrote:
> On Wed, Apr 05, 2017 at 04:19:10PM -0600, Alex Williamson wrote:
>> On Thu, 6 Apr 2017 00:50:22 +0300
>> "Michael S. Tsirkin"  wrote:
>>
>>> On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:
 The previous intention of trying to handle all sorts of AER faults
 clearly had more value, though even there the implementation and
 configuration requirements restricted the practicality.  For instance
 is AER support actually useful to a customer if it requires all ports
 of a multifunction device assigned to the VM?  This seems more like a
 feature targeting whole system partitioning rather than general VM
 device assignment use cases.  Maybe that's ok, but it should be a clear
 design decision.  
>>>
>>> Alex, what kind of testing do you expect to be necessary?
>>> Would you say testing on real hardware and making it trigger
>>> AER errors is a requirement?
>>
>> Testing various fatal, non-fatal, and corrected errors with aer-inject,
>> especially in multfunction configurations (where more than one port
>> is actually usable) would certainly be required.  If we have cases where
>> the driver for a companion function can escalate a non-fatal error to a
>> bus reset, that should be tested, even if it requires temporary hacks to
>> the host driver for the companion function to trigger that case.  AER
>> handling is not something that the typical user is going to experience,
>> so it should to be thoroughly tested to make sure it works when needed
>> or there's little point to doing it at all.  Thanks,
>>
>> Alex
> 
> Some things can be tested within a VM. What would you
> say would be sufficient on a VM and what has to be
> tested on bare metal?
> 

Does the "bare metal" here mean something like XenServer?
-- 
Sincerely,
Cao jin




Re: [PATCH v6] vfio error recovery: kernel support

2017-04-06 Thread Cao jin


On 04/06/2017 06:36 AM, Michael S. Tsirkin wrote:
> On Wed, Apr 05, 2017 at 04:19:10PM -0600, Alex Williamson wrote:
>> On Thu, 6 Apr 2017 00:50:22 +0300
>> "Michael S. Tsirkin"  wrote:
>>
>>> On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:
 The previous intention of trying to handle all sorts of AER faults
 clearly had more value, though even there the implementation and
 configuration requirements restricted the practicality.  For instance
 is AER support actually useful to a customer if it requires all ports
 of a multifunction device assigned to the VM?  This seems more like a
 feature targeting whole system partitioning rather than general VM
 device assignment use cases.  Maybe that's ok, but it should be a clear
 design decision.  
>>>
>>> Alex, what kind of testing do you expect to be necessary?
>>> Would you say testing on real hardware and making it trigger
>>> AER errors is a requirement?
>>
>> Testing various fatal, non-fatal, and corrected errors with aer-inject,
>> especially in multfunction configurations (where more than one port
>> is actually usable) would certainly be required.  If we have cases where
>> the driver for a companion function can escalate a non-fatal error to a
>> bus reset, that should be tested, even if it requires temporary hacks to
>> the host driver for the companion function to trigger that case.  AER
>> handling is not something that the typical user is going to experience,
>> so it should to be thoroughly tested to make sure it works when needed
>> or there's little point to doing it at all.  Thanks,
>>
>> Alex
> 
> Some things can be tested within a VM. What would you
> say would be sufficient on a VM and what has to be
> tested on bare metal?
> 

Does the "bare metal" here mean something like XenServer?
-- 
Sincerely,
Cao jin




Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Alex Williamson
On Thu, 6 Apr 2017 01:36:31 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Apr 05, 2017 at 04:19:10PM -0600, Alex Williamson wrote:
> > On Thu, 6 Apr 2017 00:50:22 +0300
> > "Michael S. Tsirkin"  wrote:
> >   
> > > On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:  
> > > > The previous intention of trying to handle all sorts of AER faults
> > > > clearly had more value, though even there the implementation and
> > > > configuration requirements restricted the practicality.  For instance
> > > > is AER support actually useful to a customer if it requires all ports
> > > > of a multifunction device assigned to the VM?  This seems more like a
> > > > feature targeting whole system partitioning rather than general VM
> > > > device assignment use cases.  Maybe that's ok, but it should be a clear
> > > > design decision.
> > > 
> > > Alex, what kind of testing do you expect to be necessary?
> > > Would you say testing on real hardware and making it trigger
> > > AER errors is a requirement?  
> > 
> > Testing various fatal, non-fatal, and corrected errors with aer-inject,
> > especially in multfunction configurations (where more than one port
> > is actually usable) would certainly be required.  If we have cases where
> > the driver for a companion function can escalate a non-fatal error to a
> > bus reset, that should be tested, even if it requires temporary hacks to
> > the host driver for the companion function to trigger that case.  AER
> > handling is not something that the typical user is going to experience,
> > so it should to be thoroughly tested to make sure it works when needed
> > or there's little point to doing it at all.  Thanks,
> > 
> > Alex  
> 
> Some things can be tested within a VM. What would you
> say would be sufficient on a VM and what has to be
> tested on bare metal?

Testing on a VM could be interesting for development, but I'd expect
bare metal for validation, no offense.  Bus reset timing can be
different, error propagation can be different, etc.  Thanks,

Alex


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Alex Williamson
On Thu, 6 Apr 2017 01:36:31 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Apr 05, 2017 at 04:19:10PM -0600, Alex Williamson wrote:
> > On Thu, 6 Apr 2017 00:50:22 +0300
> > "Michael S. Tsirkin"  wrote:
> >   
> > > On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:  
> > > > The previous intention of trying to handle all sorts of AER faults
> > > > clearly had more value, though even there the implementation and
> > > > configuration requirements restricted the practicality.  For instance
> > > > is AER support actually useful to a customer if it requires all ports
> > > > of a multifunction device assigned to the VM?  This seems more like a
> > > > feature targeting whole system partitioning rather than general VM
> > > > device assignment use cases.  Maybe that's ok, but it should be a clear
> > > > design decision.
> > > 
> > > Alex, what kind of testing do you expect to be necessary?
> > > Would you say testing on real hardware and making it trigger
> > > AER errors is a requirement?  
> > 
> > Testing various fatal, non-fatal, and corrected errors with aer-inject,
> > especially in multfunction configurations (where more than one port
> > is actually usable) would certainly be required.  If we have cases where
> > the driver for a companion function can escalate a non-fatal error to a
> > bus reset, that should be tested, even if it requires temporary hacks to
> > the host driver for the companion function to trigger that case.  AER
> > handling is not something that the typical user is going to experience,
> > so it should to be thoroughly tested to make sure it works when needed
> > or there's little point to doing it at all.  Thanks,
> > 
> > Alex  
> 
> Some things can be tested within a VM. What would you
> say would be sufficient on a VM and what has to be
> tested on bare metal?

Testing on a VM could be interesting for development, but I'd expect
bare metal for validation, no offense.  Bus reset timing can be
different, error propagation can be different, etc.  Thanks,

Alex


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Michael S. Tsirkin
On Wed, Apr 05, 2017 at 04:19:10PM -0600, Alex Williamson wrote:
> On Thu, 6 Apr 2017 00:50:22 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:
> > > The previous intention of trying to handle all sorts of AER faults
> > > clearly had more value, though even there the implementation and
> > > configuration requirements restricted the practicality.  For instance
> > > is AER support actually useful to a customer if it requires all ports
> > > of a multifunction device assigned to the VM?  This seems more like a
> > > feature targeting whole system partitioning rather than general VM
> > > device assignment use cases.  Maybe that's ok, but it should be a clear
> > > design decision.  
> > 
> > Alex, what kind of testing do you expect to be necessary?
> > Would you say testing on real hardware and making it trigger
> > AER errors is a requirement?
> 
> Testing various fatal, non-fatal, and corrected errors with aer-inject,
> especially in multfunction configurations (where more than one port
> is actually usable) would certainly be required.  If we have cases where
> the driver for a companion function can escalate a non-fatal error to a
> bus reset, that should be tested, even if it requires temporary hacks to
> the host driver for the companion function to trigger that case.  AER
> handling is not something that the typical user is going to experience,
> so it should to be thoroughly tested to make sure it works when needed
> or there's little point to doing it at all.  Thanks,
> 
> Alex

Some things can be tested within a VM. What would you
say would be sufficient on a VM and what has to be
tested on bare metal?

-- 
MST


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Michael S. Tsirkin
On Wed, Apr 05, 2017 at 04:19:10PM -0600, Alex Williamson wrote:
> On Thu, 6 Apr 2017 00:50:22 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:
> > > The previous intention of trying to handle all sorts of AER faults
> > > clearly had more value, though even there the implementation and
> > > configuration requirements restricted the practicality.  For instance
> > > is AER support actually useful to a customer if it requires all ports
> > > of a multifunction device assigned to the VM?  This seems more like a
> > > feature targeting whole system partitioning rather than general VM
> > > device assignment use cases.  Maybe that's ok, but it should be a clear
> > > design decision.  
> > 
> > Alex, what kind of testing do you expect to be necessary?
> > Would you say testing on real hardware and making it trigger
> > AER errors is a requirement?
> 
> Testing various fatal, non-fatal, and corrected errors with aer-inject,
> especially in multfunction configurations (where more than one port
> is actually usable) would certainly be required.  If we have cases where
> the driver for a companion function can escalate a non-fatal error to a
> bus reset, that should be tested, even if it requires temporary hacks to
> the host driver for the companion function to trigger that case.  AER
> handling is not something that the typical user is going to experience,
> so it should to be thoroughly tested to make sure it works when needed
> or there's little point to doing it at all.  Thanks,
> 
> Alex

Some things can be tested within a VM. What would you
say would be sufficient on a VM and what has to be
tested on bare metal?

-- 
MST


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Alex Williamson
On Thu, 6 Apr 2017 00:50:22 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:
> > The previous intention of trying to handle all sorts of AER faults
> > clearly had more value, though even there the implementation and
> > configuration requirements restricted the practicality.  For instance
> > is AER support actually useful to a customer if it requires all ports
> > of a multifunction device assigned to the VM?  This seems more like a
> > feature targeting whole system partitioning rather than general VM
> > device assignment use cases.  Maybe that's ok, but it should be a clear
> > design decision.  
> 
> Alex, what kind of testing do you expect to be necessary?
> Would you say testing on real hardware and making it trigger
> AER errors is a requirement?

Testing various fatal, non-fatal, and corrected errors with aer-inject,
especially in multfunction configurations (where more than one port
is actually usable) would certainly be required.  If we have cases where
the driver for a companion function can escalate a non-fatal error to a
bus reset, that should be tested, even if it requires temporary hacks to
the host driver for the companion function to trigger that case.  AER
handling is not something that the typical user is going to experience,
so it should to be thoroughly tested to make sure it works when needed
or there's little point to doing it at all.  Thanks,

Alex


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Alex Williamson
On Thu, 6 Apr 2017 00:50:22 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:
> > The previous intention of trying to handle all sorts of AER faults
> > clearly had more value, though even there the implementation and
> > configuration requirements restricted the practicality.  For instance
> > is AER support actually useful to a customer if it requires all ports
> > of a multifunction device assigned to the VM?  This seems more like a
> > feature targeting whole system partitioning rather than general VM
> > device assignment use cases.  Maybe that's ok, but it should be a clear
> > design decision.  
> 
> Alex, what kind of testing do you expect to be necessary?
> Would you say testing on real hardware and making it trigger
> AER errors is a requirement?

Testing various fatal, non-fatal, and corrected errors with aer-inject,
especially in multfunction configurations (where more than one port
is actually usable) would certainly be required.  If we have cases where
the driver for a companion function can escalate a non-fatal error to a
bus reset, that should be tested, even if it requires temporary hacks to
the host driver for the companion function to trigger that case.  AER
handling is not something that the typical user is going to experience,
so it should to be thoroughly tested to make sure it works when needed
or there's little point to doing it at all.  Thanks,

Alex


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Michael S. Tsirkin
On Wed, Apr 05, 2017 at 04:54:33PM +0800, Cao jin wrote:
> Apparently, I don't have experience to induce non-fatal error, device
> error is more of a chance related with the environment(temperature,
> humidity, etc) as I understand.

I'm not sure how to interpret this statement. I think what Alex is
saying is simply that patches should include some justification. They
make changes but what are they improving?
For example:

I tested device ABC in conditions DEF. Without a patch VM
stops. With the patches applied VM recovers and proceeds to
use the device normally.

is one reasonable justification imho.

-- 
MST


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Michael S. Tsirkin
On Wed, Apr 05, 2017 at 04:54:33PM +0800, Cao jin wrote:
> Apparently, I don't have experience to induce non-fatal error, device
> error is more of a chance related with the environment(temperature,
> humidity, etc) as I understand.

I'm not sure how to interpret this statement. I think what Alex is
saying is simply that patches should include some justification. They
make changes but what are they improving?
For example:

I tested device ABC in conditions DEF. Without a patch VM
stops. With the patches applied VM recovers and proceeds to
use the device normally.

is one reasonable justification imho.

-- 
MST


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Michael S. Tsirkin
On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:
> The previous intention of trying to handle all sorts of AER faults
> clearly had more value, though even there the implementation and
> configuration requirements restricted the practicality.  For instance
> is AER support actually useful to a customer if it requires all ports
> of a multifunction device assigned to the VM?  This seems more like a
> feature targeting whole system partitioning rather than general VM
> device assignment use cases.  Maybe that's ok, but it should be a clear
> design decision.

Alex, what kind of testing do you expect to be necessary?
Would you say testing on real hardware and making it trigger
AER errors is a requirement?

-- 
MST


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Michael S. Tsirkin
On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:
> The previous intention of trying to handle all sorts of AER faults
> clearly had more value, though even there the implementation and
> configuration requirements restricted the practicality.  For instance
> is AER support actually useful to a customer if it requires all ports
> of a multifunction device assigned to the VM?  This seems more like a
> feature targeting whole system partitioning rather than general VM
> device assignment use cases.  Maybe that's ok, but it should be a clear
> design decision.

Alex, what kind of testing do you expect to be necessary?
Would you say testing on real hardware and making it trigger
AER errors is a requirement?

-- 
MST


Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Alex Williamson
On Wed, 5 Apr 2017 16:54:33 +0800
Cao jin  wrote:

> Sorry for late. Distracted by other problem for a while.
> 
> On 03/31/2017 02:16 AM, Alex Williamson wrote:
> > On Thu, 30 Mar 2017 21:00:35 +0300  
> 
> >>   
> >>>
> >>> I also asked in my previous comments to provide examples of errors 
> >>> that
> >>> might trigger correctable errors to the user, this comment seems to
> >>> have been missed.  In my experience, AERs generated during device
> >>> assignment are generally hardware faults or induced by bad guest
> >>> drivers.  These are cases where a single fatal error is an appropriate
> >>> and sufficient response.  We've scaled back this support to the point
> >>> where we're only improving the situation of correctable errors and I'm
> >>> not convinced this is worthwhile and we're not simply checking a box 
> >>> on
> >>> an ill-conceived marketing requirements document.
> >>
> >> Sorry. I noticed that question: "what actual errors do we expect
> >> userspace to see as non-fatal errors?", but I am confused about it.
> >> Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
> >> and Uncorrectable Error Severity Register will tell which is fatal, and
> >> which is non-fatal, this register is configurable, they are device
> >> specific as I guess. AER core driver distinguish them by
> >> pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand 
> >> your
> >> question. Or
> >>
> >> Or, Do you mean we could list the default non-fatal error of
> >> Uncorrectable Error Severity Register which is provided by PCIe spec?  
> >> 
> >
> > I'm trying to ask why is this patch series useful.  It's clearly
> > possible for us to signal non-fatal errors for a device to a guest, but
> > why is it necessarily a good idea to do so?  What additional RAS
> > feature is gained by this?  Can we give a single example of a real
> > world scenario where a guest has been shutdown due to a non-fatal error
> > that the guest driver would have handled?  
> 
>  We've been discussing AER for months if not years.
>  Isn't it a bit too late to ask whether AER recovery
>  by guests it useful at all?
> >>>
> >>>
> >>> Years, but I think that is more indicative of the persistence of the
> >>> developers rather than growing acceptance on my part.  For the majority
> >>> of that we were headed down the path of full AER support with the guest
> >>> able to invoke bus resets.  It was a complicated solution, but it was
> >>> more clear that it had some value.   Of course that's been derailed
> >>> due to various issues and we're now on this partial implementation that
> >>> only covers non-fatal errors that we assume the guest can recover from
> >>> without providing it mechanisms to do bus resets.  Is there actual
> >>> value to this or are we just trying to fill an AER checkbox on
> >>> someone's marketing sheet?  I don't think it's too much to ask for a
> >>> commit log to include evidence or discussion about how a feature is
> >>> actually a benefit to implement.
> >>
> >> Seems rather self evident but ok.  So something like
> >>
> >> With this patch, guest is able to recover from non-fatal correctable
> >> errors - as opposed to stopping the guest with no ability to
> >> recover which was the only option previously.
> >>
> >> Would this address your question?  
> > 
> > 
> > No, that's just restating the theoretical usefulness of this.  Have you
> > ever seen a non-fatal error?  Does it ever trigger?  If we can't
> > provide a real world case of this being useful, can we at least discuss
> > the types of things that might trigger a non-fatal error for which the
> > guest could recover?  In patch 3/3 Cao Jin claimed we have a 50% chance
> > of reducing VM stop conditions, but I suspect this is just a misuse of
> > statistics, just because there are two choices, fatal vs non-fatal,
> > does not make them equally likely.  Do we have any idea of the
> > incidence rate of non-fatal errors?  Is it non-zero?  Thanks,
> >   
> 
> Apparently, I don't have experience to induce non-fatal error, device
> error is more of a chance related with the environment(temperature,
> humidity, etc) as I understand.
> 
> After reading the discussion, can I construe that the old design with
> full AER support is preferred than this new one?  The core issue of the
> old design is that the second host link reset make the subsequent
> guest's register reading fail, and I think this can be solved by test
> the device's accessibility(read device' register, all F's means
> unaccessible. IIRC, EEH of Power also use this way to test device's
> accessiblity) and delay guest's reading if device is temporarily
> unaccessible.

What is the actual AER handling requirement you're trying to solve?  Is
it simply to check a box on a marketing 

Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Alex Williamson
On Wed, 5 Apr 2017 16:54:33 +0800
Cao jin  wrote:

> Sorry for late. Distracted by other problem for a while.
> 
> On 03/31/2017 02:16 AM, Alex Williamson wrote:
> > On Thu, 30 Mar 2017 21:00:35 +0300  
> 
> >>   
> >>>
> >>> I also asked in my previous comments to provide examples of errors 
> >>> that
> >>> might trigger correctable errors to the user, this comment seems to
> >>> have been missed.  In my experience, AERs generated during device
> >>> assignment are generally hardware faults or induced by bad guest
> >>> drivers.  These are cases where a single fatal error is an appropriate
> >>> and sufficient response.  We've scaled back this support to the point
> >>> where we're only improving the situation of correctable errors and I'm
> >>> not convinced this is worthwhile and we're not simply checking a box 
> >>> on
> >>> an ill-conceived marketing requirements document.
> >>
> >> Sorry. I noticed that question: "what actual errors do we expect
> >> userspace to see as non-fatal errors?", but I am confused about it.
> >> Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
> >> and Uncorrectable Error Severity Register will tell which is fatal, and
> >> which is non-fatal, this register is configurable, they are device
> >> specific as I guess. AER core driver distinguish them by
> >> pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand 
> >> your
> >> question. Or
> >>
> >> Or, Do you mean we could list the default non-fatal error of
> >> Uncorrectable Error Severity Register which is provided by PCIe spec?  
> >> 
> >
> > I'm trying to ask why is this patch series useful.  It's clearly
> > possible for us to signal non-fatal errors for a device to a guest, but
> > why is it necessarily a good idea to do so?  What additional RAS
> > feature is gained by this?  Can we give a single example of a real
> > world scenario where a guest has been shutdown due to a non-fatal error
> > that the guest driver would have handled?  
> 
>  We've been discussing AER for months if not years.
>  Isn't it a bit too late to ask whether AER recovery
>  by guests it useful at all?
> >>>
> >>>
> >>> Years, but I think that is more indicative of the persistence of the
> >>> developers rather than growing acceptance on my part.  For the majority
> >>> of that we were headed down the path of full AER support with the guest
> >>> able to invoke bus resets.  It was a complicated solution, but it was
> >>> more clear that it had some value.   Of course that's been derailed
> >>> due to various issues and we're now on this partial implementation that
> >>> only covers non-fatal errors that we assume the guest can recover from
> >>> without providing it mechanisms to do bus resets.  Is there actual
> >>> value to this or are we just trying to fill an AER checkbox on
> >>> someone's marketing sheet?  I don't think it's too much to ask for a
> >>> commit log to include evidence or discussion about how a feature is
> >>> actually a benefit to implement.
> >>
> >> Seems rather self evident but ok.  So something like
> >>
> >> With this patch, guest is able to recover from non-fatal correctable
> >> errors - as opposed to stopping the guest with no ability to
> >> recover which was the only option previously.
> >>
> >> Would this address your question?  
> > 
> > 
> > No, that's just restating the theoretical usefulness of this.  Have you
> > ever seen a non-fatal error?  Does it ever trigger?  If we can't
> > provide a real world case of this being useful, can we at least discuss
> > the types of things that might trigger a non-fatal error for which the
> > guest could recover?  In patch 3/3 Cao Jin claimed we have a 50% chance
> > of reducing VM stop conditions, but I suspect this is just a misuse of
> > statistics, just because there are two choices, fatal vs non-fatal,
> > does not make them equally likely.  Do we have any idea of the
> > incidence rate of non-fatal errors?  Is it non-zero?  Thanks,
> >   
> 
> Apparently, I don't have experience to induce non-fatal error, device
> error is more of a chance related with the environment(temperature,
> humidity, etc) as I understand.
> 
> After reading the discussion, can I construe that the old design with
> full AER support is preferred than this new one?  The core issue of the
> old design is that the second host link reset make the subsequent
> guest's register reading fail, and I think this can be solved by test
> the device's accessibility(read device' register, all F's means
> unaccessible. IIRC, EEH of Power also use this way to test device's
> accessiblity) and delay guest's reading if device is temporarily
> unaccessible.

What is the actual AER handling requirement you're trying to solve?  Is
it simply to check a box on a marketing spec sheet for anything 

Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Cao jin
Sorry for late. Distracted by other problem for a while.

On 03/31/2017 02:16 AM, Alex Williamson wrote:
> On Thu, 30 Mar 2017 21:00:35 +0300

>> 
>>>
>>> I also asked in my previous comments to provide examples of errors that
>>> might trigger correctable errors to the user, this comment seems to
>>> have been missed.  In my experience, AERs generated during device
>>> assignment are generally hardware faults or induced by bad guest
>>> drivers.  These are cases where a single fatal error is an appropriate
>>> and sufficient response.  We've scaled back this support to the point
>>> where we're only improving the situation of correctable errors and I'm
>>> not convinced this is worthwhile and we're not simply checking a box on
>>> an ill-conceived marketing requirements document.  
>>
>> Sorry. I noticed that question: "what actual errors do we expect
>> userspace to see as non-fatal errors?", but I am confused about it.
>> Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
>> and Uncorrectable Error Severity Register will tell which is fatal, and
>> which is non-fatal, this register is configurable, they are device
>> specific as I guess. AER core driver distinguish them by
>> pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand your
>> question. Or
>>
>> Or, Do you mean we could list the default non-fatal error of
>> Uncorrectable Error Severity Register which is provided by PCIe spec?
>
> I'm trying to ask why is this patch series useful.  It's clearly
> possible for us to signal non-fatal errors for a device to a guest, but
> why is it necessarily a good idea to do so?  What additional RAS
> feature is gained by this?  Can we give a single example of a real
> world scenario where a guest has been shutdown due to a non-fatal error
> that the guest driver would have handled?

 We've been discussing AER for months if not years.
 Isn't it a bit too late to ask whether AER recovery
 by guests it useful at all?  
>>>
>>>
>>> Years, but I think that is more indicative of the persistence of the
>>> developers rather than growing acceptance on my part.  For the majority
>>> of that we were headed down the path of full AER support with the guest
>>> able to invoke bus resets.  It was a complicated solution, but it was
>>> more clear that it had some value.   Of course that's been derailed
>>> due to various issues and we're now on this partial implementation that
>>> only covers non-fatal errors that we assume the guest can recover from
>>> without providing it mechanisms to do bus resets.  Is there actual
>>> value to this or are we just trying to fill an AER checkbox on
>>> someone's marketing sheet?  I don't think it's too much to ask for a
>>> commit log to include evidence or discussion about how a feature is
>>> actually a benefit to implement.  
>>
>> Seems rather self evident but ok.  So something like
>>
>> With this patch, guest is able to recover from non-fatal correctable
>> errors - as opposed to stopping the guest with no ability to
>> recover which was the only option previously.
>>
>> Would this address your question?
> 
> 
> No, that's just restating the theoretical usefulness of this.  Have you
> ever seen a non-fatal error?  Does it ever trigger?  If we can't
> provide a real world case of this being useful, can we at least discuss
> the types of things that might trigger a non-fatal error for which the
> guest could recover?  In patch 3/3 Cao Jin claimed we have a 50% chance
> of reducing VM stop conditions, but I suspect this is just a misuse of
> statistics, just because there are two choices, fatal vs non-fatal,
> does not make them equally likely.  Do we have any idea of the
> incidence rate of non-fatal errors?  Is it non-zero?  Thanks,
> 

Apparently, I don't have experience to induce non-fatal error, device
error is more of a chance related with the environment(temperature,
humidity, etc) as I understand.

After reading the discussion, can I construe that the old design with
full AER support is preferred than this new one?  The core issue of the
old design is that the second host link reset make the subsequent
guest's register reading fail, and I think this can be solved by test
the device's accessibility(read device' register, all F's means
unaccessible. IIRC, EEH of Power also use this way to test device's
accessiblity) and delay guest's reading if device is temporarily
unaccessible.

-- 
Sincerely,
Cao jin




Re: [PATCH v6] vfio error recovery: kernel support

2017-04-05 Thread Cao jin
Sorry for late. Distracted by other problem for a while.

On 03/31/2017 02:16 AM, Alex Williamson wrote:
> On Thu, 30 Mar 2017 21:00:35 +0300

>> 
>>>
>>> I also asked in my previous comments to provide examples of errors that
>>> might trigger correctable errors to the user, this comment seems to
>>> have been missed.  In my experience, AERs generated during device
>>> assignment are generally hardware faults or induced by bad guest
>>> drivers.  These are cases where a single fatal error is an appropriate
>>> and sufficient response.  We've scaled back this support to the point
>>> where we're only improving the situation of correctable errors and I'm
>>> not convinced this is worthwhile and we're not simply checking a box on
>>> an ill-conceived marketing requirements document.  
>>
>> Sorry. I noticed that question: "what actual errors do we expect
>> userspace to see as non-fatal errors?", but I am confused about it.
>> Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
>> and Uncorrectable Error Severity Register will tell which is fatal, and
>> which is non-fatal, this register is configurable, they are device
>> specific as I guess. AER core driver distinguish them by
>> pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand your
>> question. Or
>>
>> Or, Do you mean we could list the default non-fatal error of
>> Uncorrectable Error Severity Register which is provided by PCIe spec?
>
> I'm trying to ask why is this patch series useful.  It's clearly
> possible for us to signal non-fatal errors for a device to a guest, but
> why is it necessarily a good idea to do so?  What additional RAS
> feature is gained by this?  Can we give a single example of a real
> world scenario where a guest has been shutdown due to a non-fatal error
> that the guest driver would have handled?

 We've been discussing AER for months if not years.
 Isn't it a bit too late to ask whether AER recovery
 by guests it useful at all?  
>>>
>>>
>>> Years, but I think that is more indicative of the persistence of the
>>> developers rather than growing acceptance on my part.  For the majority
>>> of that we were headed down the path of full AER support with the guest
>>> able to invoke bus resets.  It was a complicated solution, but it was
>>> more clear that it had some value.   Of course that's been derailed
>>> due to various issues and we're now on this partial implementation that
>>> only covers non-fatal errors that we assume the guest can recover from
>>> without providing it mechanisms to do bus resets.  Is there actual
>>> value to this or are we just trying to fill an AER checkbox on
>>> someone's marketing sheet?  I don't think it's too much to ask for a
>>> commit log to include evidence or discussion about how a feature is
>>> actually a benefit to implement.  
>>
>> Seems rather self evident but ok.  So something like
>>
>> With this patch, guest is able to recover from non-fatal correctable
>> errors - as opposed to stopping the guest with no ability to
>> recover which was the only option previously.
>>
>> Would this address your question?
> 
> 
> No, that's just restating the theoretical usefulness of this.  Have you
> ever seen a non-fatal error?  Does it ever trigger?  If we can't
> provide a real world case of this being useful, can we at least discuss
> the types of things that might trigger a non-fatal error for which the
> guest could recover?  In patch 3/3 Cao Jin claimed we have a 50% chance
> of reducing VM stop conditions, but I suspect this is just a misuse of
> statistics, just because there are two choices, fatal vs non-fatal,
> does not make them equally likely.  Do we have any idea of the
> incidence rate of non-fatal errors?  Is it non-zero?  Thanks,
> 

Apparently, I don't have experience to induce non-fatal error, device
error is more of a chance related with the environment(temperature,
humidity, etc) as I understand.

After reading the discussion, can I construe that the old design with
full AER support is preferred than this new one?  The core issue of the
old design is that the second host link reset make the subsequent
guest's register reading fail, and I think this can be solved by test
the device's accessibility(read device' register, all F's means
unaccessible. IIRC, EEH of Power also use this way to test device's
accessiblity) and delay guest's reading if device is temporarily
unaccessible.

-- 
Sincerely,
Cao jin




Re: [PATCH v6] vfio error recovery: kernel support

2017-03-30 Thread Alex Williamson
On Thu, 30 Mar 2017 21:00:35 +0300
"Michael S. Tsirkin"  wrote:

> On Tue, Mar 28, 2017 at 08:55:13PM -0600, Alex Williamson wrote:
> > On Wed, 29 Mar 2017 03:01:48 +0300
> > "Michael S. Tsirkin"  wrote:
> >   
> > > On Tue, Mar 28, 2017 at 10:12:33AM -0600, Alex Williamson wrote:  
> > > > On Tue, 28 Mar 2017 21:47:00 +0800
> > > > Cao jin  wrote:
> > > > 
> > > > > On 03/25/2017 06:12 AM, Alex Williamson wrote:
> > > > > > On Thu, 23 Mar 2017 17:07:31 +0800
> > > > > > Cao jin  wrote:
> > > > > > 
> > > > > > A more appropriate patch subject would be:
> > > > > > 
> > > > > > vfio-pci: Report correctable errors and slot reset events to user
> > > > > >  
> > > > > 
> > > > > Correctable? It is confusing to me. Correctable error has its clear
> > > > > definition in PCIe spec, shouldn't it be "non-fatal"?
> > > > 
> > > > My mistake, non-fatal.
> > > >  
> > > > > >> From: "Michael S. Tsirkin"   
> > > > > > 
> > > > > > This hardly seems accurate anymore.  You could say Suggested-by and 
> > > > > > let
> > > > > > Michael add a sign-off, but it's changed since he sent it.
> > > > > >   
> > > > > >>
> > > > > >> 0. What happens now (PCIE AER only)
> > > > > >>Fatal errors cause a link reset. Non fatal errors don't.
> > > > > >>All errors stop the QEMU guest eventually, but not immediately,
> > > > > >>because it's detected and reported asynchronously.
> > > > > >>Interrupts are forwarded as usual.
> > > > > >>Correctable errors are not reported to user at all.
> > > > > >>
> > > > > >>Note:
> > > > > >>PPC EEH is different, but this approach won't affect EEH. EEH 
> > > > > >> treat
> > > > > >>all errors as fatal ones in AER, so they will still be 
> > > > > >> signalled to user
> > > > > >>via the legacy eventfd.  Besides, all devices/functions in a PE 
> > > > > >> belongs
> > > > > >>to the same IOMMU group, so the slot_reset handler in this 
> > > > > >> approach
> > > > > >>won't affect EEH either.
> > > > > >>
> > > > > >> 1. Correctable errors
> > > > > >>Hardware can correct these errors without software intervention,
> > > > > >>clear the error status is enough, this is what already done now.
> > > > > >>No need to recover it, nothing changed, leave it as it is.
> > > > > >>
> > > > > >> 2. Fatal errors
> > > > > >>They will induce a link reset. This is troublesome when user is
> > > > > >>a QEMU guest. This approach doesn't touch the existing 
> > > > > >> mechanism.
> > > > > >>
> > > > > >> 3. Non-fatal errors
> > > > > >>Before this patch, they are signalled to user the same way as 
> > > > > >> fatal ones.
> > > > > >>With this patch, a new eventfd is introduced only for non-fatal 
> > > > > >> error
> > > > > >>notification. By splitting non-fatal ones out, it will benefit 
> > > > > >> AER
> > > > > >>recovery of a QEMU guest user.
> > > > > >>
> > > > > >>To maintain backwards compatibility with userspace, non-fatal 
> > > > > >> errors
> > > > > >>will continue to trigger via the existing error interrupt index 
> > > > > >> if a
> > > > > >>non-fatal signaling mechanism has not been registered.
> > > > > >>
> > > > > >>Note:
> > > > > >>In case of PCI Express errors, kernel might request a slot reset
> > > > > >>affecting our device (from our point of view this is a passive 
> > > > > >> device
> > > > > >>reset as opposed to an active one requested by vfio itself).
> > > > > >>This might currently happen if a slot reset is requested by a 
> > > > > >> driver
> > > > > >>(other than vfio) bound to another device function in the same 
> > > > > >> slot.
> > > > > >>This will cause our device to lose its state so report this 
> > > > > >> event to
> > > > > >>userspace.  
> > > > > > 
> > > > > > I tried to convey this in my last comments, I don't think this is an
> > > > > > appropriate commit log.  Lead with what is the problem you're 
> > > > > > trying to
> > > > > > fix and why, what is the benefit to the user, and how is the change
> > > > > > accomplished.  If you want to provide a State of Error Handling in
> > > > > > VFIO, append it after the main points of the commit log.  
> > > > > 
> > > > > ok.
> > > > > 
> > > > > > 
> > > > > > I also asked in my previous comments to provide examples of errors 
> > > > > > that
> > > > > > might trigger correctable errors to the user, this comment seems to
> > > > > > have been missed.  In my experience, AERs generated during device
> > > > > > assignment are generally hardware faults or induced by bad guest
> > > > > > drivers.  These are cases where a single fatal error is an 
> > > > > > appropriate
> > > > > > and sufficient response.  We've scaled back this support to the 
> > > > > > point
> > > > > > where we're only improving the situation of correctable errors and 

Re: [PATCH v6] vfio error recovery: kernel support

2017-03-30 Thread Alex Williamson
On Thu, 30 Mar 2017 21:00:35 +0300
"Michael S. Tsirkin"  wrote:

> On Tue, Mar 28, 2017 at 08:55:13PM -0600, Alex Williamson wrote:
> > On Wed, 29 Mar 2017 03:01:48 +0300
> > "Michael S. Tsirkin"  wrote:
> >   
> > > On Tue, Mar 28, 2017 at 10:12:33AM -0600, Alex Williamson wrote:  
> > > > On Tue, 28 Mar 2017 21:47:00 +0800
> > > > Cao jin  wrote:
> > > > 
> > > > > On 03/25/2017 06:12 AM, Alex Williamson wrote:
> > > > > > On Thu, 23 Mar 2017 17:07:31 +0800
> > > > > > Cao jin  wrote:
> > > > > > 
> > > > > > A more appropriate patch subject would be:
> > > > > > 
> > > > > > vfio-pci: Report correctable errors and slot reset events to user
> > > > > >  
> > > > > 
> > > > > Correctable? It is confusing to me. Correctable error has its clear
> > > > > definition in PCIe spec, shouldn't it be "non-fatal"?
> > > > 
> > > > My mistake, non-fatal.
> > > >  
> > > > > >> From: "Michael S. Tsirkin"   
> > > > > > 
> > > > > > This hardly seems accurate anymore.  You could say Suggested-by and 
> > > > > > let
> > > > > > Michael add a sign-off, but it's changed since he sent it.
> > > > > >   
> > > > > >>
> > > > > >> 0. What happens now (PCIE AER only)
> > > > > >>Fatal errors cause a link reset. Non fatal errors don't.
> > > > > >>All errors stop the QEMU guest eventually, but not immediately,
> > > > > >>because it's detected and reported asynchronously.
> > > > > >>Interrupts are forwarded as usual.
> > > > > >>Correctable errors are not reported to user at all.
> > > > > >>
> > > > > >>Note:
> > > > > >>PPC EEH is different, but this approach won't affect EEH. EEH 
> > > > > >> treat
> > > > > >>all errors as fatal ones in AER, so they will still be 
> > > > > >> signalled to user
> > > > > >>via the legacy eventfd.  Besides, all devices/functions in a PE 
> > > > > >> belongs
> > > > > >>to the same IOMMU group, so the slot_reset handler in this 
> > > > > >> approach
> > > > > >>won't affect EEH either.
> > > > > >>
> > > > > >> 1. Correctable errors
> > > > > >>Hardware can correct these errors without software intervention,
> > > > > >>clear the error status is enough, this is what already done now.
> > > > > >>No need to recover it, nothing changed, leave it as it is.
> > > > > >>
> > > > > >> 2. Fatal errors
> > > > > >>They will induce a link reset. This is troublesome when user is
> > > > > >>a QEMU guest. This approach doesn't touch the existing 
> > > > > >> mechanism.
> > > > > >>
> > > > > >> 3. Non-fatal errors
> > > > > >>Before this patch, they are signalled to user the same way as 
> > > > > >> fatal ones.
> > > > > >>With this patch, a new eventfd is introduced only for non-fatal 
> > > > > >> error
> > > > > >>notification. By splitting non-fatal ones out, it will benefit 
> > > > > >> AER
> > > > > >>recovery of a QEMU guest user.
> > > > > >>
> > > > > >>To maintain backwards compatibility with userspace, non-fatal 
> > > > > >> errors
> > > > > >>will continue to trigger via the existing error interrupt index 
> > > > > >> if a
> > > > > >>non-fatal signaling mechanism has not been registered.
> > > > > >>
> > > > > >>Note:
> > > > > >>In case of PCI Express errors, kernel might request a slot reset
> > > > > >>affecting our device (from our point of view this is a passive 
> > > > > >> device
> > > > > >>reset as opposed to an active one requested by vfio itself).
> > > > > >>This might currently happen if a slot reset is requested by a 
> > > > > >> driver
> > > > > >>(other than vfio) bound to another device function in the same 
> > > > > >> slot.
> > > > > >>This will cause our device to lose its state so report this 
> > > > > >> event to
> > > > > >>userspace.  
> > > > > > 
> > > > > > I tried to convey this in my last comments, I don't think this is an
> > > > > > appropriate commit log.  Lead with what is the problem you're 
> > > > > > trying to
> > > > > > fix and why, what is the benefit to the user, and how is the change
> > > > > > accomplished.  If you want to provide a State of Error Handling in
> > > > > > VFIO, append it after the main points of the commit log.  
> > > > > 
> > > > > ok.
> > > > > 
> > > > > > 
> > > > > > I also asked in my previous comments to provide examples of errors 
> > > > > > that
> > > > > > might trigger correctable errors to the user, this comment seems to
> > > > > > have been missed.  In my experience, AERs generated during device
> > > > > > assignment are generally hardware faults or induced by bad guest
> > > > > > drivers.  These are cases where a single fatal error is an 
> > > > > > appropriate
> > > > > > and sufficient response.  We've scaled back this support to the 
> > > > > > point
> > > > > > where we're only improving the situation of correctable errors and 
> > > > > > I'm
> > > > > > not convinced this is worthwhile and we're not simply checking a 
> > > > 

Re: [PATCH v6] vfio error recovery: kernel support

2017-03-30 Thread Michael S. Tsirkin
On Tue, Mar 28, 2017 at 08:55:13PM -0600, Alex Williamson wrote:
> On Wed, 29 Mar 2017 03:01:48 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Tue, Mar 28, 2017 at 10:12:33AM -0600, Alex Williamson wrote:
> > > On Tue, 28 Mar 2017 21:47:00 +0800
> > > Cao jin  wrote:
> > >   
> > > > On 03/25/2017 06:12 AM, Alex Williamson wrote:  
> > > > > On Thu, 23 Mar 2017 17:07:31 +0800
> > > > > Cao jin  wrote:
> > > > > 
> > > > > A more appropriate patch subject would be:
> > > > > 
> > > > > vfio-pci: Report correctable errors and slot reset events to user
> > > > >
> > > > 
> > > > Correctable? It is confusing to me. Correctable error has its clear
> > > > definition in PCIe spec, shouldn't it be "non-fatal"?  
> > > 
> > > My mistake, non-fatal.
> > >
> > > > >> From: "Michael S. Tsirkin" 
> > > > > 
> > > > > This hardly seems accurate anymore.  You could say Suggested-by and 
> > > > > let
> > > > > Michael add a sign-off, but it's changed since he sent it.
> > > > > 
> > > > >>
> > > > >> 0. What happens now (PCIE AER only)
> > > > >>Fatal errors cause a link reset. Non fatal errors don't.
> > > > >>All errors stop the QEMU guest eventually, but not immediately,
> > > > >>because it's detected and reported asynchronously.
> > > > >>Interrupts are forwarded as usual.
> > > > >>Correctable errors are not reported to user at all.
> > > > >>
> > > > >>Note:
> > > > >>PPC EEH is different, but this approach won't affect EEH. EEH 
> > > > >> treat
> > > > >>all errors as fatal ones in AER, so they will still be signalled 
> > > > >> to user
> > > > >>via the legacy eventfd.  Besides, all devices/functions in a PE 
> > > > >> belongs
> > > > >>to the same IOMMU group, so the slot_reset handler in this 
> > > > >> approach
> > > > >>won't affect EEH either.
> > > > >>
> > > > >> 1. Correctable errors
> > > > >>Hardware can correct these errors without software intervention,
> > > > >>clear the error status is enough, this is what already done now.
> > > > >>No need to recover it, nothing changed, leave it as it is.
> > > > >>
> > > > >> 2. Fatal errors
> > > > >>They will induce a link reset. This is troublesome when user is
> > > > >>a QEMU guest. This approach doesn't touch the existing mechanism.
> > > > >>
> > > > >> 3. Non-fatal errors
> > > > >>Before this patch, they are signalled to user the same way as 
> > > > >> fatal ones.
> > > > >>With this patch, a new eventfd is introduced only for non-fatal 
> > > > >> error
> > > > >>notification. By splitting non-fatal ones out, it will benefit AER
> > > > >>recovery of a QEMU guest user.
> > > > >>
> > > > >>To maintain backwards compatibility with userspace, non-fatal 
> > > > >> errors
> > > > >>will continue to trigger via the existing error interrupt index 
> > > > >> if a
> > > > >>non-fatal signaling mechanism has not been registered.
> > > > >>
> > > > >>Note:
> > > > >>In case of PCI Express errors, kernel might request a slot reset
> > > > >>affecting our device (from our point of view this is a passive 
> > > > >> device
> > > > >>reset as opposed to an active one requested by vfio itself).
> > > > >>This might currently happen if a slot reset is requested by a 
> > > > >> driver
> > > > >>(other than vfio) bound to another device function in the same 
> > > > >> slot.
> > > > >>This will cause our device to lose its state so report this event 
> > > > >> to
> > > > >>userspace.
> > > > > 
> > > > > I tried to convey this in my last comments, I don't think this is an
> > > > > appropriate commit log.  Lead with what is the problem you're trying 
> > > > > to
> > > > > fix and why, what is the benefit to the user, and how is the change
> > > > > accomplished.  If you want to provide a State of Error Handling in
> > > > > VFIO, append it after the main points of the commit log.
> > > > 
> > > > ok.
> > > >   
> > > > > 
> > > > > I also asked in my previous comments to provide examples of errors 
> > > > > that
> > > > > might trigger correctable errors to the user, this comment seems to
> > > > > have been missed.  In my experience, AERs generated during device
> > > > > assignment are generally hardware faults or induced by bad guest
> > > > > drivers.  These are cases where a single fatal error is an appropriate
> > > > > and sufficient response.  We've scaled back this support to the point
> > > > > where we're only improving the situation of correctable errors and I'm
> > > > > not convinced this is worthwhile and we're not simply checking a box 
> > > > > on
> > > > > an ill-conceived marketing requirements document.
> > > > 
> > > > Sorry. I noticed that question: "what actual errors do we expect
> > > > userspace to see as non-fatal errors?", but I am confused about it.
> > > > Correctable, non-fatal, fatal 

Re: [PATCH v6] vfio error recovery: kernel support

2017-03-30 Thread Michael S. Tsirkin
On Tue, Mar 28, 2017 at 08:55:13PM -0600, Alex Williamson wrote:
> On Wed, 29 Mar 2017 03:01:48 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Tue, Mar 28, 2017 at 10:12:33AM -0600, Alex Williamson wrote:
> > > On Tue, 28 Mar 2017 21:47:00 +0800
> > > Cao jin  wrote:
> > >   
> > > > On 03/25/2017 06:12 AM, Alex Williamson wrote:  
> > > > > On Thu, 23 Mar 2017 17:07:31 +0800
> > > > > Cao jin  wrote:
> > > > > 
> > > > > A more appropriate patch subject would be:
> > > > > 
> > > > > vfio-pci: Report correctable errors and slot reset events to user
> > > > >
> > > > 
> > > > Correctable? It is confusing to me. Correctable error has its clear
> > > > definition in PCIe spec, shouldn't it be "non-fatal"?  
> > > 
> > > My mistake, non-fatal.
> > >
> > > > >> From: "Michael S. Tsirkin" 
> > > > > 
> > > > > This hardly seems accurate anymore.  You could say Suggested-by and 
> > > > > let
> > > > > Michael add a sign-off, but it's changed since he sent it.
> > > > > 
> > > > >>
> > > > >> 0. What happens now (PCIE AER only)
> > > > >>Fatal errors cause a link reset. Non fatal errors don't.
> > > > >>All errors stop the QEMU guest eventually, but not immediately,
> > > > >>because it's detected and reported asynchronously.
> > > > >>Interrupts are forwarded as usual.
> > > > >>Correctable errors are not reported to user at all.
> > > > >>
> > > > >>Note:
> > > > >>PPC EEH is different, but this approach won't affect EEH. EEH 
> > > > >> treat
> > > > >>all errors as fatal ones in AER, so they will still be signalled 
> > > > >> to user
> > > > >>via the legacy eventfd.  Besides, all devices/functions in a PE 
> > > > >> belongs
> > > > >>to the same IOMMU group, so the slot_reset handler in this 
> > > > >> approach
> > > > >>won't affect EEH either.
> > > > >>
> > > > >> 1. Correctable errors
> > > > >>Hardware can correct these errors without software intervention,
> > > > >>clear the error status is enough, this is what already done now.
> > > > >>No need to recover it, nothing changed, leave it as it is.
> > > > >>
> > > > >> 2. Fatal errors
> > > > >>They will induce a link reset. This is troublesome when user is
> > > > >>a QEMU guest. This approach doesn't touch the existing mechanism.
> > > > >>
> > > > >> 3. Non-fatal errors
> > > > >>Before this patch, they are signalled to user the same way as 
> > > > >> fatal ones.
> > > > >>With this patch, a new eventfd is introduced only for non-fatal 
> > > > >> error
> > > > >>notification. By splitting non-fatal ones out, it will benefit AER
> > > > >>recovery of a QEMU guest user.
> > > > >>
> > > > >>To maintain backwards compatibility with userspace, non-fatal 
> > > > >> errors
> > > > >>will continue to trigger via the existing error interrupt index 
> > > > >> if a
> > > > >>non-fatal signaling mechanism has not been registered.
> > > > >>
> > > > >>Note:
> > > > >>In case of PCI Express errors, kernel might request a slot reset
> > > > >>affecting our device (from our point of view this is a passive 
> > > > >> device
> > > > >>reset as opposed to an active one requested by vfio itself).
> > > > >>This might currently happen if a slot reset is requested by a 
> > > > >> driver
> > > > >>(other than vfio) bound to another device function in the same 
> > > > >> slot.
> > > > >>This will cause our device to lose its state so report this event 
> > > > >> to
> > > > >>userspace.
> > > > > 
> > > > > I tried to convey this in my last comments, I don't think this is an
> > > > > appropriate commit log.  Lead with what is the problem you're trying 
> > > > > to
> > > > > fix and why, what is the benefit to the user, and how is the change
> > > > > accomplished.  If you want to provide a State of Error Handling in
> > > > > VFIO, append it after the main points of the commit log.
> > > > 
> > > > ok.
> > > >   
> > > > > 
> > > > > I also asked in my previous comments to provide examples of errors 
> > > > > that
> > > > > might trigger correctable errors to the user, this comment seems to
> > > > > have been missed.  In my experience, AERs generated during device
> > > > > assignment are generally hardware faults or induced by bad guest
> > > > > drivers.  These are cases where a single fatal error is an appropriate
> > > > > and sufficient response.  We've scaled back this support to the point
> > > > > where we're only improving the situation of correctable errors and I'm
> > > > > not convinced this is worthwhile and we're not simply checking a box 
> > > > > on
> > > > > an ill-conceived marketing requirements document.
> > > > 
> > > > Sorry. I noticed that question: "what actual errors do we expect
> > > > userspace to see as non-fatal errors?", but I am confused about it.
> > > > Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
> > > > and Uncorrectable Error Severity Register 

Re: [PATCH v6] vfio error recovery: kernel support

2017-03-28 Thread Alex Williamson
On Wed, 29 Mar 2017 03:01:48 +0300
"Michael S. Tsirkin"  wrote:

> On Tue, Mar 28, 2017 at 10:12:33AM -0600, Alex Williamson wrote:
> > On Tue, 28 Mar 2017 21:47:00 +0800
> > Cao jin  wrote:
> >   
> > > On 03/25/2017 06:12 AM, Alex Williamson wrote:  
> > > > On Thu, 23 Mar 2017 17:07:31 +0800
> > > > Cao jin  wrote:
> > > > 
> > > > A more appropriate patch subject would be:
> > > > 
> > > > vfio-pci: Report correctable errors and slot reset events to user
> > > >
> > > 
> > > Correctable? It is confusing to me. Correctable error has its clear
> > > definition in PCIe spec, shouldn't it be "non-fatal"?  
> > 
> > My mistake, non-fatal.
> >
> > > >> From: "Michael S. Tsirkin" 
> > > > 
> > > > This hardly seems accurate anymore.  You could say Suggested-by and let
> > > > Michael add a sign-off, but it's changed since he sent it.
> > > > 
> > > >>
> > > >> 0. What happens now (PCIE AER only)
> > > >>Fatal errors cause a link reset. Non fatal errors don't.
> > > >>All errors stop the QEMU guest eventually, but not immediately,
> > > >>because it's detected and reported asynchronously.
> > > >>Interrupts are forwarded as usual.
> > > >>Correctable errors are not reported to user at all.
> > > >>
> > > >>Note:
> > > >>PPC EEH is different, but this approach won't affect EEH. EEH treat
> > > >>all errors as fatal ones in AER, so they will still be signalled to 
> > > >> user
> > > >>via the legacy eventfd.  Besides, all devices/functions in a PE 
> > > >> belongs
> > > >>to the same IOMMU group, so the slot_reset handler in this approach
> > > >>won't affect EEH either.
> > > >>
> > > >> 1. Correctable errors
> > > >>Hardware can correct these errors without software intervention,
> > > >>clear the error status is enough, this is what already done now.
> > > >>No need to recover it, nothing changed, leave it as it is.
> > > >>
> > > >> 2. Fatal errors
> > > >>They will induce a link reset. This is troublesome when user is
> > > >>a QEMU guest. This approach doesn't touch the existing mechanism.
> > > >>
> > > >> 3. Non-fatal errors
> > > >>Before this patch, they are signalled to user the same way as fatal 
> > > >> ones.
> > > >>With this patch, a new eventfd is introduced only for non-fatal 
> > > >> error
> > > >>notification. By splitting non-fatal ones out, it will benefit AER
> > > >>recovery of a QEMU guest user.
> > > >>
> > > >>To maintain backwards compatibility with userspace, non-fatal errors
> > > >>will continue to trigger via the existing error interrupt index if a
> > > >>non-fatal signaling mechanism has not been registered.
> > > >>
> > > >>Note:
> > > >>In case of PCI Express errors, kernel might request a slot reset
> > > >>affecting our device (from our point of view this is a passive 
> > > >> device
> > > >>reset as opposed to an active one requested by vfio itself).
> > > >>This might currently happen if a slot reset is requested by a driver
> > > >>(other than vfio) bound to another device function in the same slot.
> > > >>This will cause our device to lose its state so report this event to
> > > >>userspace.
> > > > 
> > > > I tried to convey this in my last comments, I don't think this is an
> > > > appropriate commit log.  Lead with what is the problem you're trying to
> > > > fix and why, what is the benefit to the user, and how is the change
> > > > accomplished.  If you want to provide a State of Error Handling in
> > > > VFIO, append it after the main points of the commit log.
> > > 
> > > ok.
> > >   
> > > > 
> > > > I also asked in my previous comments to provide examples of errors that
> > > > might trigger correctable errors to the user, this comment seems to
> > > > have been missed.  In my experience, AERs generated during device
> > > > assignment are generally hardware faults or induced by bad guest
> > > > drivers.  These are cases where a single fatal error is an appropriate
> > > > and sufficient response.  We've scaled back this support to the point
> > > > where we're only improving the situation of correctable errors and I'm
> > > > not convinced this is worthwhile and we're not simply checking a box on
> > > > an ill-conceived marketing requirements document.
> > > 
> > > Sorry. I noticed that question: "what actual errors do we expect
> > > userspace to see as non-fatal errors?", but I am confused about it.
> > > Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
> > > and Uncorrectable Error Severity Register will tell which is fatal, and
> > > which is non-fatal, this register is configurable, they are device
> > > specific as I guess. AER core driver distinguish them by
> > > pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand your
> > > question. Or
> > > 
> > > Or, Do you mean we could 

Re: [PATCH v6] vfio error recovery: kernel support

2017-03-28 Thread Alex Williamson
On Wed, 29 Mar 2017 03:01:48 +0300
"Michael S. Tsirkin"  wrote:

> On Tue, Mar 28, 2017 at 10:12:33AM -0600, Alex Williamson wrote:
> > On Tue, 28 Mar 2017 21:47:00 +0800
> > Cao jin  wrote:
> >   
> > > On 03/25/2017 06:12 AM, Alex Williamson wrote:  
> > > > On Thu, 23 Mar 2017 17:07:31 +0800
> > > > Cao jin  wrote:
> > > > 
> > > > A more appropriate patch subject would be:
> > > > 
> > > > vfio-pci: Report correctable errors and slot reset events to user
> > > >
> > > 
> > > Correctable? It is confusing to me. Correctable error has its clear
> > > definition in PCIe spec, shouldn't it be "non-fatal"?  
> > 
> > My mistake, non-fatal.
> >
> > > >> From: "Michael S. Tsirkin" 
> > > > 
> > > > This hardly seems accurate anymore.  You could say Suggested-by and let
> > > > Michael add a sign-off, but it's changed since he sent it.
> > > > 
> > > >>
> > > >> 0. What happens now (PCIE AER only)
> > > >>Fatal errors cause a link reset. Non fatal errors don't.
> > > >>All errors stop the QEMU guest eventually, but not immediately,
> > > >>because it's detected and reported asynchronously.
> > > >>Interrupts are forwarded as usual.
> > > >>Correctable errors are not reported to user at all.
> > > >>
> > > >>Note:
> > > >>PPC EEH is different, but this approach won't affect EEH. EEH treat
> > > >>all errors as fatal ones in AER, so they will still be signalled to 
> > > >> user
> > > >>via the legacy eventfd.  Besides, all devices/functions in a PE 
> > > >> belongs
> > > >>to the same IOMMU group, so the slot_reset handler in this approach
> > > >>won't affect EEH either.
> > > >>
> > > >> 1. Correctable errors
> > > >>Hardware can correct these errors without software intervention,
> > > >>clear the error status is enough, this is what already done now.
> > > >>No need to recover it, nothing changed, leave it as it is.
> > > >>
> > > >> 2. Fatal errors
> > > >>They will induce a link reset. This is troublesome when user is
> > > >>a QEMU guest. This approach doesn't touch the existing mechanism.
> > > >>
> > > >> 3. Non-fatal errors
> > > >>Before this patch, they are signalled to user the same way as fatal 
> > > >> ones.
> > > >>With this patch, a new eventfd is introduced only for non-fatal 
> > > >> error
> > > >>notification. By splitting non-fatal ones out, it will benefit AER
> > > >>recovery of a QEMU guest user.
> > > >>
> > > >>To maintain backwards compatibility with userspace, non-fatal errors
> > > >>will continue to trigger via the existing error interrupt index if a
> > > >>non-fatal signaling mechanism has not been registered.
> > > >>
> > > >>Note:
> > > >>In case of PCI Express errors, kernel might request a slot reset
> > > >>affecting our device (from our point of view this is a passive 
> > > >> device
> > > >>reset as opposed to an active one requested by vfio itself).
> > > >>This might currently happen if a slot reset is requested by a driver
> > > >>(other than vfio) bound to another device function in the same slot.
> > > >>This will cause our device to lose its state so report this event to
> > > >>userspace.
> > > > 
> > > > I tried to convey this in my last comments, I don't think this is an
> > > > appropriate commit log.  Lead with what is the problem you're trying to
> > > > fix and why, what is the benefit to the user, and how is the change
> > > > accomplished.  If you want to provide a State of Error Handling in
> > > > VFIO, append it after the main points of the commit log.
> > > 
> > > ok.
> > >   
> > > > 
> > > > I also asked in my previous comments to provide examples of errors that
> > > > might trigger correctable errors to the user, this comment seems to
> > > > have been missed.  In my experience, AERs generated during device
> > > > assignment are generally hardware faults or induced by bad guest
> > > > drivers.  These are cases where a single fatal error is an appropriate
> > > > and sufficient response.  We've scaled back this support to the point
> > > > where we're only improving the situation of correctable errors and I'm
> > > > not convinced this is worthwhile and we're not simply checking a box on
> > > > an ill-conceived marketing requirements document.
> > > 
> > > Sorry. I noticed that question: "what actual errors do we expect
> > > userspace to see as non-fatal errors?", but I am confused about it.
> > > Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
> > > and Uncorrectable Error Severity Register will tell which is fatal, and
> > > which is non-fatal, this register is configurable, they are device
> > > specific as I guess. AER core driver distinguish them by
> > > pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand your
> > > question. Or
> > > 
> > > Or, Do you mean we could list the default non-fatal error of
> > > Uncorrectable Error Severity Register which 

Re: [PATCH v6] vfio error recovery: kernel support

2017-03-28 Thread Michael S. Tsirkin
On Tue, Mar 28, 2017 at 10:12:33AM -0600, Alex Williamson wrote:
> On Tue, 28 Mar 2017 21:47:00 +0800
> Cao jin  wrote:
> 
> > On 03/25/2017 06:12 AM, Alex Williamson wrote:
> > > On Thu, 23 Mar 2017 17:07:31 +0800
> > > Cao jin  wrote:
> > > 
> > > A more appropriate patch subject would be:
> > > 
> > > vfio-pci: Report correctable errors and slot reset events to user
> > >  
> > 
> > Correctable? It is confusing to me. Correctable error has its clear
> > definition in PCIe spec, shouldn't it be "non-fatal"?
> 
> My mistake, non-fatal.
>  
> > >> From: "Michael S. Tsirkin"   
> > > 
> > > This hardly seems accurate anymore.  You could say Suggested-by and let
> > > Michael add a sign-off, but it's changed since he sent it.
> > >   
> > >>
> > >> 0. What happens now (PCIE AER only)
> > >>Fatal errors cause a link reset. Non fatal errors don't.
> > >>All errors stop the QEMU guest eventually, but not immediately,
> > >>because it's detected and reported asynchronously.
> > >>Interrupts are forwarded as usual.
> > >>Correctable errors are not reported to user at all.
> > >>
> > >>Note:
> > >>PPC EEH is different, but this approach won't affect EEH. EEH treat
> > >>all errors as fatal ones in AER, so they will still be signalled to 
> > >> user
> > >>via the legacy eventfd.  Besides, all devices/functions in a PE 
> > >> belongs
> > >>to the same IOMMU group, so the slot_reset handler in this approach
> > >>won't affect EEH either.
> > >>
> > >> 1. Correctable errors
> > >>Hardware can correct these errors without software intervention,
> > >>clear the error status is enough, this is what already done now.
> > >>No need to recover it, nothing changed, leave it as it is.
> > >>
> > >> 2. Fatal errors
> > >>They will induce a link reset. This is troublesome when user is
> > >>a QEMU guest. This approach doesn't touch the existing mechanism.
> > >>
> > >> 3. Non-fatal errors
> > >>Before this patch, they are signalled to user the same way as fatal 
> > >> ones.
> > >>With this patch, a new eventfd is introduced only for non-fatal error
> > >>notification. By splitting non-fatal ones out, it will benefit AER
> > >>recovery of a QEMU guest user.
> > >>
> > >>To maintain backwards compatibility with userspace, non-fatal errors
> > >>will continue to trigger via the existing error interrupt index if a
> > >>non-fatal signaling mechanism has not been registered.
> > >>
> > >>Note:
> > >>In case of PCI Express errors, kernel might request a slot reset
> > >>affecting our device (from our point of view this is a passive device
> > >>reset as opposed to an active one requested by vfio itself).
> > >>This might currently happen if a slot reset is requested by a driver
> > >>(other than vfio) bound to another device function in the same slot.
> > >>This will cause our device to lose its state so report this event to
> > >>userspace.  
> > > 
> > > I tried to convey this in my last comments, I don't think this is an
> > > appropriate commit log.  Lead with what is the problem you're trying to
> > > fix and why, what is the benefit to the user, and how is the change
> > > accomplished.  If you want to provide a State of Error Handling in
> > > VFIO, append it after the main points of the commit log.  
> > 
> > ok.
> > 
> > > 
> > > I also asked in my previous comments to provide examples of errors that
> > > might trigger correctable errors to the user, this comment seems to
> > > have been missed.  In my experience, AERs generated during device
> > > assignment are generally hardware faults or induced by bad guest
> > > drivers.  These are cases where a single fatal error is an appropriate
> > > and sufficient response.  We've scaled back this support to the point
> > > where we're only improving the situation of correctable errors and I'm
> > > not convinced this is worthwhile and we're not simply checking a box on
> > > an ill-conceived marketing requirements document.  
> > 
> > Sorry. I noticed that question: "what actual errors do we expect
> > userspace to see as non-fatal errors?", but I am confused about it.
> > Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
> > and Uncorrectable Error Severity Register will tell which is fatal, and
> > which is non-fatal, this register is configurable, they are device
> > specific as I guess. AER core driver distinguish them by
> > pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand your
> > question. Or
> > 
> > Or, Do you mean we could list the default non-fatal error of
> > Uncorrectable Error Severity Register which is provided by PCIe spec?
> 
> I'm trying to ask why is this patch series useful.  It's clearly
> possible for us to signal non-fatal errors for a device to a guest, but
> why is it necessarily a good idea to do so?  What additional 

Re: [PATCH v6] vfio error recovery: kernel support

2017-03-28 Thread Michael S. Tsirkin
On Tue, Mar 28, 2017 at 10:12:33AM -0600, Alex Williamson wrote:
> On Tue, 28 Mar 2017 21:47:00 +0800
> Cao jin  wrote:
> 
> > On 03/25/2017 06:12 AM, Alex Williamson wrote:
> > > On Thu, 23 Mar 2017 17:07:31 +0800
> > > Cao jin  wrote:
> > > 
> > > A more appropriate patch subject would be:
> > > 
> > > vfio-pci: Report correctable errors and slot reset events to user
> > >  
> > 
> > Correctable? It is confusing to me. Correctable error has its clear
> > definition in PCIe spec, shouldn't it be "non-fatal"?
> 
> My mistake, non-fatal.
>  
> > >> From: "Michael S. Tsirkin"   
> > > 
> > > This hardly seems accurate anymore.  You could say Suggested-by and let
> > > Michael add a sign-off, but it's changed since he sent it.
> > >   
> > >>
> > >> 0. What happens now (PCIE AER only)
> > >>Fatal errors cause a link reset. Non fatal errors don't.
> > >>All errors stop the QEMU guest eventually, but not immediately,
> > >>because it's detected and reported asynchronously.
> > >>Interrupts are forwarded as usual.
> > >>Correctable errors are not reported to user at all.
> > >>
> > >>Note:
> > >>PPC EEH is different, but this approach won't affect EEH. EEH treat
> > >>all errors as fatal ones in AER, so they will still be signalled to 
> > >> user
> > >>via the legacy eventfd.  Besides, all devices/functions in a PE 
> > >> belongs
> > >>to the same IOMMU group, so the slot_reset handler in this approach
> > >>won't affect EEH either.
> > >>
> > >> 1. Correctable errors
> > >>Hardware can correct these errors without software intervention,
> > >>clear the error status is enough, this is what already done now.
> > >>No need to recover it, nothing changed, leave it as it is.
> > >>
> > >> 2. Fatal errors
> > >>They will induce a link reset. This is troublesome when user is
> > >>a QEMU guest. This approach doesn't touch the existing mechanism.
> > >>
> > >> 3. Non-fatal errors
> > >>Before this patch, they are signalled to user the same way as fatal 
> > >> ones.
> > >>With this patch, a new eventfd is introduced only for non-fatal error
> > >>notification. By splitting non-fatal ones out, it will benefit AER
> > >>recovery of a QEMU guest user.
> > >>
> > >>To maintain backwards compatibility with userspace, non-fatal errors
> > >>will continue to trigger via the existing error interrupt index if a
> > >>non-fatal signaling mechanism has not been registered.
> > >>
> > >>Note:
> > >>In case of PCI Express errors, kernel might request a slot reset
> > >>affecting our device (from our point of view this is a passive device
> > >>reset as opposed to an active one requested by vfio itself).
> > >>This might currently happen if a slot reset is requested by a driver
> > >>(other than vfio) bound to another device function in the same slot.
> > >>This will cause our device to lose its state so report this event to
> > >>userspace.  
> > > 
> > > I tried to convey this in my last comments, I don't think this is an
> > > appropriate commit log.  Lead with what is the problem you're trying to
> > > fix and why, what is the benefit to the user, and how is the change
> > > accomplished.  If you want to provide a State of Error Handling in
> > > VFIO, append it after the main points of the commit log.  
> > 
> > ok.
> > 
> > > 
> > > I also asked in my previous comments to provide examples of errors that
> > > might trigger correctable errors to the user, this comment seems to
> > > have been missed.  In my experience, AERs generated during device
> > > assignment are generally hardware faults or induced by bad guest
> > > drivers.  These are cases where a single fatal error is an appropriate
> > > and sufficient response.  We've scaled back this support to the point
> > > where we're only improving the situation of correctable errors and I'm
> > > not convinced this is worthwhile and we're not simply checking a box on
> > > an ill-conceived marketing requirements document.  
> > 
> > Sorry. I noticed that question: "what actual errors do we expect
> > userspace to see as non-fatal errors?", but I am confused about it.
> > Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
> > and Uncorrectable Error Severity Register will tell which is fatal, and
> > which is non-fatal, this register is configurable, they are device
> > specific as I guess. AER core driver distinguish them by
> > pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand your
> > question. Or
> > 
> > Or, Do you mean we could list the default non-fatal error of
> > Uncorrectable Error Severity Register which is provided by PCIe spec?
> 
> I'm trying to ask why is this patch series useful.  It's clearly
> possible for us to signal non-fatal errors for a device to a guest, but
> why is it necessarily a good idea to do so?  What additional RAS
> feature is gained by this?  Can we give a single example of a 

Re: [PATCH v6] vfio error recovery: kernel support

2017-03-28 Thread Alex Williamson
On Tue, 28 Mar 2017 21:47:00 +0800
Cao jin  wrote:

> On 03/25/2017 06:12 AM, Alex Williamson wrote:
> > On Thu, 23 Mar 2017 17:07:31 +0800
> > Cao jin  wrote:
> > 
> > A more appropriate patch subject would be:
> > 
> > vfio-pci: Report correctable errors and slot reset events to user
> >  
> 
> Correctable? It is confusing to me. Correctable error has its clear
> definition in PCIe spec, shouldn't it be "non-fatal"?

My mistake, non-fatal.
 
> >> From: "Michael S. Tsirkin"   
> > 
> > This hardly seems accurate anymore.  You could say Suggested-by and let
> > Michael add a sign-off, but it's changed since he sent it.
> >   
> >>
> >> 0. What happens now (PCIE AER only)
> >>Fatal errors cause a link reset. Non fatal errors don't.
> >>All errors stop the QEMU guest eventually, but not immediately,
> >>because it's detected and reported asynchronously.
> >>Interrupts are forwarded as usual.
> >>Correctable errors are not reported to user at all.
> >>
> >>Note:
> >>PPC EEH is different, but this approach won't affect EEH. EEH treat
> >>all errors as fatal ones in AER, so they will still be signalled to user
> >>via the legacy eventfd.  Besides, all devices/functions in a PE belongs
> >>to the same IOMMU group, so the slot_reset handler in this approach
> >>won't affect EEH either.
> >>
> >> 1. Correctable errors
> >>Hardware can correct these errors without software intervention,
> >>clear the error status is enough, this is what already done now.
> >>No need to recover it, nothing changed, leave it as it is.
> >>
> >> 2. Fatal errors
> >>They will induce a link reset. This is troublesome when user is
> >>a QEMU guest. This approach doesn't touch the existing mechanism.
> >>
> >> 3. Non-fatal errors
> >>Before this patch, they are signalled to user the same way as fatal 
> >> ones.
> >>With this patch, a new eventfd is introduced only for non-fatal error
> >>notification. By splitting non-fatal ones out, it will benefit AER
> >>recovery of a QEMU guest user.
> >>
> >>To maintain backwards compatibility with userspace, non-fatal errors
> >>will continue to trigger via the existing error interrupt index if a
> >>non-fatal signaling mechanism has not been registered.
> >>
> >>Note:
> >>In case of PCI Express errors, kernel might request a slot reset
> >>affecting our device (from our point of view this is a passive device
> >>reset as opposed to an active one requested by vfio itself).
> >>This might currently happen if a slot reset is requested by a driver
> >>(other than vfio) bound to another device function in the same slot.
> >>This will cause our device to lose its state so report this event to
> >>userspace.  
> > 
> > I tried to convey this in my last comments, I don't think this is an
> > appropriate commit log.  Lead with what is the problem you're trying to
> > fix and why, what is the benefit to the user, and how is the change
> > accomplished.  If you want to provide a State of Error Handling in
> > VFIO, append it after the main points of the commit log.  
> 
> ok.
> 
> > 
> > I also asked in my previous comments to provide examples of errors that
> > might trigger correctable errors to the user, this comment seems to
> > have been missed.  In my experience, AERs generated during device
> > assignment are generally hardware faults or induced by bad guest
> > drivers.  These are cases where a single fatal error is an appropriate
> > and sufficient response.  We've scaled back this support to the point
> > where we're only improving the situation of correctable errors and I'm
> > not convinced this is worthwhile and we're not simply checking a box on
> > an ill-conceived marketing requirements document.  
> 
> Sorry. I noticed that question: "what actual errors do we expect
> userspace to see as non-fatal errors?", but I am confused about it.
> Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
> and Uncorrectable Error Severity Register will tell which is fatal, and
> which is non-fatal, this register is configurable, they are device
> specific as I guess. AER core driver distinguish them by
> pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand your
> question. Or
> 
> Or, Do you mean we could list the default non-fatal error of
> Uncorrectable Error Severity Register which is provided by PCIe spec?

I'm trying to ask why is this patch series useful.  It's clearly
possible for us to signal non-fatal errors for a device to a guest, but
why is it necessarily a good idea to do so?  What additional RAS
feature is gained by this?  Can we give a single example of a real
world scenario where a guest has been shutdown due to a non-fatal error
that the guest driver would have handled?

> > I had also commented asking how the hypervisor is expected to know
> > whether the guest 

Re: [PATCH v6] vfio error recovery: kernel support

2017-03-28 Thread Alex Williamson
On Tue, 28 Mar 2017 21:47:00 +0800
Cao jin  wrote:

> On 03/25/2017 06:12 AM, Alex Williamson wrote:
> > On Thu, 23 Mar 2017 17:07:31 +0800
> > Cao jin  wrote:
> > 
> > A more appropriate patch subject would be:
> > 
> > vfio-pci: Report correctable errors and slot reset events to user
> >  
> 
> Correctable? It is confusing to me. Correctable error has its clear
> definition in PCIe spec, shouldn't it be "non-fatal"?

My mistake, non-fatal.
 
> >> From: "Michael S. Tsirkin"   
> > 
> > This hardly seems accurate anymore.  You could say Suggested-by and let
> > Michael add a sign-off, but it's changed since he sent it.
> >   
> >>
> >> 0. What happens now (PCIE AER only)
> >>Fatal errors cause a link reset. Non fatal errors don't.
> >>All errors stop the QEMU guest eventually, but not immediately,
> >>because it's detected and reported asynchronously.
> >>Interrupts are forwarded as usual.
> >>Correctable errors are not reported to user at all.
> >>
> >>Note:
> >>PPC EEH is different, but this approach won't affect EEH. EEH treat
> >>all errors as fatal ones in AER, so they will still be signalled to user
> >>via the legacy eventfd.  Besides, all devices/functions in a PE belongs
> >>to the same IOMMU group, so the slot_reset handler in this approach
> >>won't affect EEH either.
> >>
> >> 1. Correctable errors
> >>Hardware can correct these errors without software intervention,
> >>clear the error status is enough, this is what already done now.
> >>No need to recover it, nothing changed, leave it as it is.
> >>
> >> 2. Fatal errors
> >>They will induce a link reset. This is troublesome when user is
> >>a QEMU guest. This approach doesn't touch the existing mechanism.
> >>
> >> 3. Non-fatal errors
> >>Before this patch, they are signalled to user the same way as fatal 
> >> ones.
> >>With this patch, a new eventfd is introduced only for non-fatal error
> >>notification. By splitting non-fatal ones out, it will benefit AER
> >>recovery of a QEMU guest user.
> >>
> >>To maintain backwards compatibility with userspace, non-fatal errors
> >>will continue to trigger via the existing error interrupt index if a
> >>non-fatal signaling mechanism has not been registered.
> >>
> >>Note:
> >>In case of PCI Express errors, kernel might request a slot reset
> >>affecting our device (from our point of view this is a passive device
> >>reset as opposed to an active one requested by vfio itself).
> >>This might currently happen if a slot reset is requested by a driver
> >>(other than vfio) bound to another device function in the same slot.
> >>This will cause our device to lose its state so report this event to
> >>userspace.  
> > 
> > I tried to convey this in my last comments, I don't think this is an
> > appropriate commit log.  Lead with what is the problem you're trying to
> > fix and why, what is the benefit to the user, and how is the change
> > accomplished.  If you want to provide a State of Error Handling in
> > VFIO, append it after the main points of the commit log.  
> 
> ok.
> 
> > 
> > I also asked in my previous comments to provide examples of errors that
> > might trigger correctable errors to the user, this comment seems to
> > have been missed.  In my experience, AERs generated during device
> > assignment are generally hardware faults or induced by bad guest
> > drivers.  These are cases where a single fatal error is an appropriate
> > and sufficient response.  We've scaled back this support to the point
> > where we're only improving the situation of correctable errors and I'm
> > not convinced this is worthwhile and we're not simply checking a box on
> > an ill-conceived marketing requirements document.  
> 
> Sorry. I noticed that question: "what actual errors do we expect
> userspace to see as non-fatal errors?", but I am confused about it.
> Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
> and Uncorrectable Error Severity Register will tell which is fatal, and
> which is non-fatal, this register is configurable, they are device
> specific as I guess. AER core driver distinguish them by
> pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand your
> question. Or
> 
> Or, Do you mean we could list the default non-fatal error of
> Uncorrectable Error Severity Register which is provided by PCIe spec?

I'm trying to ask why is this patch series useful.  It's clearly
possible for us to signal non-fatal errors for a device to a guest, but
why is it necessarily a good idea to do so?  What additional RAS
feature is gained by this?  Can we give a single example of a real
world scenario where a guest has been shutdown due to a non-fatal error
that the guest driver would have handled?

> > I had also commented asking how the hypervisor is expected to know
> > whether the guest supports AER.  With the existing support of a single
> > fatal error, the 

Re: [PATCH v6] vfio error recovery: kernel support

2017-03-28 Thread Cao jin


On 03/25/2017 06:12 AM, Alex Williamson wrote:
> On Thu, 23 Mar 2017 17:07:31 +0800
> Cao jin  wrote:
> 
> A more appropriate patch subject would be:
> 
> vfio-pci: Report correctable errors and slot reset events to user
>

Correctable? It is confusing to me. Correctable error has its clear
definition in PCIe spec, shouldn't it be "non-fatal"?

>> From: "Michael S. Tsirkin" 
> 
> This hardly seems accurate anymore.  You could say Suggested-by and let
> Michael add a sign-off, but it's changed since he sent it.
> 
>>
>> 0. What happens now (PCIE AER only)
>>Fatal errors cause a link reset. Non fatal errors don't.
>>All errors stop the QEMU guest eventually, but not immediately,
>>because it's detected and reported asynchronously.
>>Interrupts are forwarded as usual.
>>Correctable errors are not reported to user at all.
>>
>>Note:
>>PPC EEH is different, but this approach won't affect EEH. EEH treat
>>all errors as fatal ones in AER, so they will still be signalled to user
>>via the legacy eventfd.  Besides, all devices/functions in a PE belongs
>>to the same IOMMU group, so the slot_reset handler in this approach
>>won't affect EEH either.
>>
>> 1. Correctable errors
>>Hardware can correct these errors without software intervention,
>>clear the error status is enough, this is what already done now.
>>No need to recover it, nothing changed, leave it as it is.
>>
>> 2. Fatal errors
>>They will induce a link reset. This is troublesome when user is
>>a QEMU guest. This approach doesn't touch the existing mechanism.
>>
>> 3. Non-fatal errors
>>Before this patch, they are signalled to user the same way as fatal ones.
>>With this patch, a new eventfd is introduced only for non-fatal error
>>notification. By splitting non-fatal ones out, it will benefit AER
>>recovery of a QEMU guest user.
>>
>>To maintain backwards compatibility with userspace, non-fatal errors
>>will continue to trigger via the existing error interrupt index if a
>>non-fatal signaling mechanism has not been registered.
>>
>>Note:
>>In case of PCI Express errors, kernel might request a slot reset
>>affecting our device (from our point of view this is a passive device
>>reset as opposed to an active one requested by vfio itself).
>>This might currently happen if a slot reset is requested by a driver
>>(other than vfio) bound to another device function in the same slot.
>>This will cause our device to lose its state so report this event to
>>userspace.
> 
> I tried to convey this in my last comments, I don't think this is an
> appropriate commit log.  Lead with what is the problem you're trying to
> fix and why, what is the benefit to the user, and how is the change
> accomplished.  If you want to provide a State of Error Handling in
> VFIO, append it after the main points of the commit log.

ok.

> 
> I also asked in my previous comments to provide examples of errors that
> might trigger correctable errors to the user, this comment seems to
> have been missed.  In my experience, AERs generated during device
> assignment are generally hardware faults or induced by bad guest
> drivers.  These are cases where a single fatal error is an appropriate
> and sufficient response.  We've scaled back this support to the point
> where we're only improving the situation of correctable errors and I'm
> not convinced this is worthwhile and we're not simply checking a box on
> an ill-conceived marketing requirements document.

Sorry. I noticed that question: "what actual errors do we expect
userspace to see as non-fatal errors?", but I am confused about it.
Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
and Uncorrectable Error Severity Register will tell which is fatal, and
which is non-fatal, this register is configurable, they are device
specific as I guess. AER core driver distinguish them by
pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand your
question. Or

Or, Do you mean we could list the default non-fatal error of
Uncorrectable Error Severity Register which is provided by PCIe spec?

> 
> I had also commented asking how the hypervisor is expected to know
> whether the guest supports AER.  With the existing support of a single
> fatal error, the hypervisor halts the VM regardless of the error
> severity or guest support.  Now we have the opportunity that the
> hypervisor can forward a correctable error to the guest... and hope the
> right thing occurs?  I never saw any response to this comment.
> 

I noticed this question, you said: "That doesn't imply a problem with
this approach, the user (hypervisor) would be at fault for any
difference in handling in that case.". Maybe I understand you wrong.

>From my limit understanding, QEMU doesn't has a way to know whether a
guest has AER support, AER support need several kbuild configuration, I
don't know 

Re: [PATCH v6] vfio error recovery: kernel support

2017-03-28 Thread Cao jin


On 03/25/2017 06:12 AM, Alex Williamson wrote:
> On Thu, 23 Mar 2017 17:07:31 +0800
> Cao jin  wrote:
> 
> A more appropriate patch subject would be:
> 
> vfio-pci: Report correctable errors and slot reset events to user
>

Correctable? It is confusing to me. Correctable error has its clear
definition in PCIe spec, shouldn't it be "non-fatal"?

>> From: "Michael S. Tsirkin" 
> 
> This hardly seems accurate anymore.  You could say Suggested-by and let
> Michael add a sign-off, but it's changed since he sent it.
> 
>>
>> 0. What happens now (PCIE AER only)
>>Fatal errors cause a link reset. Non fatal errors don't.
>>All errors stop the QEMU guest eventually, but not immediately,
>>because it's detected and reported asynchronously.
>>Interrupts are forwarded as usual.
>>Correctable errors are not reported to user at all.
>>
>>Note:
>>PPC EEH is different, but this approach won't affect EEH. EEH treat
>>all errors as fatal ones in AER, so they will still be signalled to user
>>via the legacy eventfd.  Besides, all devices/functions in a PE belongs
>>to the same IOMMU group, so the slot_reset handler in this approach
>>won't affect EEH either.
>>
>> 1. Correctable errors
>>Hardware can correct these errors without software intervention,
>>clear the error status is enough, this is what already done now.
>>No need to recover it, nothing changed, leave it as it is.
>>
>> 2. Fatal errors
>>They will induce a link reset. This is troublesome when user is
>>a QEMU guest. This approach doesn't touch the existing mechanism.
>>
>> 3. Non-fatal errors
>>Before this patch, they are signalled to user the same way as fatal ones.
>>With this patch, a new eventfd is introduced only for non-fatal error
>>notification. By splitting non-fatal ones out, it will benefit AER
>>recovery of a QEMU guest user.
>>
>>To maintain backwards compatibility with userspace, non-fatal errors
>>will continue to trigger via the existing error interrupt index if a
>>non-fatal signaling mechanism has not been registered.
>>
>>Note:
>>In case of PCI Express errors, kernel might request a slot reset
>>affecting our device (from our point of view this is a passive device
>>reset as opposed to an active one requested by vfio itself).
>>This might currently happen if a slot reset is requested by a driver
>>(other than vfio) bound to another device function in the same slot.
>>This will cause our device to lose its state so report this event to
>>userspace.
> 
> I tried to convey this in my last comments, I don't think this is an
> appropriate commit log.  Lead with what is the problem you're trying to
> fix and why, what is the benefit to the user, and how is the change
> accomplished.  If you want to provide a State of Error Handling in
> VFIO, append it after the main points of the commit log.

ok.

> 
> I also asked in my previous comments to provide examples of errors that
> might trigger correctable errors to the user, this comment seems to
> have been missed.  In my experience, AERs generated during device
> assignment are generally hardware faults or induced by bad guest
> drivers.  These are cases where a single fatal error is an appropriate
> and sufficient response.  We've scaled back this support to the point
> where we're only improving the situation of correctable errors and I'm
> not convinced this is worthwhile and we're not simply checking a box on
> an ill-conceived marketing requirements document.

Sorry. I noticed that question: "what actual errors do we expect
userspace to see as non-fatal errors?", but I am confused about it.
Correctable, non-fatal, fatal errors are clearly defined in PCIe spec,
and Uncorrectable Error Severity Register will tell which is fatal, and
which is non-fatal, this register is configurable, they are device
specific as I guess. AER core driver distinguish them by
pci_channel_io_normal/pci_channel_io_frozen,  So I don't understand your
question. Or

Or, Do you mean we could list the default non-fatal error of
Uncorrectable Error Severity Register which is provided by PCIe spec?

> 
> I had also commented asking how the hypervisor is expected to know
> whether the guest supports AER.  With the existing support of a single
> fatal error, the hypervisor halts the VM regardless of the error
> severity or guest support.  Now we have the opportunity that the
> hypervisor can forward a correctable error to the guest... and hope the
> right thing occurs?  I never saw any response to this comment.
> 

I noticed this question, you said: "That doesn't imply a problem with
this approach, the user (hypervisor) would be at fault for any
difference in handling in that case.". Maybe I understand you wrong.

>From my limit understanding, QEMU doesn't has a way to know whether a
guest has AER support, AER support need several kbuild configuration, I
don't know how qemu is expected to know these.

>>
>> 

Re: [PATCH v6] vfio error recovery: kernel support

2017-03-24 Thread Alex Williamson
On Thu, 23 Mar 2017 17:07:31 +0800
Cao jin  wrote:

A more appropriate patch subject would be:

vfio-pci: Report correctable errors and slot reset events to user

> From: "Michael S. Tsirkin" 

This hardly seems accurate anymore.  You could say Suggested-by and let
Michael add a sign-off, but it's changed since he sent it.

> 
> 0. What happens now (PCIE AER only)
>Fatal errors cause a link reset. Non fatal errors don't.
>All errors stop the QEMU guest eventually, but not immediately,
>because it's detected and reported asynchronously.
>Interrupts are forwarded as usual.
>Correctable errors are not reported to user at all.
> 
>Note:
>PPC EEH is different, but this approach won't affect EEH. EEH treat
>all errors as fatal ones in AER, so they will still be signalled to user
>via the legacy eventfd.  Besides, all devices/functions in a PE belongs
>to the same IOMMU group, so the slot_reset handler in this approach
>won't affect EEH either.
> 
> 1. Correctable errors
>Hardware can correct these errors without software intervention,
>clear the error status is enough, this is what already done now.
>No need to recover it, nothing changed, leave it as it is.
> 
> 2. Fatal errors
>They will induce a link reset. This is troublesome when user is
>a QEMU guest. This approach doesn't touch the existing mechanism.
> 
> 3. Non-fatal errors
>Before this patch, they are signalled to user the same way as fatal ones.
>With this patch, a new eventfd is introduced only for non-fatal error
>notification. By splitting non-fatal ones out, it will benefit AER
>recovery of a QEMU guest user.
> 
>To maintain backwards compatibility with userspace, non-fatal errors
>will continue to trigger via the existing error interrupt index if a
>non-fatal signaling mechanism has not been registered.
> 
>Note:
>In case of PCI Express errors, kernel might request a slot reset
>affecting our device (from our point of view this is a passive device
>reset as opposed to an active one requested by vfio itself).
>This might currently happen if a slot reset is requested by a driver
>(other than vfio) bound to another device function in the same slot.
>This will cause our device to lose its state so report this event to
>userspace.

I tried to convey this in my last comments, I don't think this is an
appropriate commit log.  Lead with what is the problem you're trying to
fix and why, what is the benefit to the user, and how is the change
accomplished.  If you want to provide a State of Error Handling in
VFIO, append it after the main points of the commit log.

I also asked in my previous comments to provide examples of errors that
might trigger correctable errors to the user, this comment seems to
have been missed.  In my experience, AERs generated during device
assignment are generally hardware faults or induced by bad guest
drivers.  These are cases where a single fatal error is an appropriate
and sufficient response.  We've scaled back this support to the point
where we're only improving the situation of correctable errors and I'm
not convinced this is worthwhile and we're not simply checking a box on
an ill-conceived marketing requirements document.

I had also commented asking how the hypervisor is expected to know
whether the guest supports AER.  With the existing support of a single
fatal error, the hypervisor halts the VM regardless of the error
severity or guest support.  Now we have the opportunity that the
hypervisor can forward a correctable error to the guest... and hope the
right thing occurs?  I never saw any response to this comment.

> 
> Signed-off-by: Michael S. Tsirkin 
> Signed-off-by: Cao jin 
> ---
> v6 changelog:
> Address all the comments from MST.
> 
>  drivers/vfio/pci/vfio_pci.c | 49 
> +++--
>  drivers/vfio/pci/vfio_pci_intrs.c   | 38 
>  drivers/vfio/pci/vfio_pci_private.h |  2 ++
>  include/uapi/linux/vfio.h   |  2 ++
>  4 files changed, 89 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 324c52e..71f9a8a 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -441,7 +441,9 @@ static int vfio_pci_get_irq_count(struct vfio_pci_device 
> *vdev, int irq_type)
>  
>   return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
>   }
> - } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
> + } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX ||
> +irq_type == VFIO_PCI_NON_FATAL_ERR_IRQ_INDEX ||
> +irq_type == VFIO_PCI_PASSIVE_RESET_IRQ_INDEX) {

Should we add a typdef to alias VFIO_PCI_ERR_IRQ_INDEX to
VFIO_PCI_FATAL_ERR_IRQ?

>   if (pci_is_pcie(vdev->pdev))
>   return 1;
>  

Re: [PATCH v6] vfio error recovery: kernel support

2017-03-24 Thread Alex Williamson
On Thu, 23 Mar 2017 17:07:31 +0800
Cao jin  wrote:

A more appropriate patch subject would be:

vfio-pci: Report correctable errors and slot reset events to user

> From: "Michael S. Tsirkin" 

This hardly seems accurate anymore.  You could say Suggested-by and let
Michael add a sign-off, but it's changed since he sent it.

> 
> 0. What happens now (PCIE AER only)
>Fatal errors cause a link reset. Non fatal errors don't.
>All errors stop the QEMU guest eventually, but not immediately,
>because it's detected and reported asynchronously.
>Interrupts are forwarded as usual.
>Correctable errors are not reported to user at all.
> 
>Note:
>PPC EEH is different, but this approach won't affect EEH. EEH treat
>all errors as fatal ones in AER, so they will still be signalled to user
>via the legacy eventfd.  Besides, all devices/functions in a PE belongs
>to the same IOMMU group, so the slot_reset handler in this approach
>won't affect EEH either.
> 
> 1. Correctable errors
>Hardware can correct these errors without software intervention,
>clear the error status is enough, this is what already done now.
>No need to recover it, nothing changed, leave it as it is.
> 
> 2. Fatal errors
>They will induce a link reset. This is troublesome when user is
>a QEMU guest. This approach doesn't touch the existing mechanism.
> 
> 3. Non-fatal errors
>Before this patch, they are signalled to user the same way as fatal ones.
>With this patch, a new eventfd is introduced only for non-fatal error
>notification. By splitting non-fatal ones out, it will benefit AER
>recovery of a QEMU guest user.
> 
>To maintain backwards compatibility with userspace, non-fatal errors
>will continue to trigger via the existing error interrupt index if a
>non-fatal signaling mechanism has not been registered.
> 
>Note:
>In case of PCI Express errors, kernel might request a slot reset
>affecting our device (from our point of view this is a passive device
>reset as opposed to an active one requested by vfio itself).
>This might currently happen if a slot reset is requested by a driver
>(other than vfio) bound to another device function in the same slot.
>This will cause our device to lose its state so report this event to
>userspace.

I tried to convey this in my last comments, I don't think this is an
appropriate commit log.  Lead with what is the problem you're trying to
fix and why, what is the benefit to the user, and how is the change
accomplished.  If you want to provide a State of Error Handling in
VFIO, append it after the main points of the commit log.

I also asked in my previous comments to provide examples of errors that
might trigger correctable errors to the user, this comment seems to
have been missed.  In my experience, AERs generated during device
assignment are generally hardware faults or induced by bad guest
drivers.  These are cases where a single fatal error is an appropriate
and sufficient response.  We've scaled back this support to the point
where we're only improving the situation of correctable errors and I'm
not convinced this is worthwhile and we're not simply checking a box on
an ill-conceived marketing requirements document.

I had also commented asking how the hypervisor is expected to know
whether the guest supports AER.  With the existing support of a single
fatal error, the hypervisor halts the VM regardless of the error
severity or guest support.  Now we have the opportunity that the
hypervisor can forward a correctable error to the guest... and hope the
right thing occurs?  I never saw any response to this comment.

> 
> Signed-off-by: Michael S. Tsirkin 
> Signed-off-by: Cao jin 
> ---
> v6 changelog:
> Address all the comments from MST.
> 
>  drivers/vfio/pci/vfio_pci.c | 49 
> +++--
>  drivers/vfio/pci/vfio_pci_intrs.c   | 38 
>  drivers/vfio/pci/vfio_pci_private.h |  2 ++
>  include/uapi/linux/vfio.h   |  2 ++
>  4 files changed, 89 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 324c52e..71f9a8a 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -441,7 +441,9 @@ static int vfio_pci_get_irq_count(struct vfio_pci_device 
> *vdev, int irq_type)
>  
>   return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
>   }
> - } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
> + } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX ||
> +irq_type == VFIO_PCI_NON_FATAL_ERR_IRQ_INDEX ||
> +irq_type == VFIO_PCI_PASSIVE_RESET_IRQ_INDEX) {

Should we add a typdef to alias VFIO_PCI_ERR_IRQ_INDEX to
VFIO_PCI_FATAL_ERR_IRQ?

>   if (pci_is_pcie(vdev->pdev))
>   return 1;
>   } else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
> @@ -796,6 +798,8 @@ static 

[PATCH v6] vfio error recovery: kernel support

2017-03-23 Thread Cao jin
From: "Michael S. Tsirkin" 

0. What happens now (PCIE AER only)
   Fatal errors cause a link reset. Non fatal errors don't.
   All errors stop the QEMU guest eventually, but not immediately,
   because it's detected and reported asynchronously.
   Interrupts are forwarded as usual.
   Correctable errors are not reported to user at all.

   Note:
   PPC EEH is different, but this approach won't affect EEH. EEH treat
   all errors as fatal ones in AER, so they will still be signalled to user
   via the legacy eventfd.  Besides, all devices/functions in a PE belongs
   to the same IOMMU group, so the slot_reset handler in this approach
   won't affect EEH either.

1. Correctable errors
   Hardware can correct these errors without software intervention,
   clear the error status is enough, this is what already done now.
   No need to recover it, nothing changed, leave it as it is.

2. Fatal errors
   They will induce a link reset. This is troublesome when user is
   a QEMU guest. This approach doesn't touch the existing mechanism.

3. Non-fatal errors
   Before this patch, they are signalled to user the same way as fatal ones.
   With this patch, a new eventfd is introduced only for non-fatal error
   notification. By splitting non-fatal ones out, it will benefit AER
   recovery of a QEMU guest user.

   To maintain backwards compatibility with userspace, non-fatal errors
   will continue to trigger via the existing error interrupt index if a
   non-fatal signaling mechanism has not been registered.

   Note:
   In case of PCI Express errors, kernel might request a slot reset
   affecting our device (from our point of view this is a passive device
   reset as opposed to an active one requested by vfio itself).
   This might currently happen if a slot reset is requested by a driver
   (other than vfio) bound to another device function in the same slot.
   This will cause our device to lose its state so report this event to
   userspace.

Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Cao jin 
---
v6 changelog:
Address all the comments from MST.

 drivers/vfio/pci/vfio_pci.c | 49 +++--
 drivers/vfio/pci/vfio_pci_intrs.c   | 38 
 drivers/vfio/pci/vfio_pci_private.h |  2 ++
 include/uapi/linux/vfio.h   |  2 ++
 4 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 324c52e..71f9a8a 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -441,7 +441,9 @@ static int vfio_pci_get_irq_count(struct vfio_pci_device 
*vdev, int irq_type)
 
return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
}
-   } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
+   } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX ||
+  irq_type == VFIO_PCI_NON_FATAL_ERR_IRQ_INDEX ||
+  irq_type == VFIO_PCI_PASSIVE_RESET_IRQ_INDEX) {
if (pci_is_pcie(vdev->pdev))
return 1;
} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
@@ -796,6 +798,8 @@ static long vfio_pci_ioctl(void *device_data,
case VFIO_PCI_REQ_IRQ_INDEX:
break;
case VFIO_PCI_ERR_IRQ_INDEX:
+   case VFIO_PCI_NON_FATAL_ERR_IRQ_INDEX:
+   case VFIO_PCI_PASSIVE_RESET_IRQ_INDEX:
if (pci_is_pcie(vdev->pdev))
break;
/* pass thru to return error */
@@ -1282,7 +1286,9 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct 
pci_dev *pdev,
 
mutex_lock(>igate);
 
-   if (vdev->err_trigger)
+   if (state == pci_channel_io_normal && vdev->non_fatal_err_trigger)
+   eventfd_signal(vdev->non_fatal_err_trigger, 1);
+   else if (vdev->err_trigger)
eventfd_signal(vdev->err_trigger, 1);
 
mutex_unlock(>igate);
@@ -1292,8 +1298,47 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct 
pci_dev *pdev,
return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
+/*
+ * In case of PCI Express errors, kernel might request a slot reset
+ * affecting our device (from our point of view, this is a passive device
+ * reset as opposed to an active one requested by vfio itself).
+ * This might currently happen if a slot reset is requested by a driver
+ * (other than vfio) bound to another device function in the same slot.
+ * This will cause our device to lose its state, so report this event to
+ * userspace.
+ */
+static pci_ers_result_t vfio_pci_aer_slot_reset(struct pci_dev *pdev)
+{
+   struct vfio_pci_device *vdev;
+   struct vfio_device *device;
+   static pci_ers_result_t err = PCI_ERS_RESULT_NONE;
+
+   device = vfio_device_get_from_dev(>dev);
+   if (!device)
+   goto err_dev;
+
+   vdev = vfio_device_data(device);
+   if (!vdev)
+ 

[PATCH v6] vfio error recovery: kernel support

2017-03-23 Thread Cao jin
From: "Michael S. Tsirkin" 

0. What happens now (PCIE AER only)
   Fatal errors cause a link reset. Non fatal errors don't.
   All errors stop the QEMU guest eventually, but not immediately,
   because it's detected and reported asynchronously.
   Interrupts are forwarded as usual.
   Correctable errors are not reported to user at all.

   Note:
   PPC EEH is different, but this approach won't affect EEH. EEH treat
   all errors as fatal ones in AER, so they will still be signalled to user
   via the legacy eventfd.  Besides, all devices/functions in a PE belongs
   to the same IOMMU group, so the slot_reset handler in this approach
   won't affect EEH either.

1. Correctable errors
   Hardware can correct these errors without software intervention,
   clear the error status is enough, this is what already done now.
   No need to recover it, nothing changed, leave it as it is.

2. Fatal errors
   They will induce a link reset. This is troublesome when user is
   a QEMU guest. This approach doesn't touch the existing mechanism.

3. Non-fatal errors
   Before this patch, they are signalled to user the same way as fatal ones.
   With this patch, a new eventfd is introduced only for non-fatal error
   notification. By splitting non-fatal ones out, it will benefit AER
   recovery of a QEMU guest user.

   To maintain backwards compatibility with userspace, non-fatal errors
   will continue to trigger via the existing error interrupt index if a
   non-fatal signaling mechanism has not been registered.

   Note:
   In case of PCI Express errors, kernel might request a slot reset
   affecting our device (from our point of view this is a passive device
   reset as opposed to an active one requested by vfio itself).
   This might currently happen if a slot reset is requested by a driver
   (other than vfio) bound to another device function in the same slot.
   This will cause our device to lose its state so report this event to
   userspace.

Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Cao jin 
---
v6 changelog:
Address all the comments from MST.

 drivers/vfio/pci/vfio_pci.c | 49 +++--
 drivers/vfio/pci/vfio_pci_intrs.c   | 38 
 drivers/vfio/pci/vfio_pci_private.h |  2 ++
 include/uapi/linux/vfio.h   |  2 ++
 4 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 324c52e..71f9a8a 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -441,7 +441,9 @@ static int vfio_pci_get_irq_count(struct vfio_pci_device 
*vdev, int irq_type)
 
return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
}
-   } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
+   } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX ||
+  irq_type == VFIO_PCI_NON_FATAL_ERR_IRQ_INDEX ||
+  irq_type == VFIO_PCI_PASSIVE_RESET_IRQ_INDEX) {
if (pci_is_pcie(vdev->pdev))
return 1;
} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
@@ -796,6 +798,8 @@ static long vfio_pci_ioctl(void *device_data,
case VFIO_PCI_REQ_IRQ_INDEX:
break;
case VFIO_PCI_ERR_IRQ_INDEX:
+   case VFIO_PCI_NON_FATAL_ERR_IRQ_INDEX:
+   case VFIO_PCI_PASSIVE_RESET_IRQ_INDEX:
if (pci_is_pcie(vdev->pdev))
break;
/* pass thru to return error */
@@ -1282,7 +1286,9 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct 
pci_dev *pdev,
 
mutex_lock(>igate);
 
-   if (vdev->err_trigger)
+   if (state == pci_channel_io_normal && vdev->non_fatal_err_trigger)
+   eventfd_signal(vdev->non_fatal_err_trigger, 1);
+   else if (vdev->err_trigger)
eventfd_signal(vdev->err_trigger, 1);
 
mutex_unlock(>igate);
@@ -1292,8 +1298,47 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct 
pci_dev *pdev,
return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
+/*
+ * In case of PCI Express errors, kernel might request a slot reset
+ * affecting our device (from our point of view, this is a passive device
+ * reset as opposed to an active one requested by vfio itself).
+ * This might currently happen if a slot reset is requested by a driver
+ * (other than vfio) bound to another device function in the same slot.
+ * This will cause our device to lose its state, so report this event to
+ * userspace.
+ */
+static pci_ers_result_t vfio_pci_aer_slot_reset(struct pci_dev *pdev)
+{
+   struct vfio_pci_device *vdev;
+   struct vfio_device *device;
+   static pci_ers_result_t err = PCI_ERS_RESULT_NONE;
+
+   device = vfio_device_get_from_dev(>dev);
+   if (!device)
+   goto err_dev;
+
+   vdev = vfio_device_data(device);
+   if (!vdev)
+   goto err_data;
+
+   mutex_lock(>igate);