Roman Kagan <rvka...@yandex-team.ru> writes:

> On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote:
>> On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
>> > Roman Kagan <rvka...@yandex-team.ru> writes:
>> > 
>> > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
>> > >> Konstantin Khlebnikov <khlebni...@yandex-team.ru> writes:
>> > >> 
>> > >> > This event represents device runtime errors to give time and
>> > >> > reason why device is broken.
>> > >> 
>> > >> Can you give one or more examples of the "device runtime errors" you have
>> > >> in mind?
>> > >
>> > > Initially we wanted to address a situation when a vhost device
>> > > discovered an inconsistency during virtqueue processing and silently
>> > > stopped the virtqueue.  This resulted in device stall (partial for
>> > > multiqueue devices) and we were the last to notice that.
>> > >
>> > > The solution appeared to be to employ errfd and, upon receiving a
>> > > notification through it, to emit a QMP event which is actionable in the
>> > > management layer or further up the stack.
>> > >
>> > > Then we observed that virtio (non-vhost) devices suffer from the same
>> > > issue: they only log the error but don't signal it to the management
>> > > layer.  The case was very similar so we thought it would make sense to
>> > > share the infrastructure and the QMP event between virtio and vhost.
>> > >
>> > > Then Konstantin went a bit further and generalized the concept into
>> > > generic "device runtime error".  I'm personally not completely convinced
>> > > this generalization is appropriate here; we'd appreciate the opinions
>> > > from the community on the matter.
>> > 
>> > "Device emulation sending an event on entering certain error states, so
>> > that a management application can do something about it" feels
>> > reasonable enough to me as a general concept.
>> > 
>> > The key point is of course "can do something": the event needs to be
>> > actionable.  Can you describe possible actions for the cases you
>> > implement?
>> 
>> The first one that we had in mind was informational, like triggering an
>> alert in the monitoring system and/or painting the VM as malfunctioning
>> in the owner's UI.
>> 
>> There can be more advanced scenarios like autorecovery by resetting the
>> faulty VM, or fencing it if it's a cluster member.
>
> The discussion kind of stalled here.

My apologies...

>                                       Do you think the approach makes
> sense or not?  Should we try and resubmit the series with a proper cover
> letter and possibly other improvements or is it a dead end?

As QAPI schema maintainer, my concern is interface design.  To sell this
interface to me (so to speak), you have to show it's useful and
reasonably general.  Reasonably general, because we don't want to
accumulate one-offs, even if they have their uses.

I think this is mostly a matter of commit message(s) and documentation
here.  Explain your intended use cases.  Maybe hand-wave at other use
cases you can think of.  Document that you're implementing the event
only for the specific errors you need, but that it could be implemented
more widely as needed.  "Complete" feels impractical, though.
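
To make "actionable" concrete, a management-side consumer for the use
cases mentioned above (alerting, reset, fencing) might look roughly like
the sketch below.  It is not taken from the series: the event name
DEVICE_ERROR, the "device" and "reason" data fields, and the policy
flags are hypothetical placeholders.  Only the outer
event/data/timestamp envelope follows the usual QMP event layout.

```python
import json

# Hypothetical event name; the thread does not settle on one.
EVENT_NAME = "DEVICE_ERROR"

# Sample event as it might arrive on the QMP socket.  The "device" and
# "reason" fields are assumptions made for illustration only.
SAMPLE = json.dumps({
    "event": EVENT_NAME,
    "data": {"device": "virtio-blk0", "reason": "virtqueue processing error"},
    "timestamp": {"seconds": 1653922000, "microseconds": 0},
})


def choose_action(event, cluster_member=False, auto_recover=False):
    """Map a parsed QMP event to the name of a management action."""
    if event.get("event") != EVENT_NAME:
        return "ignore"   # not the event we care about
    if cluster_member:
        return "fence"    # isolate a faulty cluster member
    if auto_recover:
        return "reset"    # attempt autorecovery by resetting the VM
    return "alert"        # informational: raise an alert, flag the VM


def handle_line(line, **policy):
    """Parse one JSON line and return (action, device, reason)."""
    event = json.loads(line)
    data = event.get("data", {})
    return (choose_action(event, **policy),
            data.get("device"),
            data.get("reason"))


if __name__ == "__main__":
    print(handle_line(SAMPLE))
    print(handle_line(SAMPLE, cluster_member=True))
```

The default branch is the purely informational case; the more invasive
actions are opt-in, since whether a reset or fencing is appropriate is
up to the management layer's policy.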

Makes sense?

