On Thu, 11 Mar 2021 16:23:09 +0200 Eran Ben Elisha wrote: > On 3/11/2021 5:26 AM, Jakub Kicinski wrote: > >>> Pending vendors adding the right reporters. << > > Would you like Nvidia to reply with the remedy per reporter or to > actually prepare the patch?
You mean the patch adding .remedy? If you can that'd be helpful. Or do you have HW error reporters to add? > > Extend the applicability of devlink health reporters > > beyond what can be locally remedied. Add failure modes > > which require re-flashing the NVM image or HW changes. > > > > The expectation is that driver will call > > devlink_health_reporter_state_update() to put hardware > > health reporters into bad state. > > > > Signed-off-by: Jakub Kicinski <k...@kernel.org> > > --- > > include/uapi/linux/devlink.h | 7 +++++++ > > net/core/devlink.c | 3 +-- > > 2 files changed, 8 insertions(+), 2 deletions(-) > > > > diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h > > index 8cd1508b525b..f623bbc63489 100644 > > --- a/include/uapi/linux/devlink.h > > +++ b/include/uapi/linux/devlink.h > > @@ -617,10 +617,17 @@ enum devlink_port_fn_opstate { > > * @DL_HEALTH_STATE_ERROR: error state, running health reporter's recovery > > * may fix the issue, otherwise user needs to try > > * power cycling or other forms of reset > > + * @DL_HEALTH_STATE_BAD_IMAGE: device's non-volatile memory needs > > + * to be re-written, usually due to block corruption > > + * @DL_HEALTH_STATE_BAD_HW: hardware errors detected, device, host > > + * or the connection between the two may be at fault > > */ > > enum devlink_health_state { > > DL_HEALTH_STATE_HEALTHY, > > DL_HEALTH_STATE_ERROR, > > + > > + DL_HEALTH_STATE_BAD_IMAGE, > > + DL_HEALTH_STATE_BAD_HW, > > }; > > > > /** > > diff --git a/net/core/devlink.c b/net/core/devlink.c > > index 09d77d43ff63..4a9fa6288a4a 100644 > > --- a/net/core/devlink.c > > +++ b/net/core/devlink.c > > @@ -6527,8 +6527,7 @@ void > > devlink_health_reporter_state_update(struct devlink_health_reporter > > *reporter, > > enum devlink_health_state state) > > { > > - if (WARN_ON(state != DL_HEALTH_STATE_HEALTHY && > > - state != DL_HEALTH_STATE_ERROR)) > > + if (WARN_ON(state > DL_HEALTH_STATE_BAD_HW)) > > return; > > > > if (reporter->health_state == state) > > > > devlink_health_reporter_recover() requires an update as well. > something like: > > @@ -6346,8 +6346,15 @@ devlink_health_reporter_recover(struct > devlink_health_reporter *reporter, > { > int err; > > - if (reporter->health_state == DL_HEALTH_STATE_HEALTHY) > + switch (reporter->health_state) { > + case DL_HEALTH_STATE_HEALTHY: > return 0; > + case DL_HEALTH_STATE_ERROR: > + break; > + case DL_HEALTH_STATE_BAD_IMAGE: > + case DL_HEALTH_STATE_BAD_HW: > + return -EOPNOTSUPP; > + } > > if (!reporter->ops->recover) > return -EOPNOTSUPP; > Thanks!