Re: [Intel-wired-lan] [PATCH iwl-next v1] ice: fw and port health status

Tony Nguyen Fri, 22 Nov 2024 11:33:29 -0800



On 11/18/2024 2:48 AM, Konrad Knitter wrote:

Firmware generates events for global events or port specific events.

Driver shall subscribe for health status events from firmware on supported
FW versions >= 1.7.6.
Driver shall expose those under specific health reporter, two new
reporters are introduced:
- FW health reporter shall represent global events (problems with the
image, recovery mode);
- Port health reporter shall represent port-specific events (module
failure).

Firmware only reports problems when those are detected, it does not store
active fault list.
Driver will hold only last global and last port-specific event.
Driver will report all events via devlink health report,
so in case of multiple events of the same source they can be reviewed
using devlink autodump feature.

$ devlink health

pci/0000:b1:00.3:
   reporter fw
     state healthy error 0 recover 0 auto_dump true
   reporter port
     state error error 1 recover 0 last_dump_date 2024-03-17
        last_dump_time 09:29:29 auto_dump true

$ devlink health diagnose pci/0000:b1:00.3 reporter port

   Syndrome: 262
   Description: Module is not present.
   Possible Solution: Check that the module is inserted correctly.
   Port Number: 0

Tested on Intel Corporation Ethernet Controller E810-C for SFP

Co-developed-by: Sharon Haroni <[email protected]>
Signed-off-by: Sharon Haroni <[email protected]>
Co-developed-by: Nicholas Nunley <[email protected]>
Signed-off-by: Nicholas Nunley <[email protected]>
Co-developed-by: Brett Creeley <[email protected]>
Signed-off-by: Brett Creeley <[email protected]>
Signed-off-by: Konrad Knitter <[email protected]>
---
  .../net/ethernet/intel/ice/devlink/health.c   | 290 +++++++++++++++++-
  .../net/ethernet/intel/ice/devlink/health.h   |  12 +
  .../net/ethernet/intel/ice/ice_adminq_cmd.h   |  87 ++++++
  drivers/net/ethernet/intel/ice/ice_common.c   |  37 +++
  drivers/net/ethernet/intel/ice/ice_common.h   |   2 +
  drivers/net/ethernet/intel/ice/ice_main.c     |   3 +
  drivers/net/ethernet/intel/ice/ice_type.h     |   5 +
  7 files changed, 429 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/devlink/health.c b/drivers/net/ethernet/intel/ice/devlink/health.c
index c7a8b8c9e1ca..4e6c6891e207 100644
--- a/drivers/net/ethernet/intel/ice/devlink/health.c
+++ b/drivers/net/ethernet/intel/ice/devlink/health.c
@@ -1,13 +1,272 @@
  // SPDX-License-Identifier: GPL-2.0
  /* Copyright (c) 2024, Intel Corporation. */

-#include "health.h"

  #include "ice.h"
+#include "ice_adminq_cmd.h" /* for enum ice_aqc_health_status_elem */
+#include "health.h"


Is there a reason you're re-ordering health.h?

  #include "ice_ethtool_common.h"

  #define ICE_DEVLINK_FMSG_PUT_FIELD(fmsg, obj, name) \
        devlink_fmsg_put(fmsg, #name, (obj)->name)

+#define ICE_HEALTH_STATUS_DATA_SIZE 2

+
+struct ice_health_status {
+       enum ice_aqc_health_status code;
+       const char *description;
+       const char *solution;
+       const char *data_label[ICE_HEALTH_STATUS_DATA_SIZE];
+};
+
+/**


Wrong style, should be '/*'

drivers/net/ethernet/intel/ice/devlink/health.c:22: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst

+ * In addition to the health status codes provided below, the firmware might
+ * generate Health Status Codes that are not pertinent to the end-user.
+ * For instance, Health Code 0x1002 is triggered when the command fails.
+ * Such codes should be disregarded by the end-user.
+ * The below lookup requires to be sorted by code.
+ */
+
+static const char *const ice_common_port_solutions =
+       "Check your cable connection. Change or replace the module or cable. Manually set speed and duplex.";
+static const char *const ice_port_number_label = "Port Number";
+static const char *const ice_update_nvm_solution = "Update to the latest NVM image.";

...
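
For illustration only: the quoted comment says the lookup table must be sorted by code, which is the precondition for a binary search. Below is a minimal userspace sketch of such a lookup using the C library's bsearch(). The struct, the table contents (apart from syndrome 262, "Module is not present.", taken from the devlink output above) and the function names are hypothetical, and nothing here is claimed to match the elided ice_get_health_status().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for struct ice_health_status. */
struct health_status {
	uint16_t code;
	const char *description;
};

/* Must stay sorted by code. Only 0x106 (262) comes from the thread. */
static const struct health_status health_table[] = {
	{ 0x101, "Example entry A (hypothetical)." },
	{ 0x106, "Module is not present." },
	{ 0x10B, "Example entry B (hypothetical)." },
};

#define TABLE_LEN (sizeof(health_table) / sizeof(health_table[0]))

/* bsearch() comparator: key is a code, elem is a table entry. */
static int cmp_code(const void *key, const void *elem)
{
	uint16_t code = *(const uint16_t *)key;
	const struct health_status *hs = elem;

	return (code > hs->code) - (code < hs->code);
}

/* Returns the matching entry, or NULL for codes absent from the table. */
static const struct health_status *get_health_status(uint16_t code)
{
	return bsearch(&code, health_table, TABLE_LEN,
		       sizeof(health_table[0]), cmp_code);
}

/* Debug helper: verifies the ordering that the binary search relies on. */
static void check_table_sorted(void)
{
	for (size_t i = 1; i < TABLE_LEN; i++)
		assert(health_table[i - 1].code < health_table[i].code);
}
```

A code that is absent from the table, such as the internal 0x1002 mentioned in the comment, yields NULL, so the caller can skip it.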

+static void ice_describe_status_code(struct devlink_fmsg *fmsg,
+                                    struct ice_aqc_health_status_elem *hse)
+{
+       static const char *const aux_label[] = { "Aux Data 1", "Aux Data 2" };
+       const struct ice_health_status *health_code;
+       u32 internal_data[2];
+       u16 status_code;
+
+       status_code = le16_to_cpu(hse->health_status_code);
+
+       devlink_fmsg_put(fmsg, "Syndrome", status_code);
+       if (status_code != 0) {


if (status_code) {...

+               internal_data[0] = le32_to_cpu(hse->internal_data1);
+               internal_data[1] = le32_to_cpu(hse->internal_data2);
+
+               health_code = ice_get_health_status(status_code);
+
+               if (!health_code)
+                       return;

Please don't separate the error check with a newline. Other occurrences in this patch as well, please fix those too.

+
+               devlink_fmsg_string_pair_put(fmsg, "Description", health_code->description);
+
+               if (health_code->solution)
+                       devlink_fmsg_string_pair_put(fmsg, "Possible Solution",
+                                                    health_code->solution);
+
+               for (int i = 0; i < ICE_HEALTH_STATUS_DATA_SIZE; i++) {
+                       if (internal_data[i] != ICE_AQC_HEALTH_STATUS_UNDEFINED_DATA)
+                               devlink_fmsg_u32_pair_put(fmsg,
+                                                         health_code->data_label[i] ?
+                                                         health_code->data_label[i] :
+                                                         aux_label[i],
+                                                         internal_data[i]);
+               }
+       }
+}
+

...
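
For illustration only: a userspace sketch of the describe step with the style remarks above applied (a plain `if (!code)` test, and each error check placed directly after the assignment it checks). It formats into a caller-supplied buffer instead of a devlink fmsg. The struct, lookup(), append() and describe() are hypothetical names, the single table entry is assumed from the devlink output earlier in the thread, and UNDEFINED_DATA is a placeholder for ICE_AQC_HEALTH_STATUS_UNDEFINED_DATA, whose value is not shown here.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DATA_SIZE 2
#define UNDEFINED_DATA 0xDEADBEEFu /* placeholder sentinel, an assumption */

struct health_status {
	uint16_t code;
	const char *description;
	const char *solution;               /* may be NULL */
	const char *data_label[DATA_SIZE];  /* NULL falls back to an aux label */
};

/* Hypothetical lookup with one known entry, to keep the sketch short. */
static const struct health_status *lookup(uint16_t code)
{
	static const struct health_status module_missing = {
		262, "Module is not present.",
		"Check that the module is inserted correctly.",
		{ "Port Number", NULL },
	};

	return code == module_missing.code ? &module_missing : NULL;
}

/* Appends formatted text to a NUL-terminated buffer, truncating if full. */
static void append(char *out, size_t len, const char *fmt, ...)
{
	size_t used = strlen(out);
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(out + used, len - used, fmt, ap);
	va_end(ap);
}

static void describe(char *out, size_t len, uint16_t code,
		     const uint32_t data[DATA_SIZE])
{
	static const char *const aux_label[DATA_SIZE] = {
		"Aux Data 1", "Aux Data 2"
	};
	const struct health_status *hs;

	assert(len > 0);
	out[0] = '\0';

	append(out, len, "Syndrome: %u\n", (unsigned int)code);
	if (!code)
		return;

	hs = lookup(code);
	if (!hs)
		return;

	append(out, len, "Description: %s\n", hs->description);
	if (hs->solution)
		append(out, len, "Possible Solution: %s\n", hs->solution);

	for (int i = 0; i < DATA_SIZE; i++) {
		const char *label = hs->data_label[i] ? hs->data_label[i] :
							aux_label[i];

		if (data[i] != UNDEFINED_DATA)
			append(out, len, "%s: %u\n", label,
			       (unsigned int)data[i]);
	}
}
```

The early returns also remove one level of nesting compared with wrapping the whole body in `if (status_code != 0) { ... }`.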

+void ice_process_health_status_event(struct ice_pf *pf, struct ice_rq_event_info *event)
+{
+       const struct ice_aqc_health_status_elem *health_info;
+       const struct ice_health_status *health_code;
+       u16 status_code, count;
+
+       health_info = (struct ice_aqc_health_status_elem *)event->msg_buf;
+       count = le16_to_cpu(event->desc.params.get_health_status.health_status_count);
+
+       if (count > (event->buf_len / sizeof(*health_info))) {
+               dev_err(ice_pf_to_dev(pf), "Received a health status event with invalid element count\n");
+               return;
+       }
+
+       for (int i = 0; i < count; i++) {
+               status_code = le16_to_cpu(health_info->health_status_code);
+               health_code = ice_get_health_status(status_code);


Looks like the scope of these vars can be reduced to this loop.

+
+               if (health_code) {
+                       switch (health_info->event_source) {
+                       case ICE_AQC_HEALTH_STATUS_GLOBAL:
+                               pf->health_reporters.fw_status = *health_info;
+                               devlink_health_report(pf->health_reporters.fw,
+                                                     "FW syndrome reported", NULL);
+                               break;
+                       case ICE_AQC_HEALTH_STATUS_PF:
+                       case ICE_AQC_HEALTH_STATUS_PORT:
+                               pf->health_reporters.port_status = *health_info;
+                               devlink_health_report(pf->health_reporters.port,
+                                                     "Port syndrome reported", NULL);
+                               break;
+                       default:
+                               dev_err(ice_pf_to_dev(pf), "Health code with unknown source\n");
+                       }
+               } else {
+                       u32 data1, data2;
+                       u16 source;
+
+                       source = le16_to_cpu(health_info->event_source);
+                       data1 = le32_to_cpu(health_info->internal_data1);
+                       data2 = le32_to_cpu(health_info->internal_data2);
+                       dev_dbg(ice_pf_to_dev(pf),
+                               "Received internal health status code 0x%08x, source: 0x%08x, data1: 0x%08x, data2: 0x%08x",
+                               status_code, source, data1, data2);
+               }
+               health_info++;
+       }
+}

...
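
For illustration only: a userspace sketch of the event loop's two key ideas, validating the firmware-supplied element count against the buffer length before walking the buffer, and keeping only the last global and last port-specific event while counting every report. The struct layout, the source constants and all names are assumptions, and endianness conversion (le16_to_cpu() and friends) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ice_aqc_health_status_elem. */
struct health_elem {
	uint16_t status_code;
	uint16_t event_source;
	uint32_t data1;
	uint32_t data2;
};

/* Source values are assumed for this sketch. */
enum { SRC_GLOBAL = 1, SRC_PF = 2, SRC_PORT = 3 };

struct last_events {
	struct health_elem fw;      /* last global event */
	struct health_elem port;    /* last port-specific event */
	unsigned int fw_reports;    /* stands in for devlink_health_report() */
	unsigned int port_reports;
};

/*
 * Returns the number of elements walked, or -1 if the claimed count does
 * not fit in the buffer.
 */
static int process_events(struct last_events *last, const void *buf,
			  size_t buf_len, uint16_t count)
{
	const struct health_elem *elem = buf;

	assert(last && buf);

	if (count > buf_len / sizeof(*elem))
		return -1;

	for (uint16_t i = 0; i < count; i++, elem++) {
		switch (elem->event_source) {
		case SRC_GLOBAL:
			last->fw = *elem;
			last->fw_reports++;
			break;
		case SRC_PF:
		case SRC_PORT:
			last->port = *elem;
			last->port_reports++;
			break;
		default:
			break; /* unknown source: ignored in this sketch */
		}
	}

	return count;
}
```

Two port events in one buffer leave only the second one stored, but both are counted, which mirrors the "hold last, report all" behaviour described in the commit message.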

@@ -27,15 +29,21 @@ enum ice_mdd_src {
   * struct ice_health - stores ice devlink health reporters and accompanied data
   * @tx_hang: devlink health reporter for tx_hang event
   * @mdd: devlink health reporter for MDD detection event
+ * @fw: devlink health reporter for FW Health Status events
+ * @port: devlink health reporter for Port Health Status events


These should follow the order of the struct members, i.e. 'mdd' should come between 'fw' and 'port'.

   * @tx_hang_buf: pre-allocated place to put info for Tx hang reporter from
   *               non-sleeping context
   * @tx_ring: ring that the hang occured on
   * @head: descriptior head
   * @intr: interrupt register value
   * @vsi_num: VSI owning the queue that the hang occured on
+ * @fw_status: buffer for last received FW Status event
+ * @port_status: buffer for last received Port Status event
   */
  struct ice_health {
+       struct devlink_health_reporter *fw;
        struct devlink_health_reporter *mdd;
+       struct devlink_health_reporter *port;
        struct devlink_health_reporter *tx_hang;
        struct_group_tagged(ice_health_tx_hang_buf, tx_hang_buf,
                struct ice_tx_ring *tx_ring;

...

+/**
+ * ice_is_fw_health_report_supported

drivers/net/ethernet/intel/ice/ice_common.c:6052: warning: missing initial short description on line:

 * ice_is_fw_health_report_supported

+ * @hw: pointer to the hardware structure
+ *
+ * Return true if firmware supports health status reports,


'Return' isn't recognized; it should be 'Return:'

drivers/net/ethernet/intel/ice/ice_common.c:6059: warning: No description found for return value of 'ice_is_fw_health_report_supported'

+ * false otherwise
+ */
+bool ice_is_fw_health_report_supported(struct ice_hw *hw)
+{
+       return ice_is_fw_api_min_ver(hw, ICE_FW_API_HEALTH_REPORT_MAJ,
+                                    ICE_FW_API_HEALTH_REPORT_MIN,
+                                    ICE_FW_API_HEALTH_REPORT_PATCH);
+}
+
+/**
+ * ice_aq_set_health_status_cfg - Configure FW health events
+ * @hw: pointer to the HW struct
+ * @event_source: type of diagnostic events to enable
+ *
+ * Configure the health status event types that the firmware will send to this
+ * PF. The supported event types are: PF-specific, all PFs, and global.
+ * Return: 0 on success, negative error code otherwise.

IMO a newline separating the Return: would make it easier to differentiate.


Thanks,
Tony

+ */
+int ice_aq_set_health_status_cfg(struct ice_hw *hw, u8 event_source)
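
For illustration only: a userspace sketch that applies the kernel-doc remarks above (a short description on the first line, and a `Return:` section set off by a blank comment line) to a minimum-version check of the kind the commit message implies for FW API >= 1.7.6. The struct and function name are hypothetical, and this is not claimed to be the driver's ice_is_fw_api_min_ver().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the FW API version fields kept by the driver. */
struct fw_api_ver {
	uint8_t maj;
	uint8_t min;
	uint8_t patch;
};

/**
 * fw_api_min_ver - check whether the FW API version meets a minimum
 * @ver: version reported by firmware
 * @maj: required major version
 * @min: required minor version
 * @patch: required patch version
 *
 * Compare major first, then minor, then patch.
 *
 * Return: true if @ver is at least @maj.@min.@patch, false otherwise.
 */
static bool fw_api_min_ver(const struct fw_api_ver *ver, uint8_t maj,
			   uint8_t min, uint8_t patch)
{
	assert(ver);

	if (ver->maj != maj)
		return ver->maj > maj;
	if (ver->min != min)
		return ver->min > min;

	return ver->patch >= patch;
}
```

With a required version of 1.7.6, versions 1.7.6, 1.8.0 and 2.0.0 pass, while 1.7.5 and 0.9.9 do not.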
