Re: [ndctl PATCH] Report the Health Status Detail for the HPE1 DSM family

Linda Knippers Fri, 09 Jun 2017 11:56:31 -0700

On 6/9/2017 2:10 PM, Dan Williams wrote:

This format for health_detail makes the json harder to consume for
upper-layer applications. Every field should be associated with a
key-id even if it's just a flag.



Ok.  I thought we had discussed this format with my RFC patch a
while back.  The only change I made was adding underscores instead
of spaces as requested.


Yeah, I thought I had brought up the array of json-boolean objects vs
json-string objects before.


Hmm...maybe I missed that, but ok.

The items in the array are all associated with a single validation flag.
There are some additional groups of fields that I plan to add as well,
like a group of error counters, so do you want everything flat?


Yes, after writing some scripts to parse json I think flatter is
better unless the data itself is actually hierarchical.

I think I'd like to keep the json
topology flatter. We already have support for the NFIT health state
flags in this form:

        {
          "provider":"nfit_test.1",
          "dev":"ndbus3",
          "dimms":[
            {
              "dev":"nmem5",
              "id":"cdab-0a-07e0-fffffeff",
              "flag_failed_save":true,
              "flag_failed_arm":true,
              "flag_failed_restore":true,
              "flag_smart_event":true,
              "flag_failed_flush":true
            }
          ]
        }

So, let's just add additional flag names and omit the ones that are
duplicates of the NFIT-level health state flags.



None of them are duplicates because the NFIT state flags are static from
boot
time while the health status detail changes at runtime.


Yes, but it's a pain for applications that now have to deal with two
different flag names for similar indicators. So for example if
"flag_failed_arm" is false at startup, but "arm_error" triggers later
I would think that would just be reflected in "flag_failed_arm". In
other words override the static flag_failed_arm value with the dynamic
result so that monitor applications need only learn one json key for
the same data.


But they're different things.  One represents boot time state and one represents
runtime.  One is an NFIT flag, not even reported as health today, and one is
runtime health.  Even your commit message says they're distinct from SMART
health data.  Knowing that my NVDIMM came up healthy and then had an
 issue,
or vice versa, is important.

Maybe the state flags should be renamed to reflect what they really are.
The "true" value for those is probably redundant too because I don't think you
output anything if they're fault, although now I'm not sure those NFIT
flags are being displayed properly at all.  On my system, /sys../nfit/flags
shows "smart_notify", which I assume means I should see "flag_smart_event":true,
but I don't.  I'll look at that separately.

-- ljk


_______________________________________________
Linux-nvdimm mailing list
[email protected]
https://lists.01.org/mailman/listinfo/linux-nvdimm

Re: [ndctl PATCH] Report the Health Status Detail for the HPE1 DSM family

Reply via email to