Re: [ndctl PATCH] monitor: Add epoll timeout for forcing a full dimm health check

Vaibhav Jain Tue, 31 Mar 2020 22:36:39 -0700


Thanks for reviewing this patch,


Dan Williams <[email protected]> writes:

> On Fri, Feb 21, 2020 at 9:55 AM Vaibhav Jain <[email protected]> wrote:
>>
>> This patch adds a new command argument to the 'monitor' command namely
>> '--check-interval' that triggers a call to notify_dimm_event() at
>> regular intervals forcing a periodic check of dimm smart events.
>>
>> This behavior is useful for dimms that do not support event notifications
>> in case the health status of an nvdimm changes. This is especially
>> true in case of PAPR-SCM dimms as the PHYP hyper-visor doesn't provide
>> any notifications to the guest kernel on a change in nvdimm health
>> status. In such case periodic polling of the is the only way to track
>> the health of a nvdimm.
>>
>> The patch updates monitor_event() adding a timeout value to
>> epoll_wait() call. Also to prevent the possibility of a single dimm
>> generating enough events thereby preventing check of health status of
>> other nvdimms, a 'fullpoll_ts' time-stamp is added to keep track of
>> when full health check of all dimms happened. If after epoll_wait()
>> returns 'fullpoll_ts' time-stamp indicates last full dimm health check
>> happened beyond 'check-interval' seconds then a full dimm health check
>> is enforced.
>>
>> Signed-off-by: Vaibhav Jain <[email protected]>
>> ---
>>  Documentation/ndctl/ndctl-monitor.txt |  4 ++++
>>  ndctl/monitor.c                       | 31 ++++++++++++++++++++++++---
>>  2 files changed, 32 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/ndctl/ndctl-monitor.txt 
>> b/Documentation/ndctl/ndctl-monitor.txt
>> index 2239f047266d..14cc59d57157 100644
>> --- a/Documentation/ndctl/ndctl-monitor.txt
>> +++ b/Documentation/ndctl/ndctl-monitor.txt
>> @@ -108,6 +108,10 @@ will not work if "--daemon" is specified.
>>  The monitor will attempt to enable the alarm control bits for all
>>  specified events.
>>
>> +-i::
>> +--check-interval=::
>> +       Force a recheck of dimm health every <n> seconds.
>
> I'd just call this "-p --poll="  to put the monitor into polled mode.
Agree to the suggestion. Will update the flag name in V2. 

>
>> +
>>  -u::
>>  --human::
>>         Output monitor notification as human friendly json format instead
>> diff --git a/ndctl/monitor.c b/ndctl/monitor.c
>> index 1755b87a5eeb..b72c5852524e 100644
>> --- a/ndctl/monitor.c
>> +++ b/ndctl/monitor.c
>> @@ -4,6 +4,7 @@
>>  #include <stdio.h>
>>  #include <json-c/json.h>
>>  #include <libgen.h>
>> +#include <time.h>
>>  #include <dirent.h>
>>  #include <util/json.h>
>>  #include <util/filter.h>
>> @@ -33,6 +34,7 @@ static struct monitor {
>>         bool daemon;
>>         bool human;
>>         bool verbose;
>> +       unsigned int poll_timeout;
>>         unsigned int event_flags;
>>         struct log_ctx ctx;
>>  } monitor;
>> @@ -322,9 +324,14 @@ static int monitor_event(struct ndctl_ctx *ctx,
>>                 struct monitor_filter_arg *mfa)
>>  {
>>         struct epoll_event ev, *events;
>> -       int nfds, epollfd, i, rc = 0;
>> +       int nfds, epollfd, i, rc = 0, polltimeout = -1;
>>         struct monitor_dimm *mdimm;
>>         char buf;
>> +       /* last time a full poll happened */
>> +       struct timespec fullpoll_ts, ts;
>> +
>> +       if (monitor.poll_timeout)
>> +               polltimeout = monitor.poll_timeout * 1000;
>>
>>         events = calloc(mfa->num_dimm, sizeof(struct epoll_event));
>>         if (!events) {
>> @@ -354,14 +361,30 @@ static int monitor_event(struct ndctl_ctx *ctx,
>>                 }
>>         }
>>
>> +       clock_gettime(CLOCK_BOOTTIME, &fullpoll_ts);
>>         while (1) {
>>                 did_fail = 0;
>> -               nfds = epoll_wait(epollfd, events, mfa->num_dimm, -1);
>> -               if (nfds <= 0 && errno != EINTR) {
>> +               nfds = epoll_wait(epollfd, events, mfa->num_dimm, 
>> polltimeout);
>> +               if (nfds < 0 && errno != EINTR) {
>>                         err(&monitor, "epoll_wait error: (%s)\n", 
>> strerror(errno));
>>                         rc = -errno;
>>                         goto out;
>>                 }
>> +
>> +               /* If needed force a full poll of dimm health */
>> +               clock_gettime(CLOCK_BOOTTIME, &ts);
>> +               if ((fullpoll_ts.tv_sec - ts.tv_sec) > monitor.poll_timeout) 
>> {
>> +                       nfds = 0;
>> +                       dbg(&monitor, "forcing a full poll\n");
>> +               }
>
> Why is this hunk above needed? If the DIMMs are firing events then all
> of them that fired will get serviced and on any timeout all DIMMs will
> get serviced. Why does fullpoll_ts need to be tracked?
>
This hunk was added to prevent any dimm thats firing too many events
from starving the poll of events from other dimms.

>> +
>> +               /* If we timed out then fill events array with all dimms */
>> +               if (nfds == 0) {
>> +                       list_for_each(&mfa->dimms, mdimm, list)
>> +                               events[nfds++].data.ptr = mdimm;
>> +                       fullpoll_ts = ts;
>> +               }
>> +
>>                 for (i = 0; i < nfds; i++) {
>>                         mdimm = events[i].data.ptr;
>>                         if (util_dimm_event_filter(mdimm, 
>> monitor.event_flags)) {
>> @@ -570,6 +593,8 @@ int cmd_monitor(int argc, const char **argv, struct 
>> ndctl_ctx *ctx)
>>                                 "use human friendly output formats"),
>>                 OPT_BOOLEAN('v', "verbose", &monitor.verbose,
>>                                 "emit extra debug messages to log"),
>> +               OPT_UINTEGER('i', "check-interval", &monitor.poll_timeout,
>> +                            "force a dimm health recheck every <n> 
>> seconds"),
>
> I'd replace "force a dimm health check" with "poll and report status".
Fair point. Will update the this usage text in v2.
>
> The monitor can theoretically also register for region and bus events,
> so --poll option just forces the monitor to report the  all recorded
> events not necessarily only dimm events.
Agree

-- 
Vaibhav Jain <[email protected]>
Linux Technology Center, IBM India Pvt. Ltd.
_______________________________________________
Linux-nvdimm mailing list -- [email protected]
To unsubscribe send an email to [email protected]

Re: [ndctl PATCH] monitor: Add epoll timeout for forcing a full dimm health check

Reply via email to