Hi Stewart,

We had been running mmcrsnapshot with a ~700 node remote cluster accessing the 
filesystem for a couple of years without issue.

However, we haven’t been running it for a little while because there is a very 
serious bug in GPFS 4.1.0.x relating to snapshot *deletion*.  There is an efix 
for it and we are in the process of rolling that out, but will not try to 
resume snapshots until both clusters are fully updated.

HTH…

Kevin

On Dec 7, 2015, at 11:23 AM, Howard, Stewart Jameson wrote:

Hi All,

Thanks to Doug and Kevin for the replies.  In answer to Kevin's question about 
our choice of clustering solution for NFS:  the choice was made hoping to 
maintain some simplicity by not using more than one HA solution at a time.  
However, it seems that this choice might have introduced more wrinkles than 
it's ironed out.

An update on our situation:  we have actually uncovered another clue since my 
last posting.  One thing that is now known to be correlated *very* closely 
with instability in the NFS layer is running `mmcrsnapshot`.  We had noticed 
that flapping happened like clockwork at midnight every night.  This happens to 
be the same time at which our crontab was running the `mmcrsnapshot` so, as an 
experiment, we moved the snapshot to happen at 1a.

After this change, the late-night flapping has moved to 1a and now happens 
reliably every night at that time.  I saw a post on this list from 2013 stating 
that `mmcrsnapshot` was known to hang up the filesystem with race conditions 
that result in deadlocks and am wondering if that is still a problem with the 
`mmcrsnapshot` command.  Running the snapshots had not been an obvious problem 
before, but seems to have become one since we deployed ~300 additional GPFS 
clients in a remote cluster configuration about a week ago.

Can anybody comment on the safety of running `mmcrsnapshot` with a ~300 node 
remote cluster accessing the filesystem?

Also, I would comment that this is not the only condition under which we see 
instability in the NFS layer.  We continue to see intermittent instability 
through the day.  The creation of a snapshot is simply the one well-correlated 
condition that we've discovered so far.

Thanks so much to everyone for your help  :)

Stewart
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
(615)875-9633


