AFM maintains an in-memory queue at the gateway node to keep track of changes happening on the fileset. If the in-memory queue is lost (memory pressure, daemon shutdown, etc.), AFM runs a recovery process, which involves creating a snapshot, running a policy scan, and finally queueing the recovered operations. Due to message (operation) dependencies, any changes made to the AFM fileset during recovery won't be replicated until recovery completes. AFM scans the home directory tree only for directories that are dirty in the cache, in order to obtain the names of deleted and renamed files, because the old name of a renamed file and the name of a deleted file are no longer available on disk at the cache. A directory is marked dirty when a rename or unlink operation is performed inside it. In your case it may be that all the directories became dirty due to the rename/unlink operations. The AFM recovery process is single threaded.
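While recovery runs, the fileset state, the assigned gateway node and the queue length can be watched with mmafmctl. A minimal sketch (the file system name gpfs0 and the fileset name sw_cache are placeholders, not names from this thread):

    # Show cache state, gateway node and queue length for one fileset;
    # the "Cache State" column reports "Recovery" while recovery is running
    mmafmctl gpfs0 getstate -j sw_cache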
> Is this to be expected and normal behavior? What to do about it?
> Will every reboot of a gateway node trigger a recovery of all AFM filesets and a full scan of home? This would make normal rolling updates very impractical, or is there some better way?

Only the dirty directories are scanned; see above.

> Home is a GPFS cluster, hence we easily could produce the needed file list on home with a policy scan in a few minutes.

There is some work going on to preserve the names of unlinked/renamed files in the cache until they are replicated to home, so that the home directory scan can be avoided. Some issues have already been fixed in this regard; which Scale version are you running? See:

https://www-01.ibm.com/support/docview.wss?uid=isg1IJ15436

(A minimal sketch of such a home-side policy scan follows at the end of this message.)

~Venkat ([email protected])


From: "Billich Heinrich Rainer (ID SD)" <[email protected]>
To: gpfsug main discussion list <[email protected]>
Date: 01/08/2020 10:32 PM
Subject: [EXTERNAL] [gpfsug-discuss] AFM Recovery of SW cache does a full scan of home - is this to be expected?
Sent by: [email protected]

Hello,

Still new to AFM, so some basic questions on how recovery works for an SW cache:

We have an AFM SW cache in recovery mode. Recovery first ran policies on the cache cluster, but now I see a 'tcpcachescan' process on cache slowly scanning home via NFS. Single host, single process, no parallelism as far as I can see, but I may be wrong.

This scan of home by a cache AFM gateway takes very long while further updates on cache queue up. Home has about 100M files. After 8 hours I see about 70M entries in the file /var/mmfs/afm/…/recovery/homelist, i.e. we get about 2500 lines/s. (We may have very many changes on cache due to some recursive ACL operations, but I'm not sure.) So I expect 12 hours to pass building the file lists before recovery starts to update home.

I see some risk: in this time new changes pile up on cache. Memory may become an issue? Cache may fill up and we can't evict?

I wonder:
- Is this to be expected and normal behavior? What to do about it?
- Will every reboot of a gateway node trigger a recovery of all AFM filesets and a full scan of home? This would make normal rolling updates very impractical, or is there some better way?
- Home is a GPFS cluster, hence we easily could produce the needed file list on home with a policy scan in a few minutes.

Thank you, I will welcome any clarification, advice or comments.

Kind regards,
Heiner

--
=======================
Heinrich Billich
ETH Zürich
Informatikdienste
Tel.: +41 44 632 72 56
[email protected]
========================
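For reference, a file list like the one mentioned above could be produced on home with a LIST policy rule. A minimal sketch, assuming standard mmapplypolicy usage; the rule name, list name and paths are illustrative, not taken from this thread:

    # One-rule policy that lists every file and directory
    cat > /tmp/homelist.rules <<'EOF'
    RULE 'homelist' LIST 'allfiles' DIRECTORIES_PLUS
    EOF

    # Deferred run: only write the file list (/tmp/scan.list.allfiles),
    # do not act on the files themselves
    mmapplypolicy /gpfs/home -P /tmp/homelist.rules -f /tmp/scan -I defer

The scan can be spread across several nodes with the -N option, which is what makes it so much faster than a single-threaded walk over NFS.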
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
