Re: [gpfsug-discuss] [non-nasa source] Re: pagepool shrink doesn't release all memory

Aaron Knister Wed, 07 Mar 2018 02:25:59 -0800

Following up on this...

On one of the nodes on which I'd bounced the pagepool around I managed to cause what appeared to that node as filesystem corruption (i/o errors and fsstruct errors) on every single fs. Thankfully none of the other nodes in the cluster seemed to agree that the fs was corrupt. I'll open a PMR on that but I thought it was interesting nonetheless. I haven't run an fsck on any of the filesystems but my belief is that they're OK since so far none of the other nodes in the cluster have complained.

Secondly, I can see the pagepool allocations that align with registered verbs MRs (looking at mmfsadm dump verbs). In theory one can free an ib MR after registration as long as it's not in use, but one has to track that and I could see that being a tricky thing (although in theory, given the fact that GPFS has its own page allocator, it might be relatively trivial to figure it out, but it might also require re-establishing RDMA connections depending on whether or not a given QP is associated with a PD that uses the MR being freed...I think that makes sense).

Anyway, I'm wondering if the need to free the ib MR on pagepool shrink could be avoided altogether by limiting the amount of memory that gets allocated to verbs MRs (e.g. something like verbsPagePoolMaxMB) so that those regions never need to be freed but the amount of memory available for user caching could grow and shrink as required. It's probably not that simple, though :)

Another thought I had was doing something like creating a file in /dev/shm, registering it as a loopback device, and using that as an LROC device. I just don't think that's feasible at scale given the current method of LROC device registration (e.g. via the mmsdrfs file).

I think there's much to be gained from the ability to dynamically change the memory-based file cache size on a per-job basis so I'm really hopeful we can find a way to make this work.


-Aaron

On 2/25/18 11:45 AM, Aaron Knister wrote:

Hmm...interesting. It sure seems to try :)

The pmap command was this:

pmap $(pidof mmfsd) | sort -n -k3 | tail

-Aaron

On 2/23/18 9:35 AM, IBM Spectrum Scale wrote:

AFAIK you can increase the pagepool size dynamically but you cannot shrink it dynamically. To shrink it you must restart the GPFS daemon. Also, could you please provide the actual pmap commands you executed?


Regards, The Spectrum Scale (GPFS) team

------------------------------------------------------------------------
If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479.

If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries.

The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team.




From: Aaron Knister <[email protected]>
To: <[email protected]>
Date: 02/22/2018 10:30 PM
Subject: Re: [gpfsug-discuss] pagepool shrink doesn't release all memory
Sent by: [email protected]
------------------------------------------------------------------------



This is also interesting (although I don't know what it really means).
Looking at pmap run against mmfsd I can see what happens after each step:

# baseline
00007fffe4639000  59164K      0K      0K      0K      0K ---p [anon]
00007fffd837e000  61960K      0K      0K      0K      0K ---p [anon]
0000020000000000 1048576K 1048576K 1048576K 1048576K      0K rwxp [anon]
Total:           1613580K 1191020K 1189650K 1171836K      0K

# tschpool 64G
00007fffe4639000  59164K      0K      0K      0K      0K ---p [anon]
00007fffd837e000  61960K      0K      0K      0K      0K ---p [anon]
0000020000000000 67108864K 67108864K 67108864K 67108864K  0K rwxp [anon]
Total:           67706636K 67284108K 67282625K 67264920K      0K

# tschpool 1G
00007fffe4639000  59164K      0K      0K      0K      0K ---p [anon]
00007fffd837e000  61960K      0K      0K      0K      0K ---p [anon]
0000020001400000 139264K 139264K 139264K 139264K      0K rwxp [anon]
0000020fc9400000 897024K 897024K 897024K 897024K      0K rwxp [anon]
0000020009c00000 66052096K      0K      0K      0K      0K rwxp [anon]
Total:           67706636K 1223820K 1222451K 1204632K      0K

Even though mmfsd has that 64G chunk allocated there's none of it
*used*. I wonder why Linux seems to be accounting it as allocated.
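
To put numbers on that last listing, here is a minimal Python sketch that totals the mapped and resident sizes of the rwxp [anon] regions. It assumes the second column of this pmap output is the mapping size and the third is the resident size (the Total lines are consistent with that reading, but nothing in the thread states the column layout). The function name summarize is made up for illustration.

```python
# Rows copied from the "# tschpool 1G" pmap listing above.
PMAP_AFTER_SHRINK = """\
00007fffe4639000  59164K      0K      0K      0K      0K ---p [anon]
00007fffd837e000  61960K      0K      0K      0K      0K ---p [anon]
0000020001400000 139264K 139264K 139264K 139264K      0K rwxp [anon]
0000020fc9400000 897024K 897024K 897024K 897024K      0K rwxp [anon]
0000020009c00000 66052096K      0K      0K      0K      0K rwxp [anon]
"""


def summarize(text, perms="rwxp"):
    """Sum (mapped KiB, resident KiB) over rows with the given permissions.

    Assumption: column 2 is the mapping size and column 3 is the resident
    size, both suffixed with 'K'; the permissions are the next-to-last field.
    """
    mapped_kib = resident_kib = 0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[-2] != perms:
            continue
        mapped_kib += int(fields[1].rstrip("K"))
        resident_kib += int(fields[2].rstrip("K"))
    return mapped_kib, resident_kib


if __name__ == "__main__":
    mapped_kib, resident_kib = summarize(PMAP_AFTER_SHRINK)
    # 2**20 KiB == 1 GiB
    print(f"mapped:   {mapped_kib / 2**20:.2f} GiB")
    print(f"resident: {resident_kib / 2**20:.2f} GiB")
```

On the rows above this comes to roughly 63.98 GiB mapped against roughly 0.99 GiB resident, so under that column reading the shrink dropped residency without unmapping the range.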

-Aaron

On 2/22/18 10:17 PM, Aaron Knister wrote:

 > I've been exploring the idea for a while of writing a SLURM SPANK
 > plugin to allow users to dynamically change the pagepool size on a
 > node. Every now and then we have some users who would benefit
 > significantly from a much larger pagepool on compute nodes but by
 > default we keep it on the smaller side to make as much physmem
 > available as possible to batch work.

 >
 > In testing, though, it seems as though reducing the pagepool doesn't
 > quite release all of the memory. I don't really understand it because
 > I've never before seen memory that was previously resident become
 > un-resident but still maintain the virtual memory allocation.
 >
 > Here's what I mean. Let's take a node with 128G and a 1G pagepool.
 >
 > If I do the following to simulate what might happen as various jobs
 > tweak the pagepool:
 >
 > - tschpool 64G
 > - tschpool 1G
 > - tschpool 32G
 > - tschpool 1G
 > - tschpool 32G
 >
 > I end up with this:
 >
 > mmfsd thinks there's 32G resident but 64G virt
 > # ps -o vsz,rss,comm -p 24397
 >     VSZ   RSS COMMAND
 > 67589400 33723236 mmfsd
 >
 > however, linux thinks there's ~100G used
 >
 > # free -g
 >                    total       used       free     shared    buffers     cached
 > Mem:                 125        100         25          0          0          0
 > -/+ buffers/cache:               98         26
 > Swap:                  7          0          7
 >
 > I can jump back and forth between 1G and 32G *after* allocating 64G
 > pagepool and the overall amount of memory in use doesn't balloon but I
 > can't seem to shed that original 64G.
 >
 > I don't understand what's going on... :) Any ideas? This is with Scale
 > 4.2.3.6.
 >
 > -Aaron
 >
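
On the "resident becoming un-resident while the virtual allocation stays" point: that pattern can be reproduced on Linux with madvise(MADV_DONTNEED) on a private anonymous mapping. The sketch below is only a generic demonstration of that kernel behaviour, assuming Linux and Python 3.8 or later. Nothing in this thread establishes that mmfsd shrinks the pagepool this way, and it says nothing about why free still counts the memory as used. The names status_kib and demo are made up, and 64 MiB stands in for the 64G pagepool.

```python
import mmap

SIZE = 64 * 1024 * 1024  # 64 MiB stand-in for the 64G pagepool (assumption)


def status_kib(field):
    """Return a field such as VmSize or VmRSS (in KiB) from /proc/self/status."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)


def snapshot():
    """Current (virtual KiB, resident KiB) of this process."""
    return status_kib("VmSize"), status_kib("VmRSS")


def demo():
    base = snapshot()

    # Private anonymous mapping: address space is reserved, nothing resident yet.
    region = mmap.mmap(-1, SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

    # Touch one byte per page so the kernel backs the range with real memory.
    for off in range(0, SIZE, mmap.PAGESIZE):
        region[off] = 1
    touched = snapshot()

    # Hand the pages back to the kernel but keep the mapping itself.
    region.madvise(mmap.MADV_DONTNEED)
    dropped = snapshot()

    region.close()
    return base, touched, dropped


if __name__ == "__main__":
    for label, (vsz, rss) in zip(("baseline", "touched", "dropped"), demo()):
        print(f"{label:9s} VmSize={vsz}K VmRSS={rss}K")
```

After the madvise call VmRSS falls back close to the baseline while VmSize still includes the 64 MiB mapping, which is the same shape as the pmap rows showing a large size and 0K resident.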

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss






