Andreas
The file system has lru_max_age=9000000. I have been googling to find out
what this controls, but haven't found much. Is there documentation on how
memory management works with Lustre? I wonder what "lru" actually means
here. How is it that 2 files on the same node are not controlled by the
same LRU mechanism? SCR300's pages are being evicted even though they
were clearly used more recently than any in SCRATCH.
Thanks
John
On 12/12/2016 6:59 PM, Dilger, Andreas wrote:
On Dec 12, 2016, at 15:50, John Bauer <[email protected]> wrote:
I'm observing some undesirable caching of OSC data in the system buffers. This
is a single node, single process application. There are 2 files of interest,
SCRATCH and SCR300, both are scratch files with stripeCount=4. The system has
128GB of memory. Lustre maxes out at about 59GB of memory used for caching.
SCRATCH: About 22GB is written/read during the first 300 seconds of the run.
There is no further activity on the file (though it remains open) until about
18,700 seconds into the run, when another 22GB is written/read. This is shown
in the top frame of the first plot below. The bottom frame of the first plot
shows the amount of system cache used by each of the 4 OSCs associated with
the file over the course of the run (nearly identical, as would be expected).
Note that each of the OSCs retains its 5.5GB of memory even though nothing is
happening to the file.
SCR300: A 110GB file, written and repeatedly read between the times of the
above SCRATCH file's I/O.
What is of interest is that while SCR300 is doing all its I/O, and its
associated OSCs are fighting each other for caching memory, the 4 OSCs for
the inactive file (SCRATCH) retain their 22GB of memory. Why are the 4 OSCs
for the inactive file exempt from giving up their memory? The behavior is
very reproducible.
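For scale, the figures above can be put side by side. The sketch below is plain arithmetic on the numbers reported in this message. It assumes the roughly 59GB ceiling is shared by all OSCs on the node and that SCRATCH's pages stay resident for the whole SCR300 phase; the function and variable names are illustrative only.

```python
# Numbers taken from the message above; names are hypothetical.
TOTAL_CACHE_GB = 59.0      # observed ceiling on Lustre caching
OSCS_PER_FILE = 4          # stripeCount=4
SCRATCH_PER_OSC_GB = 5.5   # cache retained by each SCRATCH OSC
SCR300_SIZE_GB = 110.0     # size of the actively used file


def cache_budget(total_gb, held_by_idle_gb, active_file_gb):
    """Return (cache left for the active file, fraction of that file it can hold)."""
    remaining_gb = total_gb - held_by_idle_gb
    return remaining_gb, remaining_gb / active_file_gb


scratch_held_gb = OSCS_PER_FILE * SCRATCH_PER_OSC_GB
remaining_gb, fraction = cache_budget(TOTAL_CACHE_GB, scratch_held_gb, SCR300_SIZE_GB)

print(f"held by idle SCRATCH: {scratch_held_gb:.1f} GB")
print(f"left for SCR300:      {remaining_gb:.1f} GB")
print(f"share of SCR300 that fits in cache: {fraction:.1%}")
```

Under these assumptions the idle file holds 22GB, leaving about 37GB of cache for a 110GB file that is read repeatedly.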
You don't mention which Lustre version you are using, which makes it hard
to comment specifically. That said, you could try reducing the lock LRU
age, whose default was changed in the 2.8 or 2.9 release to 3900s
(65 minutes) from 36000s (10h), via:
    lctl set_param ldlm.namespaces.*.lru_max_age=3900000
Check what your current setting is first, though: the units are
"jiffies" (HZ), which may differ depending on kernel compile options.
Cheers, Andreas
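Because the units of lru_max_age depend on the tick rate, it may help to convert a raw value under several candidate rates before changing anything. This is a minimal sketch: the helper name and the list of candidate HZ values (100, 250, 1000) are assumptions for illustration, not values read from any Lustre client.

```python
def lru_max_age_seconds(raw_value, ticks_per_second):
    """Convert a raw lru_max_age value to seconds for an assumed tick rate."""
    return raw_value / ticks_per_second


OBSERVED = 9000000    # value reported for the file system in this thread
SUGGESTED = 3900000   # value from the lctl command above

# Candidate tick rates are assumptions; check the actual kernel configuration.
for hz in (100, 250, 1000):
    obs_s = lru_max_age_seconds(OBSERVED, hz)
    sug_s = lru_max_age_seconds(SUGGESTED, hz)
    print(f"HZ={hz:4d}: observed {obs_s:8.0f}s ({obs_s / 3600:5.2f}h), "
          f"suggested {sug_s:8.0f}s ({sug_s / 60:6.1f}min)")
```

If the tick rate were 250, the observed 9000000 would work out to 36000s, which matches the older 10h default. The suggested 3900000 equals 3900s only at 1000 units per second, so the value passed to lctl would need scaling for any other rate.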
The application is MSC.Nastran, which has the capability to put the data for
SCR300 inside of SCRATCH (increasing its size to 132GB). If run in this mode,
the caching is much better behaved and the job runs in 11,500 seconds,
versus 19,000. This is illustrated in the third plot below. While this is a
solution for this case, it is not a general solution.
Thanks
John
Plots for SCRATCH
<bfoimgfaenjmgmii.png>
Plots for SCR300
<mncccijbfkiekmmn.png>
Plots for SCR300 inside of SCRATCH
<adnondhpelpohhjf.png>
--
I/O Doctors, LLC
507-766-0378
[email protected]
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org