Hi David,

Have you tried setting /proc/sys/vm/zone_reclaim_mode to 3 or 7?
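For reference, that value is a bitmask (per the kernel's Documentation/sysctl/vm.txt): 1 turns zone reclaim on, 2 lets it write out dirty pages, 4 lets it swap. A small sketch that decodes a value (`decode_zrm` is just an illustrative helper, not a real tool):

```shell
# Decode a zone_reclaim_mode bitmask.  Bit meanings per
# Documentation/sysctl/vm.txt:
#   1 = do zone reclaim (reclaim locally before allocating off-node)
#   2 = allow writing out dirty pages during zone reclaim
#   4 = allow swapping anonymous pages during zone reclaim
decode_zrm() {
    v=$1; out=""
    [ $((v & 1)) -ne 0 ] && out="$out zone_reclaim"
    [ $((v & 2)) -ne 0 ] && out="$out write_dirty"
    [ $((v & 4)) -ne 0 ] && out="$out swap"
    out="${out# }"                 # drop the leading space
    echo "${out:-off}"
}

decode_zrm 3    # local reclaim + dirty writeback
decode_zrm 7    # the above plus swap

# To try it (as root; takes effect immediately, not persistent):
#   echo 3 > /proc/sys/vm/zone_reclaim_mode
```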

Cheers,
--
Steffen Persvold
Chief Architect NumaChip, Numascale AS
Tel: +47 23 16 71 88  Fax: +47 23 16 71 80 Skype: spersvold

> On 09 Jul 2015, at 20:44, mathog <[email protected]> wrote:
> 
> Reran the generators and that did make the system slow again, so at least 
> this problem can be reproduced.
> 
> After those ran, memory is definitely in short supply; pretty much everything 
> is in the file cache.  For whatever reason, the system seems loath to release 
> memory from the file cache for other uses.  I think that is the problem.
> 
> Here is some data, this is a bit long...
> 
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 
> 46
> node 0 size: 262098 MB
> node 0 free: 18372 MB
> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 
> 47
> node 1 size: 262144 MB
> node 1 free: 2829 MB
> node distances:
> node   0   1
>  0:  10  20
>  1:  20  10
> 
> CPU-specific tests were done on CPU 20, so NUMA node 0.  None of the tests 
> comes close to using up all the physical memory in a "node", which is 262GB.
> 
> When cache has been cleared, and the test programs run fast:
> cat /proc/meminfo | head -11
> MemTotal:       529231456 kB
> MemFree:        525988868 kB
> Buffers:            5428 kB
> Cached:            46544 kB
> SwapCached:          556 kB
> Active:            62220 kB
> Inactive:         121316 kB
> Active(anon):      26596 kB
> Inactive(anon):   109456 kB
> Active(file):      35624 kB
> Inactive(file):    11860 kB
> 
> run one test and it jumps up to
> 
> MemTotal:       529231456 kB
> MemFree:        491812500 kB
> Buffers:           10644 kB
> Cached:         34139976 kB
> SwapCached:          556 kB
> Active:         34152592 kB
> Inactive:         130400 kB
> Active(anon):      27560 kB
> Inactive(anon):   109316 kB
> Active(file):   34125032 kB
> Inactive(file):    21084 kB
> 
> and the next test is still quick.  After running the generators, but when 
> nothing much is running, it starts like this:
> 
> cat /proc/meminfo | head -11
> MemTotal:       529231456 kB
> MemFree:        19606616 kB
> Buffers:           46704 kB
> Cached:         493107268 kB
> SwapCached:          556 kB
> Active:         34229020 kB
> Inactive:       459056372 kB
> Active(anon):        712 kB
> Inactive(anon):   135508 kB
> Active(file):   34228308 kB
> Inactive(file): 458920864 kB
> 
> Then when a test job is run it drops quickly to this and sticks. Note the 
> MemFree value.  I think this is where the "Events/20" process kicks in:
> 
> cat /proc/meminfo | head -11
> MemTotal:       529231456 kB
> MemFree:          691740 kB
> Buffers:           46768 kB
> Cached:         493056968 kB
> SwapCached:          556 kB
> Active:         53164328 kB
> Inactive:       459006232 kB
> Active(anon):   18936048 kB
> Inactive(anon):   135608 kB
> Active(file):   34228280 kB
> Inactive(file): 458870624 kB
> 
> Kill the process and the system "recovers" to the preceding memory 
> configuration in a few seconds.  Similarly, here are /proc/zoneinfo values 
> from before the generators were run, when the system was fast:
> 
> extract -in state_zoneinfo_fast3.txt -if '^Node' -ifn 10  -ifonly
> Node 0, zone      DMA
>  pages free     3931
>        min      0
>        low      0
>        high     0
>        scanned  0
>        spanned  4095
>        present  3832
>    nr_free_pages 3931
>    nr_inactive_anon 0
>    nr_active_anon 0
> Node 0, zone    DMA32
>  pages free     105973
>        min      139
>        low      173
>        high     208
>        scanned  0
>        spanned  1044480
>        present  822056
>    nr_free_pages 105973
>    nr_inactive_anon 0
>    nr_active_anon 0
> Node 0, zone   Normal
>  pages free     50199731
>        min      11122
>        low      13902
>        high     16683
>        scanned  0
>        spanned  66256896
>        present  65351040
>    nr_free_pages 50199731
>    nr_inactive_anon 16490
>    nr_active_anon 7191
> Node 1, zone   Normal
>  pages free     57596396
>        min      11265
>        low      14081
>        high     16897
>        scanned  0
>        spanned  67108864
>        present  66191360
>    nr_free_pages 57596396
>    nr_inactive_anon 10839
>    nr_active_anon 1772
> 
> and after the generators were run (slow):
> 
> Node 0, zone      DMA
>  pages free     3931
>        min      0
>        low      0
>        high     0
>        scanned  0
>        spanned  4095
>        present  3832
>    nr_free_pages 3931
>    nr_inactive_anon 0
>    nr_active_anon 0
> Node 0, zone    DMA32
>  pages free     105973
>        min      139
>        low      173
>        high     208
>        scanned  0
>        spanned  1044480
>        present  822056
>    nr_free_pages 105973
>    nr_inactive_anon 0
>    nr_active_anon 0
> Node 0, zone   Normal
>  pages free     23045
>        min      11122
>        low      13902
>        high     16683
>        scanned  0
>        spanned  66256896
>        present  65351040
>    nr_free_pages 23045
>    nr_inactive_anon 16486
>    nr_active_anon 5839
> Node 1, zone   Normal
>  pages free     33726
>        min      11265
>        low      14081
>        high     16897
>        scanned  0
>        spanned  67108864
>        present  66191360
>    nr_free_pages 33726
>    nr_inactive_anon 10836
>    nr_active_anon 1065
> 
> Looking the same way at /proc/zoneinfo while a test is running showed
> the "pages free" and "nr_free_pages" values oscillating downward to
> a low of about 28000 for Node 0, zone Normal.  The rest of the values were 
> essentially stable.
> 
> Looking the same way at /proc/meminfo while a test is running gave values 
> that differed in only minor ways from the "after" table shown above.  MemFree 
> varied in a range from about 680000 to 720000 kB.
> Cached dropped to ~482407184 kB and then barely budged.
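One way to keep an eye on this is to compute Cached as a fraction of MemTotal straight from /proc/meminfo; a minimal sketch (`cache_pct` is just an illustrative name):

```shell
# Print Cached as a whole-number percentage of MemTotal, reading
# meminfo-format text on stdin (e.g.  cache_pct < /proc/meminfo).
# /^Cached:/ is anchored, so SwapCached: does not match.
cache_pct() {
    awk '/^MemTotal:/ {t=$2} /^Cached:/ {c=$2} END {printf "%d\n", 100 * c / t}'
}
```

In the "after" table above that comes out around 93%, i.e. nearly all of RAM is page cache.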
> 
> Finally, the last few lines from "sar -B" (wrapped lines rejoined; the 
> column-header row repeats mid-stream):
> 
> 10:30:03 AM   5810.55 301475.26     95.99      0.05  51710.29  48086.79      0.00  48084.94    100.00
> 10:40:01 AM   3404.90 185502.87     96.67      0.01  47267.84  44816.30   3149.01   3197.55    100.00
> 10:50:02 AM      9.13     13.32    192.24      0.11   4592.56     48.54   3149.01   3197.55    100.00
> 11:00:01 AM    191.78      9.97    347.56      0.13  16760.51      0.00   3683.21   3683.21    100.00
> 
> 11:00:01 AM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
> 11:10:01 AM     11.64      7.75    342.59      0.09  18528.24      0.00   1699.66   1699.66    100.00
> 11:20:01 AM      0.00      6.75     96.87      0.00     43.97      0.00      0.00      0.00      0.00
> 
> The generators finished at 10:35.  At the 10:30 data point (while they were 
> running), pgscank/s and pgsteal/s jumped from 0 to high values.  When the 
> later tests were run, the former fell back to almost nothing but the latter 
> stayed high.  Additionally, the test runs made after the generators pushed 
> pgscand/s from 0 to several thousand per second.  The last row covers a 10 
> minute span in which no tests were run, and all of these values dropped back 
> to zero.
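If it helps in reading those columns: %vmeff is, as far as I can tell, pgsteal divided by the total pages scanned (pgscank + pgscand), as a percentage, so 100% means every page scanned was successfully reclaimed. A quick sketch (`vmeff` is an illustrative helper, not a sar option):

```shell
# %vmeff = pgsteal/s / (pgscank/s + pgscand/s) * 100
# Args: pgscank/s  pgscand/s  pgsteal/s
vmeff() {
    awk -v k="$1" -v d="$2" -v s="$3" \
        'BEGIN { if (k + d > 0) printf "%.2f\n", 100 * s / (k + d); else print "0.00" }'
}

vmeff 48086.79 0.00 48084.94   # the 10:30 row above
```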
> 
> Since excessive file cache seems to be implicated, I did this:
> echo 3 > /proc/sys/vm/drop_caches
> 
> and reran the test on CPU 20.  It was fast.
> 
> I guess the question now is which parameter(s) control the reclaiming of 
> file-cache memory for other uses when free memory is in short supply and 
> there is substantial demand.  It seems the OS isn't releasing the cache.  Or 
> maybe it isn't flushing it to disk first.  I don't think it's the latter, 
> because iotop and iostat don't show any activity during a "slow" read.
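For what it's worth, the knobs most often mentioned for this kind of reclaim behaviour are the sysctls below; this is only a sketch, and the values are illustrative starting points, not recommendations:

```shell
# Candidate /etc/sysctl.conf entries (illustrative values only):
vm.zone_reclaim_mode = 1      # reclaim within the local node before going off-node
vm.vfs_cache_pressure = 200   # reclaim dentry/inode caches more readily (default 100)
vm.min_free_kbytes = 1048576  # keep a larger reserve of free pages per system
# Apply the file with:  sysctl -p
```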
> 
> Thanks,
> 
> David Mathog
> [email protected]
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, [email protected] sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf
