Hi Mark,

On Tue, Apr 12, 2016 at 04:49:10PM -0400, Mark Hahn wrote:
> One of our MDSs is crashing with the following:
>
> BUG: unable to handle kernel paging request at 00000000deadbeef
> IP: [<ffffffffa0ce0328>] iam_container_init+0x18/0x70 [osd_ldiskfs]
> PGD 0
> Oops: 0002 [#1] SMP
>
> The MDS is running 2.5.3-RC1--PRISTINE-2.6.32-431.23.3.el6_lustre.x86_64
> with about 2k clients ranging from 1.8.8 to 2.6.0
> I saw an identical crash in Sep 2014 when the MDS was put under memory
> pressure.
> to be related to vm.zone_reclaim_mode=1. We also enabled quotas

zone_reclaim_mode should always be 0; setting it to 1 is broken. Processes
hung perpetually 'scanning' in one zone in /proc/zoneinfo whilst plenty of
pages are free in another zone is a sure sign of this issue.

However, if you have vm.zone_reclaim_mode=0 now and are still seeing the
issue, then I would suspect that Lustre's overly aggressive memory-affinity
code is partially to blame. At the very least it is most likely stopping you
from making use of half your MDS RAM. See
https://jira.hpdd.intel.com/browse/LU-5050

Set

    options libcfs cpu_npartitions=1

to fix it. That's what I use on the OSS and MDS for all my clusters.

cheers,
robin
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
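[Editor's note: the two fixes discussed in the reply above can be applied
roughly as follows. This is a hedged sketch, not part of the original
message: the filenames under /etc/modprobe.d/ and /etc/sysctl.conf usage are
illustrative conventions, and the per-zone "scanned" counter assumes a
RHEL6-era (2.6.32) kernel like the one in the crash report.]

```shell
# 1. Check zone_reclaim_mode; it should report 0. If not, clear it at
#    runtime and persist the setting (path/convention is illustrative).
sysctl vm.zone_reclaim_mode
sysctl -w vm.zone_reclaim_mode=0
echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf

# 2. Look for the symptom described above: one zone perpetually
#    "scanning" while another zone still has plenty of free pages.
grep -E 'Node|pages free|scanned' /proc/zoneinfo

# 3. Stop Lustre from partitioning itself across NUMA nodes (LU-5050).
#    Takes effect the next time the Lustre modules are loaded.
echo 'options libcfs cpu_npartitions=1' > /etc/modprobe.d/libcfs.conf
```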