Hi,

We had to use lustre-2.5.3.90 on the MDS servers because of a memory leak:
https://jira.hpdd.intel.com/browse/LU-5726

I'm not sure if it's related, but it's worthwhile to check.

BR,
Tommi

----- Original Message -----
From: Mark Hahn <h...@mcmaster.ca>
To: lustre-discuss@lists.lustre.org
Sent: Tuesday, April 12, 2016 11:49 PM
Subject: [lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)

One of our MDSs is crashing with the following:

BUG: unable to handle kernel paging request at 00000000deadbeef
IP: [<ffffffffa0ce0328>] iam_container_init+0x18/0x70 [osd_ldiskfs]
PGD 0
Oops: 0002 [#1] SMP

The MDS is running 2.5.3-RC1--PRISTINE-2.6.32-431.23.3.el6_lustre.x86_64 with about 2k clients ranging from 1.8.8 to 2.6.0.

I'd appreciate any comments on where to point fingers: Google doesn't turn up anything suggestive about iam_container_init.

Our problem seems to correlate with the unintentional creation of a tree of >500M files. Some of the crashes we've had since then appeared to be related to vm.zone_reclaim_mode=1. We also enabled quotas right after the 500M-file incident, and were thinking that inconsistent quota records might cause this sort of crash. But 0xdeadbeef is usually used as a canary for allocation issues; is it used this way in Lustre?

thanks,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
McMaster RHPCS | h...@mcmaster.ca | 905 525 9140 x24687
Compute/Calcul Canada | http://www.computecanada.ca

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org