My environment is "heterogeneous": my authentication and home servers are currently stuck on a 1G shared network, while the production servers and storage servers are on a bonded 40G network. All are in the same VLAN. I have about 100 servers on the bonded 40G network, each with 12 cores and 128GB of memory.

They are running CentOS 6.6.

Except for my storage servers, they are all just running large and small research jobs on a grid engine.


Two questions:

The errors that one problem user (described below) seems to trigger are:

lockd: spurious grace period reject?!
lockd: failed to reclaim lock for pid 8225 (errno -37, status 4)
lockd: spurious grace period reject?!
lockd: failed to reclaim lock for pid 8225 (errno -37, status 4)

At some point, we start getting errors saying that the file locks are stuck. You can still read from and write to the lock file, but programs that depend on the C file-locking construct throw file-lock errors until we reboot.
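One way to check a node without waiting for a job to fail is a small lock probe. The sketch below is only an illustration: `probe_lock` and the lock-file path are hypothetical names, and it assumes the failing programs use POSIX (fcntl-style) locks, which the post does not confirm. On Linux, errno 37 is ENOLCK ("No locks available"), which matches the `errno -37` in the lockd messages above.

```python
import errno
import fcntl
import os


def probe_lock(path):
    """Try to take and release a non-blocking POSIX lock on path.

    Returns a short status string instead of raising for the
    common failure cases.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # LOCK_NB: fail right away if another process holds the lock.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except (IOError, OSError) as exc:
            if exc.errno == errno.ENOLCK:
                return "no locks available (ENOLCK)"
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                return "held by another process"
            raise
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return "ok"
    finally:
        os.close(fd)
```

Calling `probe_lock()` on a scratch file in an NFS-mounted home directory (any path you choose) should return "ok" on a healthy node. Note that if the lock server is not responding, the call may hang rather than return, even with the non-blocking flag, so it is safer to run it under a timeout.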



Why are the outputs of dmesg, /var/log/dmesg, and /var/log/messages different from each other?
I thought dmesg was a representation of /var/log/messages.


Is there a way to get a date stamp for the dmesg output? If a job failed in the last hour and the message is from yesterday, but I have no way of knowing that, the message doesn't help.
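If the kernel is prefixing its messages with seconds-since-boot stamps such as `[  120.500000]` (the dmesg output quoted in this post shows none, so this assumes printk timestamps have been enabled, for example with the `printk.time=1` boot parameter), those stamps can be converted to approximate wall-clock times from the current uptime. This is a rough sketch, not a definitive tool; the function names are made up for illustration.

```python
import re
from datetime import timedelta

# Matches a leading kernel timestamp like "[  120.500000] message".
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def boot_time(now, uptime_seconds):
    """Estimate when the machine booted from the current time and uptime."""
    return now - timedelta(seconds=uptime_seconds)


def stamp_line(line, booted):
    """Replace a seconds-since-boot prefix with a wall-clock date."""
    m = STAMP.match(line)
    if m is None:
        return line  # no kernel timestamp on this line, leave it alone
    when = booted + timedelta(seconds=float(m.group(1)))
    return "{0} {1}".format(when.strftime("%Y-%m-%d %H:%M:%S"), m.group(2))


def stamp_stream(lines, now, uptime_seconds):
    """Convert every line of a dmesg dump."""
    booted = boot_time(now, uptime_seconds)
    return [stamp_line(line.rstrip("\n"), booted) for line in lines]
```

To use it, pass the lines of dmesg output (for example from `sys.stdin`), `datetime.now()`, and the first field of /proc/uptime as a float. The result is only an estimate, since the kernel's clock since boot can drift from wall-clock time.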


I think what I am troubleshooting is THAT user who REFUSES to follow directions. She is sending thousands of very large jobs, each of which might immediately spawn another 10-20 jobs, to a grid of 100 servers in a matter of seconds, overwhelming either the network, the home directory server, or the authentication server. When she strikes, users sometimes cannot get a response from LDAP or the home server for as long as 10 seconds. NFS then breaks because it gets hammered, and I have to restart all the servers on my grid.

We have also had problems with "out of memory" errors caused by her programs in the recent past, and had to restart all 100 servers.


*/var/log/messages gives this*

Oct 18 13:26:08 blade5-2-1 nslcd[2520]: [dd5cc5] ldap_result() failed: Can't contact LDAP server
Oct 18 13:26:08 blade5-2-1 nslcd[2520]: [dd5cc5] ldap_abandon() failed to abandon search: Other (e.g., implementation specific) error
Oct 18 13:27:14 blade5-2-1 nslcd[2520]: [e01acb] ldap_result() failed: Can't contact LDAP server
Oct 18 13:27:30 blade5-2-1 nslcd[2520]: [8c7a8f] ldap_result() failed: Can't contact LDAP server


*dmesg gives these*

lockd: server home not responding, still trying
lockd: server home OK

lockd: spurious grace period reject?!
lockd: failed to reclaim lock for pid 8225 (errno -37, status 4)
lockd: spurious grace period reject?!
lockd: failed to reclaim lock for pid 8225 (errno -37, status 4)

*/var/log/dmesg gives this*

ipmi_si: probing via SMBIOS
ipmi_si: SMBIOS: io 0xca8 regsize 1 spacing 4 irq 10
ipmi_si: Adding SMBIOS-specified kcs state machine
ipmi_si: Trying SMBIOS-specified kcs state machine at i/o address 0xca8, slave address 0x20, irq 10
(NULL device *): The BMC does not support setting the recv irq bit, compensating, but the BMC needs to be fixed.
IRQ 10/ipmi_si: IRQF_DISABLED is not guaranteed on shared IRQs
ipmi_si ipmi_si.0: Using irq 10
ipmi_si ipmi_si.0: Found new BMC (man_id: 0x0002a2, prod_id: 0x0100, dev_id: 0x20)
ipmi_si ipmi_si.0: IPMI kcs interface initialized
ACPI: No handler for Region [SYSI] (ffff882029e57348) [IPMI]
power_meter ACPI000D:00: Found ACPI power meter.
ipmi device interface
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts:
EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts:
EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts:
EXT4-fs (dm-4): mounted filesystem with ordered data mode. Opts:
EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts:
Adding 121724924k swap on /dev/mapper/vg_server-lv_swap. Priority:-1 extents:1 across:121724924

_______________________________________________
CentOS mailing list
CentOS@centos.org
https://lists.centos.org/mailman/listinfo/centos
