At 2:52 PM -0600 4/4/07, Andreas Dilger wrote:
On Apr 04, 2007  11:57 -0800, Jan H. Julian wrote:
 These are client nodes, and in fact this class of node is running a
 particular application that intermittently fails, leaving a Lustre
 error in syslog:

 "mg38 kernel: LustreError:
 10147:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue
 objid 0x3922667 subobj 0x15dfc on OST idx 1: rc = -4"

This means that the enqueue was interrupted (-4 = -EINTR in
/usr/include/asm/errno.h).  That shouldn't happen unless the job was
waiting a long time already (at least 100s, and then it was killed).
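
(For what it's worth, a quick way to decode such rc values is a minimal
sketch along these lines, using Python's errno module; the -4 is taken
from the log line above:)

    import errno, os

    rc = -4                                  # rc value from the LustreError line
    print("%s: %s" % (errno.errorcode[-rc],  # 'EINTR'
                      os.strerror(-rc)))     # 'Interrupted system call'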

What does /proc/slabinfo show for lustre allocated memory in the slab
cache (most items are "ll_*")?


For mg37, mg38, and mg39 at this time I see:

mg37
ll_async_page         256   559   288  13  1 : tunables  54  27  8 : slabdata  43  43  0
ll_file_data           16   120   128  30  1 : tunables 120  60  8 : slabdata   4   4  0
ll_import_cache         0     0   360  11  1 : tunables  54  27  8 : slabdata   0   0  0
ll_obdo_cache           0     0   208  19  1 : tunables 120  60  8 : slabdata   0   0  0
ll_obd_dev_cache       38    38  5120   1  2 : tunables   8   4  0 : slabdata  38  38  0
eventpoll_pwq          33    53    72  53  1 : tunables 120  60  8 : slabdata   1   1  0
eventpoll_epi          30    30   256  15  1 : tunables 120  60  8 : slabdata   2   2  0

mg38
ll_async_page         256   754   288  13  1 : tunables  54  27  8 : slabdata  58  58  0
ll_file_data           16    90   128  30  1 : tunables 120  60  8 : slabdata   3   3  0
ll_import_cache         0     0   360  11  1 : tunables  54  27  8 : slabdata   0   0  0
ll_obdo_cache           0     0   208  19  1 : tunables 120  60  8 : slabdata   0   0  0
ll_obd_dev_cache       38    38  5120   1  2 : tunables   8   4  0 : slabdata  38  38  0
eventpoll_pwq          49    53    72  53  1 : tunables 120  60  8 : slabdata   1   1  0
eventpoll_epi          45    45   256  15  1 : tunables 120  60  8 : slabdata   3   3  0

mg39
ll_async_page         256   637   288  13  1 : tunables  54  27  8 : slabdata  49  49  0
ll_file_data           16    90   128  30  1 : tunables 120  60  8 : slabdata   3   3  0
ll_import_cache         0     0   360  11  1 : tunables  54  27  8 : slabdata   0   0  0
ll_obdo_cache           0     0   208  19  1 : tunables 120  60  8 : slabdata   0   0  0
ll_obd_dev_cache       38    38  5120   1  2 : tunables   8   4  0 : slabdata  38  38  0
eventpoll_pwq          49    53    72  53  1 : tunables 120  60  8 : slabdata   1   1  0
eventpoll_epi          45    45   256  15  1 : tunables 120  60  8 : slabdata   3   3  0
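
(To make these easier to compare, here is a rough sketch -- assuming the
slabinfo column layout shown above, i.e. name, active objects, total
objects, object size, objects per slab, pages per slab -- that totals the
memory currently held by the "ll_*" caches:)

    # rough sketch: total memory held by Lustre "ll_*" slab caches
    # assumes /proc/slabinfo lines look like the output above:
    #   <name> <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : ...
    total = 0
    for line in open('/proc/slabinfo'):
        fields = line.split()
        if not fields or not fields[0].startswith('ll_'):
            continue
        num_objs, objsize = int(fields[2]), int(fields[3])
        total += num_objs * objsize
        print("%-20s %6d objs x %4d bytes" % (fields[0], num_objs, objsize))
    print("total ll_* slab memory: %.1f KB" % (total / 1024.0))

(For the mg38 output above that works out to roughly 400 KB held in the
ll_* caches.)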



 While mg38 is currently showing a negative value for memused, I have
 not been able to tie that to a failure.  The error message points to
 the same file.
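
 (Assuming the 32-bit signed counter overflow Andreas describes below, and
 that the counter has only wrapped once, the negative reading can be
 reinterpreted as an unsigned value to recover the raw byte count, e.g. for
 mg09:)

     # reinterpret a wrapped 32-bit signed counter as unsigned
     memused = -1323553489              # value reported for mg09
     raw = memused & 0xFFFFFFFF         # 2971413807 bytes, about 2.77 GB
     print("%d bytes (~%.2f GB)" % (raw, raw / 2.0 ** 30))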

 At 1:34 PM -0600 4/4/07, Andreas Dilger wrote:
 >On Apr 04, 2007  11:28 -0800, Jan H. Julian wrote:
 >> Could someone please clarify the use of the proc values for
 >> subsystem_debug and memused.  In regard to
 >> /proc/sys/portals/subsystem_debug and /proc/sys/portals/debug, should
 >> both be set to zero to totally turn off debugging?
 >>
 >> In regard to /proc/sys/lustre/memused we see quite a variety of
 >> entries, including many with negative values.  Does the negative value
 >> have a particular meaning?
 >> For instance "cat /proc/sys/lustre/memused" for 9 nodes shows:
 >> ...
 >> mg07  102186899
 >> mg08  101775995
 >> mg09  -1323553489
 >> mg10  -1328553965
 >> mg11  -1378379739
 >> mg12  -1347059989
 >> mg13  -1364717487
 >> mg14  -1358477913
 >> mg15  24680370
 >>
 >> These are 16-core machines with 64GB of resident memory.
 >
 >This appears to be an overflow of a 32-bit counter.  It isn't strictly
 >harmful, because it will underflow an equal amount later on and should
 >return to zero when Lustre unmounts.  It does make this stat less useful
 >on machines with lots of RAM.
 >
 >Are these client or server nodes?  I'm a bit surprised that Lustre would
 >be allocating > 2GB of RAM.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


--
Jan Julian     University of Alaska, ARSC    mailto:[EMAIL PROTECTED]
(907) 450-8641  910 Yukon Drive, Suite 001    http://www.arsc.edu
Fax: 450-8605  Fairbanks, AK 99775-6020 USA

_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
