At 2:52 PM -0600 4/4/07, Andreas Dilger wrote:
>On Apr 04, 2007 11:57 -0800, Jan H. Julian wrote:
>> These are client nodes and, in fact, this class of node is running a
>> particular application that intermittently fails, leaving a Lustre
>> error in syslog:
>>
>> "mg38 kernel: LustreError:
>> 10147:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue
>> objid 0x3922667 subobj 0x15dfc on OST idx 1: rc = -4"
>
>This means that the enqueue was interrupted (-4 = -EINTR in
>/usr/include/asm/errno.h). That shouldn't happen unless the job was
>waiting a long time already (at least 100s, and then it was killed).
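
As an aside, the rc-to-errno mapping is easy to double-check on any of the
clients; a small sketch using Python's standard errno module (nothing
Lustre-specific, the variable name is just illustrative):

    import errno, os
    rc = -4                        # return code from the LustreError line above
    print(errno.errorcode[-rc])    # -> 'EINTR'
    print(os.strerror(-rc))        # -> 'Interrupted system call'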
>
>What does /proc/slabinfo show for lustre allocated memory in the slab
>cache (most items are "ll_*")?
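
A quick way to pull out just the ll_* entries and total the memory they hold
is sketched below; it assumes the slabinfo 2.x column order (name,
active_objs, num_objs, objsize, ...), and the helper name is only
illustrative:

    # Rough sketch: sum memory held by Lustre ("ll_*") slab caches.
    # Assumes /proc/slabinfo version 2.x columns:
    #   name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : ...
    def lustre_slab_usage(path="/proc/slabinfo"):
        total = 0
        for line in open(path):
            if not line.startswith("ll_"):
                continue
            fields = line.split()
            name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
            used = num_objs * objsize              # bytes held by this cache
            total += used
            print("%-20s %6d objs x %4d B = %9d B" % (name, num_objs, objsize, used))
        print("total ll_* slab usage: %.1f MB" % (total / (1024.0 * 1024.0)))

    lustre_slab_usage()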
For mg38 at this time (with mg37 and mg39 included for comparison) I see:
mg37
ll_async_page        256    559    288   13    1 : tunables   54   27    8 : slabdata     43     43      0
ll_file_data          16    120    128   30    1 : tunables  120   60    8 : slabdata      4      4      0
ll_import_cache        0      0    360   11    1 : tunables   54   27    8 : slabdata      0      0      0
ll_obdo_cache          0      0    208   19    1 : tunables  120   60    8 : slabdata      0      0      0
ll_obd_dev_cache      38     38   5120    1    2 : tunables    8    4    0 : slabdata     38     38      0
eventpoll_pwq         33     53     72   53    1 : tunables  120   60    8 : slabdata      1      1      0
eventpoll_epi         30     30    256   15    1 : tunables  120   60    8 : slabdata      2      2      0
mg38
ll_async_page        256    754    288   13    1 : tunables   54   27    8 : slabdata     58     58      0
ll_file_data          16     90    128   30    1 : tunables  120   60    8 : slabdata      3      3      0
ll_import_cache        0      0    360   11    1 : tunables   54   27    8 : slabdata      0      0      0
ll_obdo_cache          0      0    208   19    1 : tunables  120   60    8 : slabdata      0      0      0
ll_obd_dev_cache      38     38   5120    1    2 : tunables    8    4    0 : slabdata     38     38      0
eventpoll_pwq         49     53     72   53    1 : tunables  120   60    8 : slabdata      1      1      0
eventpoll_epi         45     45    256   15    1 : tunables  120   60    8 : slabdata      3      3      0
mg39
ll_async_page        256    637    288   13    1 : tunables   54   27    8 : slabdata     49     49      0
ll_file_data          16     90    128   30    1 : tunables  120   60    8 : slabdata      3      3      0
ll_import_cache        0      0    360   11    1 : tunables   54   27    8 : slabdata      0      0      0
ll_obdo_cache          0      0    208   19    1 : tunables  120   60    8 : slabdata      0      0      0
ll_obd_dev_cache      38     38   5120    1    2 : tunables    8    4    0 : slabdata     38     38      0
eventpoll_pwq         49     53     72   53    1 : tunables  120   60    8 : slabdata      1      1      0
eventpoll_epi         45     45    256   15    1 : tunables  120   60    8 : slabdata      3      3      0
While mg38 is currently showing a negative value for memused, I have
not been able to tie that to a failure. The error message points to
the same file.
At 1:34 PM -0600 4/4/07, Andreas Dilger wrote:
>On Apr 04, 2007 11:28 -0800, Jan H. Julian wrote:
>> Could someone please clarify the use of the proc values for
>> subsystem_debug and memused. In regard to
>> /proc/sys/portals/subsystem_debug and /proc/sys/portals/debug, should
>> both be set to zero to totally turn off debugging?
>>
>> In regard to /proc/sys/lustre/memused we see quite a variety of
>> entries included many with negative values. Does the negative value
>> have a particular meaning?
>> For instance "cat /proc/sys/lustre/memused" for 9 nodes shows:
>> ...
>> mg07 102186899
>> mg08 101775995
>> mg09 -1323553489
>> mg10 -1328553965
>> mg11 -1378379739
>> mg12 -1347059989
>> mg13 -1364717487
>> mg14 -1358477913
>> mg15 24680370
>>
>> These are 16 core machines with 64GB of resident memory.
>
>This appears to be an overflow of a 32-bit counter. It isn't strictly
>harmful, because it will underflow an equal amount later on and should
>return to zero when Lustre unmounts. It does make this stat less useful
>on machines with lots of RAM.
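
For what it's worth, the wrapped readings above can be reinterpreted as
unsigned 32-bit values to recover the byte count actually being tracked (a
sketch only, and it assumes the counter has wrapped exactly once):

    wrapped = -1323553489        # mg09's memused reading above
    actual = wrapped % 2**32     # reinterpret as unsigned 32-bit: 2971413807 bytes
    print("%d bytes = %.2f GiB" % (actual, actual / 2.0**30))   # ~2.77 GiB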
>
>Are these client or server nodes? I'm a bit surprised that Lustre would
>be allocating > 2GB of RAM.
>Cheers, Andreas
>--
>Andreas Dilger
>Principal Software Engineer
>Cluster File Systems, Inc.
--
Jan Julian University of Alaska, ARSC mailto:[EMAIL PROTECTED]
(907) 450-8641 910 Yukon Drive, Suite 001 http://www.arsc.edu
Fax: 450-8605 Fairbanks, AK 99775-6020 USA
_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss