Hi Brian,

I tried to umount the OST (idx 4) and the server LBUGed.

I attach the LBUG output below.

Jan 22 17:32:30 storage08 kernel: Lustre: Failing over ddn_data-OST0004
Jan 22 17:32:30 storage08 kernel: Lustre: Skipped 1 previous similar message
Jan 22 17:32:30 storage08 kernel: Lustre: *** setting obd ddn_data-OST0004 device 'unknown-block(253,8)' read-only ***
Jan 22 17:32:30 storage08 kernel: Lustre: Skipped 1 previous similar message
Jan 22 17:32:30 storage08 kernel: Turning device dm-8 (0xfd00008) read-only
Jan 22 17:32:30 storage08 kernel: Lustre: ddn_data-OST0004: shutting down for failover; client state will be preserved.
Jan 22 17:32:30 storage08 kernel: LustreError: 8667:0:(lprocfs_status.c:779:lprocfs_free_client_stats()) ASSERTION(client_stat->nid_exp_ref_count == 0) failed:count 1
Jan 22 17:32:30 storage08 kernel: LustreError: 8667:0:(lprocfs_status.c:779:lprocfs_free_client_stats()) LBUG
Jan 22 17:32:30 storage08 kernel: Lustre: 8667:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 8667
Jan 22 17:32:30 storage08 kernel: Lustre: 8667:0:(linux-debug.c:185:libcfs_debug_dumpstack()) Skipped 14 previous similar messages
Jan 22 17:32:30 storage08 kernel: umount        R  running task       0  8667   8581                     (NOTLB)
Jan 22 17:32:30 storage08 kernel: 0000000000000246 000001025f0598c0 0000000000000002 ffffffff80462cc0
Jan 22 17:32:30 storage08 kernel:        ffffffff8045ecc0 ffffffff8013817a 0000003000000010 000001025f0599c8
Jan 22 17:32:30 storage08 kernel:        000001025f059908 000001025f0599d8
Jan 22 17:32:30 storage08 kernel: Call Trace:<ffffffff8016132d>{free_block+285} <ffffffff801490eb>{__kernel_text_address+26}
Jan 22 17:32:30 storage08 kernel:        <ffffffff801116f4>{show_trace+375} <ffffffff80111830>{show_stack+241}
Jan 22 17:32:30 storage08 kernel:        <ffffffffa065ab73>{:libcfs:lbug_with_loc+115} <ffffffffa06f1084>{:obdclass:lprocfs_free_client_stats+228}
Jan 22 17:32:30 storage08 kernel:        <ffffffff801af5d5>{remove_proc_entry+443} <ffffffffa06f1381>{:obdclass:lprocfs_free_per_client_stats+145}
Jan 22 17:32:30 storage08 kernel:        <ffffffffa09fc4bb>{:obdfilter:filter_cleanup+539} <ffffffffa06f8f66>{:obdclass:class_decref+1526}
Jan 22 17:32:30 storage08 kernel:        <ffffffff801339ef>{__wake_up+54} <ffffffffa06e9d7a>{:obdclass:obd_zombie_impexp_cull+154}
Jan 22 17:32:30 storage08 kernel:        <ffffffffa06f938d>{:obdclass:class_detach+701} <ffffffffa06fdf01>{:obdclass:class_process_config+6033}
Jan 22 17:32:30 storage08 kernel:        <ffffffffa07022d4>{:obdclass:class_manual_cleanup+2676}
Jan 22 17:32:30 storage08 kernel:        <ffffffffa0704f69>{:obdclass:lustre_end_log+1353} <ffffffffa06e520a>{:obdclass:class_name2dev+154}
Jan 22 17:32:30 storage08 kernel:        <ffffffffa070be50>{:obdclass:server_put_super+1232}
Jan 22 17:32:30 storage08 kernel:        <ffffffff80191c0f>{invalidate_list+49} <ffffffff80192c24>{dispose_list+192}
Jan 22 17:32:30 storage08 kernel:        <ffffffff80192f62>{invalidate_inodes+185} <ffffffff8017e641>{generic_shutdown_super+198}
Jan 22 17:32:30 storage08 kernel:        <ffffffff8017f2a0>{kill_anon_super+9} <ffffffff8017e562>{deactivate_super+95}
Jan 22 17:32:31 storage08 kernel:        <ffffffff801951a7>{sys_umount+965} <ffffffff80181bd8>{sys_newstat+17}
Jan 22 17:32:31 storage08 kernel:        <ffffffff80110d5d>{error_exit+0} <ffffffff80110236>{system_call+126}
Jan 22 17:32:31 storage08 kernel:       
Jan 22 17:32:31 storage08 kernel: LustreError: 137-5: UUID 'ddn_data-OST0004_UUID' is not available  for connect (stopping)
Jan 22 17:32:31 storage08 kernel: LustreError: Skipped 35 previous similar messages
Jan 22 17:32:31 storage08 kernel: LustreError: dumping log to /tmp/lustre-log.1232645550.8667
Jan 22 17:33:29 storage08 kernel: LustreError: 23499:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  r...@00000103d2408c00 x1609/t0 o8-><?>@<?>:0/0 lens 240/0 e 0 to 0 dl 1232646009 ref 1 fl Interpret:/0/0 rc -19/0
Jan 22 17:33:29 storage08 kernel: LustreError: 23499:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 1278 previous similar messages
Jan 22 17:33:48 storage08 kernel: LustreError: 137-5: UUID 'ddn_data-OST0004_UUID' is not available  for connect (stopping)
Jan 22 17:33:48 storage08 kernel: LustreError: Skipped 578 previous similar messages

Any clue whether this is related to the read-only state of OST0004?

Regards,

Wojciech

Wojciech Turek wrote:
Hi Brian,

Brian J. Murrell wrote:
On Thu, 2009-01-22 at 15:44 +0000, Wojciech Turek wrote:
  
Hello,
    

Hi,

  
The Lustre MDS reports the following error:
Jan 22 15:20:40 mds01.beowulf.cluster kernel: LustreError:
24680:0:(lov_request.c:692:lov_update_create_set()) error creating fid
0xeb79c9d sub-object on OST idx 4/1: rc = -28
    

-28 is ENOSPC.
 
  
Which I translate as meaning that one of the OSTs (index 4/1) is full and has
no space left on the device.
    

Yes.

  
The OSS seems to be consistent with this and says:
Jan 22 15:21:15 storage08.beowulf.cluster kernel: LustreError:
23507:0:(filter_io_26.c:721:filter_commitrw_write()) error starting
transaction: rc = -30
    

Hrm.  I'm not sure a -30 (EROFS) would translate to a -28 at the MDS.  I
think it would also be a -30.  So are you sure you are looking at
correlated messages?  The timestamps, if the two nodes' clocks are in sync,
also seem to indicate a lack of correlation, with 35s of disparity.

Perhaps there is an actual -28 in the OSS log prior to the -30 one?
  
Yes, you are right: there are plenty of messages with -30 in the logs, and they are probably not related to the -28.
Which I translate as: a client would like to write to an existing file
but cannot, because the file system is read-only.
    

Indeed.  But why is it read-only?  There should be an event in the OSS
log saying that it was turning the filesystem read-only.

  
The OST device is still mounted with the rw option.
    

Yeah.  That's just the state at mount time.  Lustre will set a device
read-only in the case of filesystem errors, as one example.


  
Now the main question is why Lustre thinks that OST(idx4) is full?
    

No, I think the main question is why is it read-only.  The full
situation may have been transient where it filled up momentarily and
then some objects were removed.  In any case, this is a secondary issue
and really only need be considered once the read-only situation is
understood.
  
Thank you for putting me on the right track. I found this in the syslog:
Jan 22 10:18:40 storage08.beowulf.cluster kernel: LDISKFS-fs error (device dm-8): mb_free_blocks: double-free of inode 16203779's block 688627718(bit 8198 in group 21015)
Jan 22 10:18:40 storage08.beowulf.cluster kernel: 
Jan 22 10:18:40 storage08.beowulf.cluster kernel: Remounting filesystem read-only

Does this mean that the file system may be corrupted? I am going to run fsck -f on this device and then try to mount it again; is that the right procedure?
I did not find any errors on my S2A9500 storage, so I am not sure when this corruption could have occurred.
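For the record, the procedure I plan to follow is roughly the following (assuming /dev/dm-8 is the underlying device of OST0004, as the "Turning device dm-8 read-only" message above suggests, and /mnt/ost4 is a hypothetical mount point):

```shell
# Make sure the OST is not mounted before checking it
umount /mnt/ost4            # hypothetical mount point for ddn_data-OST0004

# Read-only pass first, to see what e2fsck would change
e2fsck -fn /dev/dm-8        # -f: force check even if marked clean; -n: answer "no" to all fixes

# If the reported problems look sane, run a fixing pass
e2fsck -fp /dev/dm-8        # -p: fix only safe problems automatically; use -fy for a full repair

# Then try mounting the OST back into the file system
mount -t lustre /dev/dm-8 /mnt/ost4
```

The read-only pass first seems prudent, since the double-free error suggests on-disk metadata corruption and I'd like to see the extent of it before letting e2fsck modify anything.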


  
Is it possible that this OST has many orphaned objects which take up
all the available space?
    

That would be reflected in the df.  If you suspect there may be orphan
objects, though, you could run lfsck to verify and clean them up.
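A rough sketch of that lfsck workflow on Lustre of this era (the database paths and device names below are placeholders, not taken from this system):

```shell
# 1. Build the MDS database (on the MDS, with the MDT quiescent)
e2fsck -n -v --mdsdb /tmp/mdsdb /dev/mdtdev          # /dev/mdtdev is a placeholder

# 2. Build a database for each OST (on each OSS), referencing the MDS database
e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb4 /dev/ostdev

# 3. Run lfsck from a client with the file system mounted
lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb4 /mnt/lustre   # -n: report only
# Replace -n with -l to link orphans into lost+found, or -d to delete them
```

Running with -n first reports orphans without touching anything, so you can see how much space they account for before deciding to delete.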

  
Is there a way of reclaiming this free space?
    

If you mean orphaned OST objects, then lfsck.

b.
  
Cheers

Wojciech
  

_______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
