Hi,

Can anyone guide me in building Lustre from a pre-packaged Lustre
release? I'm using Ubuntu 7.10 and want to build Lustre using the
RHEL 2.6 RPMs available on my system. I'm referring to the HowTo on
the wiki, but it gives no detailed step-by-step procedure for building
Lustre from a pre-packaged release.

I'm in need of this.

Thanks and Regards,
Ashok Bharat
-----Original Message-----
From: [EMAIL PROTECTED] on behalf of [EMAIL PROTECTED]
Sent: Fri 3/14/2008 2:25 AM
To: [email protected]
Subject: Lustre-discuss Digest, Vol 26, Issue 36
 
Send Lustre-discuss mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        http://lists.lustre.org/mailman/listinfo/lustre-discuss
or, via email, send a message with subject or body 'help' to
        [EMAIL PROTECTED]

You can reach the person managing the list at
        [EMAIL PROTECTED]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Lustre-discuss digest..."


Today's Topics:

   1. Re: OSS not healty (Andreas Dilger)
   2. Re: e2scan for backup (Andreas Dilger)
   3. Howto map block devices to Lustre devices? (Chris Worley)
   4. Re: e2fsck mdsdb: DB_NOTFOUND (Aaron Knister)
   5. Re: e2fsck mdsdb: DB_NOTFOUND (Karen M. Fernsler)
   6. Re: Howto map block devices to Lustre devices? (Klaus Steden)


----------------------------------------------------------------------

Message: 1
Date: Thu, 13 Mar 2008 11:11:19 -0700
From: Andreas Dilger <[EMAIL PROTECTED]>
Subject: Re: [Lustre-discuss] OSS not healty
To: "Brian J. Murrell" <[EMAIL PROTECTED]>
Cc: [email protected]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=us-ascii

On Mar 13, 2008  13:44 +0100, Brian J. Murrell wrote:
> On Thu, 2008-03-13 at 12:34 +0100, Frank Mietke wrote:
> > Mar 13 06:17:31 chic2e24 kernel: [3068633.701448] attempt to access beyond 
> > end of device
> > Mar 13 06:17:31 chic2e24 kernel: [3068633.701454] sda: rw=1, 
> > want=11287722456, limit=7796867072
> 
> This is pretty self-explanatory.  Something tried to read beyond the end
> of the disk.  Something has a misunderstanding of how big the disk is.
> Is it possible that the disk format process was misled about the disk
> size during initialization?

Unlikely.

> Andreas, does mkfs do any bounds checking to verify the sanity of the
> mkfs request?  I.e. does it make sure that if/when you specify a number
> of blocks for a filesystem that that many block are available?

Yes, mke2fs will zero out the last ~128kB of the device to overwrite any
MD RAID signatures, and also verify that the device is as big as requested.

These kinds of errors are usually the result of corruption internal to the
filesystem, where some garbage is interpreted as a block number beyond the
end of the device.
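For scale, the want/limit values in the log (the kernel reports these in
512-byte sectors, assuming the usual bio accounting convention) show just how
far past the device the bogus block number points; a quick sketch:

```python
SECTOR = 512  # the kernel reports rw want/limit in 512-byte sectors

def sectors_to_gib(sectors):
    """Convert a 512-byte sector count to GiB."""
    return sectors * SECTOR / 2**30

# Values from the first log excerpt above: sda is about 3.6 TiB, yet the
# write targeted a position roughly 1.6 TiB past the end of the device --
# far beyond any plausible mkfs sizing error, consistent with a garbage
# block number read out of corrupted filesystem metadata.
limit = 7796867072    # device size in sectors
want = 11287722456    # attempted write position in sectors
overshoot_gib = sectors_to_gib(want - limit)
```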

> > Mar 13 06:17:31 chic2e24 kernel: [3068633.701555] attempt to access beyond 
> > end of device
> > Mar 13 06:17:31 chic2e24 kernel: [3068633.701558] sda: rw=1, 
> > want=25366292592, limit=7796867072
> > Mar 13 06:17:31 chic2e24 kernel: [3068633.701562] Buffer I/O error on 
> > device sda, logical block 3170786573
> > Mar 13 06:17:31 chic2e24 kernel: [3068633.701785] lost page write due to 
> > I/O error on sda
> > Mar 13 06:17:31 chic2e24 kernel: [3068633.702004] Aborting journal on 
> > device sda.
> 
> This is all just fallout error messages from the attempted read beyond
> EOF.

Time to unmount the filesystem and run a full e2fsck: "e2fsck -fp /dev/sdaNNN"

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



------------------------------

Message: 2
Date: Thu, 13 Mar 2008 11:22:48 -0700
From: Andreas Dilger <[EMAIL PROTECTED]>
Subject: Re: [Lustre-discuss] e2scan for backup
To: Jakob Goldbach <[EMAIL PROTECTED]>
Cc: Lustre User Discussion Mailing List
        <[email protected]>
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=us-ascii

On Mar 13, 2008  12:59 +0100, Jakob Goldbach wrote:
> On Wed, 2008-03-12 at 23:12 +0100, Brian J. Murrell wrote:
> > On Wed, 2008-03-12 at 14:50 -0600, Lundgren, Andrew wrote:
> > > How do you do the snapshot?
> > 
> > lvcreate -s
> 
> No need to freeze the filesystem while creating the snapshot to ensure a
> consistent filesystem on the snapshot ?

Yes, but this is handled internally by LVM and ext3 when the snapshot
is created.

> (xfs has a xfs_freeze function that does just this)

In fact, I was just discussing this with an XFS developer, and it is a
source of problems for them: if you do xfs_freeze before taking the LVM
snapshot, it will deadlock.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



------------------------------

Message: 3
Date: Thu, 13 Mar 2008 13:50:51 -0600
From: "Chris Worley" <[EMAIL PROTECTED]>
Subject: [Lustre-discuss] Howto map block devices to Lustre devices?
To: lustre-discuss <[email protected]>
Message-ID:
        <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=ISO-8859-1

I'm trying to deactivate some OSTs, but to find them I've been
searching through /var/log/messages, as in:

# ssh io2 grep -e sde -e sdf -e sdj -e sdk -e sdd /var/log/messages"*"
| grep Server
/var/log/messages:Mar 10 13:27:54 io2 kernel: Lustre: Server
ddnlfs-OST0035 on device /dev/sdf has started
/var/log/messages.1:Mar  4 16:02:13 io2 kernel: Lustre: Server
ddnlfs-OST0030 on device /dev/sdf has started
/var/log/messages.1:Mar  6 14:34:44 io2 kernel: Lustre: Server
ddnlfs-OST002e on device /dev/sdd has started
/var/log/messages.1:Mar  6 14:34:55 io2 kernel: Lustre: Server
ddnlfs-OST002f on device /dev/sde has started
/var/log/messages.1:Mar  6 14:35:16 io2 kernel: Lustre: Server
ddnlfs-OST0030 on device /dev/sdf has started
/var/log/messages.1:Mar  6 15:20:48 io2 kernel: Lustre: Server
ddnlfs-OST002f on device /dev/sde has started
/var/log/messages.1:Mar  6 16:08:38 io2 kernel: Lustre: Server
ddnlfs-OST002e on device /dev/sdd has started
/var/log/messages.1:Mar  6 16:08:43 io2 kernel: Lustre: Server
ddnlfs-OST0030 on device /dev/sdf has started
/var/log/messages.1:Mar  6 16:08:53 io2 kernel: Lustre: Server
ddnlfs-OST0034 on device /dev/sdj has started

Note that there isn't an entry for sdk (probably rotated out), and sdf
has two different names.

Is there a better way to map a Lustre device name to its Linux block
device?

I'm trying to cull out slow disks.  I'm hoping that just by
"deactivating" the device in lctl, it'll quit using it, and that's the
best way to get rid of a slow drive... correct?
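The grep output above can at least be condensed into a current
device-to-target map with a short script. A minimal sketch, assuming the
"Lustre: Server <target> on device <dev> has started" lines shown above and
that the most recent message for each device wins:

```python
import re

# Matches the kernel log lines shown above, e.g.
# "... Lustre: Server ddnlfs-OST0030 on device /dev/sdf has started"
STARTED = re.compile(r"Lustre: Server (\S+) on device (\S+) has started")

def device_map(log_lines):
    """Return {block device: target name}, letting later lines override
    earlier ones so re-registered devices (like sdf above) keep only
    their most recent target name."""
    mapping = {}
    for line in log_lines:
        m = STARTED.search(line)
        if m:
            target, dev = m.group(1), m.group(2)
            mapping[dev] = target
    return mapping
```

Feed it /var/log/messages.1 before /var/log/messages (oldest first) so the
latest registration wins. Devices whose entries have rotated away (like sdk
above) simply stay absent, which is exactly the weakness of the log-scraping
approach that the label-based answer below avoids.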

Thanks,

Chris


------------------------------

Message: 4
Date: Thu, 13 Mar 2008 16:50:04 -0400
From: Aaron Knister <[EMAIL PROTECTED]>
Subject: Re: [Lustre-discuss] e2fsck mdsdb: DB_NOTFOUND
To: Michelle Butler <[EMAIL PROTECTED]>
Cc: Andreas Dilger <[EMAIL PROTECTED]>, [EMAIL PROTECTED],
        [EMAIL PROTECTED], [EMAIL PROTECTED],   alex parga
        <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed; delsp=yes

What version of lustre/kernel is running on the problematic server?

On Mar 13, 2008, at 11:02 AM, Michelle Butler wrote:

> We got past that point by e2fsck the individual partitions first.
>
> But we are still having problems, I'm sorry to
> say.  We have an I/O server that is fine until
> we start Lustre.  It starts spewing Lustre call traces:
>
> Call
> Trace:<ffffffffa02fa089>{:libcfs:lcw_update_time+22}
> <ffffffffa03e06e3>{:ptlrpc:ptlrpc_main+1408}
>        <ffffffff8013327d>{default_wake_function+0}
> <ffffffffa03e0156>{:ptlrpc:ptlrpc_retry_rqbds+0}
>        <ffffffffa03e0156>{:ptlrpc:ptlrpc_retry_rqbds+0}
> <ffffffff80110ebb>{child_rip+8}
>        <ffffffffa03e0163>{:ptlrpc:ptlrpc_main+0}
> <ffffffff80110eb3>{child_rip+0}
>
> ll_ost_io_232 S 000001037d6bbee8     0 26764      1         26765  
> 26763 (L-TLB)
> 000001037d6bbe58 0000000000000046 0000000100000246 0000000000000003
>        0000000000000016 0000000000000001 00000104100bcb20  
> 0000000300000246
>        00000103f5470030 000000000001d381
> Call
> Trace:<ffffffffa02fa089>{:libcfs:lcw_update_time+22}
> <ffffffffa03e06e3>{:ptlrpc:ptlrpc_main+1408}
>        <ffffffff8013327d>{default_wake_function+0}
> <ffffffffa03e0156>{:ptlrpc:ptlrpc_retry_rqbds+0}
>        <ffffffffa03e0156>{:ptlrpc:ptlrpc_retry_rqbds+0}
> <ffffffff80110ebb>{child_rip+8}
>        <ffffffffa03e0163>{:ptlrpc:ptlrpc_main+0}
> <ffffffff80110eb3>{child_rip+0}
>
> ll_ost_io_233 S 00000103de847ee8     0 26765      1         26766  
> 26764 (L-TLB)
> 00000103de847e58 0000000000000046 0000000100000246 0000000000000001
>        0000000000000016 0000000000000001 000001040f83c620  
> 0000000100000246
>        00000103e627e030 000000000001d487
> Call
> Trace:<ffffffffa02fa089>{:libcfs:lcw_update_time+22}
> <ffffffffa03e06e3>{:ptlrpc:ptlrpc_main+1408}
>        <ffffffff8013327d>{default_wake_function+0}
> <ffffffffa03e0156>{:ptlrpc:ptlrpc_retry_rqbds+0}
>        <ffffffffa03e0156>{:ptlrpc:ptlrpc_retry_rqbds+0}
> <ffffffff80110ebb>{child_rip+8}
>        <ffffffffa03e0163>{:ptlrpc:ptlrpc_main+0}
> <ffffffff80110eb3>{child_rip+0}
>
> ll_ost_io_234 S 00000100c4353ee8     0 26766      1         26767  
> 26765 (L-TLB)
> 00000100c4353e58 0000000000000046 0000000100000246 0000000000000003
>        0000000000000016 0000000000000001 00000104100bcc60  
> 0000000300000246
>        00000103de81b810 000000000001d945
> Call
> Trace:<ffffffffa02fa089>{:libcfs:lcw_update_time+22}
> <ffffffffa03e06e3>{:ptlrpc:ptlrpc_main+1408}
>        <ffffffff8013327d>{default_wake_function+0}
> <ffffffffa03e0156>{:ptlrpc:ptlrpc_retry_rqbds+0}
>         
> <ffffffffa03e0156>{:ptlrpc:ptlrpc_retr???f?????????c?????????c??????
>                                                           
> Ks[F????????????
> <ffffffff8013327d>{default_wake_function+0}
> <ffffffffa03e0156>{:ptlrpc:ptlrpc_retry_rqbds+0}
>        <ffffffffa03e0156>{:ptl
>
> It then panics the kernel.. ??
>
> Michelle Butler
>
> At 02:39 AM 3/13/2008, Andreas Dilger wrote:
>> On Mar 12, 2008  06:44 -0500, Karen M. Fernsler wrote:
>>> I'm running:
>>>
>>> e2fsck -y -v --mdsdb mdsdb --ostdb osth3_1 /dev/mapper/27l4
>>>
>>> and getting:
>>>
>>> Pass 6: Acquiring information for lfsck
>>> error getting mds_hdr (3685469441:8) in
>> /post/cfg/mdsdb: DB_NOTFOUND: No matching key/data pair found
>>> e2fsck: aborted
>>>
>>> Any ideas how to get around this?
>>
>> Does "mdsdb" actually exist?  This should be created by first  
>> running:
>>
>> e2fsck --mdsdb mdsdb /dev/{mdsdevicename}
>>
>> before running your above command on the OST.
>>
>> Please also try specifying the absolute pathname for the mdsdb and  
>> ostdb
>> files.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
>
>
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
[EMAIL PROTECTED]






------------------------------

Message: 5
Date: Thu, 13 Mar 2008 15:51:22 -0500
From: "Karen M. Fernsler" <[EMAIL PROTECTED]>
Subject: Re: [Lustre-discuss] e2fsck mdsdb: DB_NOTFOUND
To: Aaron Knister <[EMAIL PROTECTED]>
Cc: Andreas Dilger <[EMAIL PROTECTED]>, [EMAIL PROTECTED],
        Michelle Butler <[EMAIL PROTECTED]>, [EMAIL PROTECTED],
        [EMAIL PROTECTED], alex parga <[EMAIL PROTECTED]>,
        [EMAIL PROTECTED]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=iso-8859-1

2.6.9-42.0.10.EL_lustre-1.4.10.1smp

This is a 2.6.9-42.0.10.EL kernel with lustre-1.4.10.1.

This has been working OK for almost a year.  We did try to
export this filesystem to another cluster over NFS before
we started seeing problems, but I don't know how related
that is, if at all.

We are now trying to dissect the problem by inspecting
the switch logs these nodes are connected to.

thanks,
-k

On Thu, Mar 13, 2008 at 04:50:04PM -0400, Aaron Knister wrote:
> What version of lustre/kernel is running on the problematic server?

-- 
Karen Fernsler Systems Engineer
National Center for Supercomputing Applications
ph: (217) 265 5249
email: [EMAIL PROTECTED]


------------------------------

Message: 6
Date: Thu, 13 Mar 2008 13:55:45 -0700
From: Klaus Steden <[EMAIL PROTECTED]>
Subject: Re: [Lustre-discuss] Howto map block devices to Lustre
        devices?
To: Chris Worley <[EMAIL PROTECTED]>,   lustre-discuss
        <[email protected]>
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain;       charset="US-ASCII"


Hi Chris,

Don't your Lustre volumes have a label on them?

On the one cluster I've got, the physical storage is shared with a number of
other systems, so the device information can change over time; hence I use
device labels in my /etc/fstab and friends.

Something like 'lustre-OST0000', 'lustre-OST0001' ... although when the
devices are actually mounted, they show up with their /dev node names.

Look through /proc/fs/lustre for Lustre volume names (they show up when
they're mounted), and you can winnow your list down by mounting by name,
checking the device ID, and removing it that way.

If you have a lot of devices on the same bus, it will likely take a bit for
the right one to be found, but it's there.

hth,
Klaus

On 3/13/08 12:50 PM, "Chris Worley" <[EMAIL PROTECTED]> did etch on stone
tablets:

> I'm trying to deactivate some OSTs, but to find them I've been
> searching through /var/log/messages, as in:
> 
> # ssh io2 grep -e sde -e sdf -e sdj -e sdk -e sdd /var/log/messages"*"
> | grep Server
> /var/log/messages:Mar 10 13:27:54 io2 kernel: Lustre: Server
> ddnlfs-OST0035 on device /dev/sdf has started
> /var/log/messages.1:Mar  4 16:02:13 io2 kernel: Lustre: Server
> ddnlfs-OST0030 on device /dev/sdf has started
> /var/log/messages.1:Mar  6 14:34:44 io2 kernel: Lustre: Server
> ddnlfs-OST002e on device /dev/sdd has started
> /var/log/messages.1:Mar  6 14:34:55 io2 kernel: Lustre: Server
> ddnlfs-OST002f on device /dev/sde has started
> /var/log/messages.1:Mar  6 14:35:16 io2 kernel: Lustre: Server
> ddnlfs-OST0030 on device /dev/sdf has started
> /var/log/messages.1:Mar  6 15:20:48 io2 kernel: Lustre: Server
> ddnlfs-OST002f on device /dev/sde has started
> /var/log/messages.1:Mar  6 16:08:38 io2 kernel: Lustre: Server
> ddnlfs-OST002e on device /dev/sdd has started
> /var/log/messages.1:Mar  6 16:08:43 io2 kernel: Lustre: Server
> ddnlfs-OST0030 on device /dev/sdf has started
> /var/log/messages.1:Mar  6 16:08:53 io2 kernel: Lustre: Server
> ddnlfs-OST0034 on device /dev/sdj has started
> 
> Note that there isn't an entry for sdk (probably rotated out), and sdf
> has two different names.
> 
> Is there a better way to map a Lustre device name to its Linux
> block device?
> 
> I'm trying to cull out slow disks.  I'm hoping that just by
> "deactivating" the device in lctl, it'll quit using it, and that's the
> best way to get rid of a slow drive... correct?
> 
> Thanks,
> 
> Chris
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss



------------------------------

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss


End of Lustre-discuss Digest, Vol 26, Issue 36
**********************************************

