Re: [lustre-discuss] Migrating files doesn't free space on the OST
On Wed, Jan 16, 2019 at 04:25:25PM +, Jason Williams wrote:
>I am trying to migrate files I know are not in use off of the full OST that I have using lfs migrate. I have verified up and down that the files I am moving are on that OST and that after the migrate lfs getstripe indeed shows they are no longer on that OST since it's disabled in the MDS.
>
>The problem is, the used space on the OST is not going down.
>
>I see one of at least two issues:
>
>- the OST is just not freeing the space for some reason or another (I don't know)

if you are using an older Lustre version (eg. IEEL) then you may have to re-enable the OST on the MDS to allow deletes to occur on the OST. then check no new files went there while it was enabled, and possibly loop and repeat. the newer ways of disabling file creation on OSTs in recent Lustre versions don't have this problem.

>- Or someone is writing to existing files just as fast as I am clearing the data (possible, but kind of hard to find)
>
>Is there possibly something else I am missing? Also, does anyone know a good way to see if some client is writing to that OST and determine who it is if it's more probable that that is what is going on?

perhaps check 'lsof' on every client. if a client has a file open then it can't be deleted.

cheers,
robin
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
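the enable/check/repeat loop described above might look something like this on the MDS. this is a hedged sketch: the device index (11) and target name (testfs-OST0007) are placeholders for your own, and `max_create_count=0` is the newer mechanism that stops object creation without blocking destroys.

```shell
# On the MDS. Device index and fsname are hypothetical; adjust to your site.
# Older releases: a deactivated OSC on the MDS also blocks object destroys,
# so briefly re-activate it to let pending unlinks drain:
lctl dl | grep osc            # find the OSC device index for the full OST
lctl --device 11 activate     # allow deletes to reach the OST again
sleep 60                      # let destroys drain
lctl --device 11 deactivate   # stop new files landing there
# then re-check with 'lfs getstripe' that nothing new arrived, and repeat

# Newer releases: stop creates only; deletes keep working throughout:
lctl set_param osp.testfs-OST0007-osc-MDT0000.max_create_count=0
```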
Re: [lustre-discuss] Lustre Sizing
On Tue, Jan 01, 2019 at 01:05:22PM +0530, ANS wrote:
>So what could be the reason for this variation of the size.

with our ZFS 0.7.9 + Lustre 2.10.6 the "lfs df" numbers seem to be the same as those from "zfs list" (not "zpool list"). so I think your question is more about ZFS than Lustre.

the number of devices in each ZFS vdev, raid level, what size files you write, ashift, recordsize, ... all will affect the total space available. see attached for an example. ZFS's space estimates are also pessimistic as it doesn't know what size files are going to be written.

if you want more accurate numbers then perhaps create a small but realistic zpool and zfs filesystem (using say, 1G files as devices) and then fill it up with files representative of your workload and see how many fit on. I just filled them up with large dd's to make the above graph, so YMMV.

cheers,
robin
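the file-backed test pool suggested above can be built roughly like this (a sketch, not a recipe: needs root and ZFS installed, and the pool name, geometry, ashift and recordsize are placeholders you should match to your real OSTs):

```shell
# Build a throwaway raidz2 pool out of 1G sparse files and see how much
# usable space ZFS reports for this geometry.
for i in $(seq 0 9); do truncate -s 1G /var/tmp/vdev$i; done
zpool create -o ashift=12 testpool raidz2 /var/tmp/vdev?
zfs set recordsize=1M testpool
zfs list testpool                 # the "lfs df"-style numbers come from here

# fill with files representative of your workload, then compare USED/AVAIL:
dd if=/dev/zero of=/testpool/bigfile bs=1M count=512
zfs list testpool

# clean up
zpool destroy testpool && rm -f /var/tmp/vdev?
```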
Re: [lustre-discuss] Lustre traffic slow on OPA fabric network
Hi Kurt,

On Thu, Jul 12, 2018 at 02:36:49PM -0400, Kurt Strosahl wrote:
> That's really helpful. The version on the servers is IEEL 2.5.42, while the routers and OPA nodes are all running 2.10.4... We've been looking at upgrading our old system to 2.10 or 2.11.

just an update on this. we moved our old 2.5 IEEL lustre to 2.10.4 (still rhel6.x) but sadly it didn't solve our lnet routing problem. sorry for the bad advice.

> I checked the opa clients and the lnet routers, they all use the same parameters that you do except for the map_on_demand (which our system defaults to 256).

we eventually realised that with the "new" ways of setting ko2iblnd and lnet options we could configure each card (qib/mlnx, opa) separately and have them "optimal", but it still doesn't work without errors so far. haven't 100% ruled out shonky FINSTAR opa optical cables yet, but it seems quite unlikely.

did you make any progress?

cheers,
robin

>>w/r,
>>Kurt
>>
>>- Original Message -
>>From: "Robin Humble"
>>To: "Kurt Strosahl"
>>Cc: lustre-discuss@lists.lustre.org
>>Sent: Tuesday, July 10, 2018 5:03:30 AM
>>Subject: Re: [lustre-discuss] Lustre traffic slow on OPA fabric network
>>
>>Hi Kurt,
>>
>>On Tue, Jul 03, 2018 at 02:59:22PM -0400, Kurt Strosahl wrote:
>>> I've been seeing a great deal of slowness from clients on an OPA network accessing lustre through lnet routers. The nodes take very long to complete things like lfs df, and show lots of dropped / reestablished connections. The OSS systems show this as well, and occasionally will report that all routes are down to a host on the omnipath fabric. They also show large numbers of bulk callback errors. The lnet routers show large numbers of PUT_NACK messages, as well as Abort reconnection messages for nodes on the OPA fabric.
>>
>>I don't suppose you're talking to a super-old Lustre version via the lnet routers?
>>
>>we see excellent performance OPA to IB via lnet routers with 2.10.x clients and 2.9 servers, but when we try to talk to IEEL 2.5.41 servers then we see pretty much exactly the symptoms you describe.
>>
>>strangely direct mounts of old lustre on new clients on IB work ok, but not via lnet routers to OPA. old lustre to new clients on tcp networks are ok. lnet self tests OPA to IB also work fine, it's just when we do the actual mounts... anyway, we are going to try and resolve the problem by updating the IEEL to 2.9 or 2.10
>>
>>hmm, now that I think of it, we did have to tweak the ko2iblnd options a lot on the lnet router to get it this stable. I forget the symptoms we were seeing though, sorry. we found the minimum common denominator settings between the IB network and the OPA, and tuned ko2iblnd on the lnet routers down to that. if it finds one OPA card then Lustre imposes an aggressive OPA config on all IB networks which made our mlx4 cards on a ipath/qib fabric unhappy.
>>
>>FWIW, for our hardware combo, ko2iblnd options are
>>
>>  options ko2iblnd-opa peer_credits=8 peer_credits_hiw=0 credits=256 concurrent_sends=0 ntx=512 map_on_demand=0 fmr_pool_size=512 fmr_flush_trigger=384 fmr_cache=1 conns_per_peer=1
>>
>>I don't know what most of these do, so please take with a grain of salt.
>>
>>cheers,
>>robin
Re: [lustre-discuss] Lustre traffic slow on OPA fabric network
Hi Kurt,

On Tue, Jul 03, 2018 at 02:59:22PM -0400, Kurt Strosahl wrote:
> I've been seeing a great deal of slowness from clients on an OPA network accessing lustre through lnet routers. The nodes take very long to complete things like lfs df, and show lots of dropped / reestablished connections. The OSS systems show this as well, and occasionally will report that all routes are down to a host on the omnipath fabric. They also show large numbers of bulk callback errors. The lnet routers show large numbers of PUT_NACK messages, as well as Abort reconnection messages for nodes on the OPA fabric.

I don't suppose you're talking to a super-old Lustre version via the lnet routers?

we see excellent performance OPA to IB via lnet routers with 2.10.x clients and 2.9 servers, but when we try to talk to IEEL 2.5.41 servers then we see pretty much exactly the symptoms you describe.

strangely direct mounts of old lustre on new clients on IB work ok, but not via lnet routers to OPA. old lustre to new clients on tcp networks are ok. lnet self tests OPA to IB also work fine, it's just when we do the actual mounts... anyway, we are going to try and resolve the problem by updating the IEEL to 2.9 or 2.10

hmm, now that I think of it, we did have to tweak the ko2iblnd options a lot on the lnet router to get it this stable. I forget the symptoms we were seeing though, sorry. we found the minimum common denominator settings between the IB network and the OPA, and tuned ko2iblnd on the lnet routers down to that. if it finds one OPA card then Lustre imposes an aggressive OPA config on all IB networks which made our mlx4 cards on a ipath/qib fabric unhappy.

FWIW, for our hardware combo, ko2iblnd options are

  options ko2iblnd-opa peer_credits=8 peer_credits_hiw=0 credits=256 concurrent_sends=0 ntx=512 map_on_demand=0 fmr_pool_size=512 fmr_flush_trigger=384 fmr_cache=1 conns_per_peer=1

I don't know what most of these do, so please take with a grain of salt.

cheers,
robin
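for reference, options like those quoted above normally live in a modprobe.d fragment on the lnet routers; the `ko2iblnd-opa` alias is the one Lustre's module probe helper applies to OPA ports, while plain `ko2iblnd` options cover the IB side. a sketch only — the file name is arbitrary and the values are the ones from the mail, which were tuned for that site's hardware:

```shell
# /etc/modprobe.d/ko2iblnd.conf (hypothetical path) on the lnet routers
options ko2iblnd-opa peer_credits=8 peer_credits_hiw=0 credits=256 \
    concurrent_sends=0 ntx=512 map_on_demand=0 fmr_pool_size=512 \
    fmr_flush_trigger=384 fmr_cache=1 conns_per_peer=1
```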
Re: [lustre-discuss] flock vs localflock
Hi Darby,

On Thu, Jul 05, 2018 at 09:26:36PM +, Vicker, Darby (JSC-EG311) wrote:
>Also, the ldlm processes lead us to looking at flock vs localflock. On previous generations of our LFS's, we used localflock. But on the current LFS, we decided to try flock instead. This LFS has been in production for a couple years with no obvious problems due to flock but we decided to drop back to localflock as a precaution for now. We need to do a more controlled test but this does seem to help. What are other sites using for locking parameters?

we use flock for /home and the large scratch filesystem. have done for probably 10 years. localflock for the read-only software installs in /apps, and no locking for the OS image (overlayfs with ramdisk upper, read-only Lustre lower). we are all ZFS and 2.10.4 too.

I don't think we have much in the way of flock user codes, so I can't actually recall any issues along those lines.

the most common MDS abusing load we see is jobs across multiple nodes appending to the same (by definition rubbish) output file. the write lock bounces between nodes and causes high MDS load, poor performance for those client nodes, and makes things a bit slower for everyone. I look for these simply with 'lsof' and correlate across nodes.

HTH

cheers,
robin
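for context, the kind of user code that cares about the flock/localflock choice just takes advisory locks around writes. with "-o localflock" the lock below only excludes other processes on the same client node; with "-o flock" it is coherent cluster-wide (at some DLM cost). a minimal sketch, demonstrated here on a local temporary file rather than Lustre:

```python
import fcntl
import tempfile

# Take an exclusive advisory lock around an append, as an flock-using
# application would on a shared file.
tmp = tempfile.NamedTemporaryFile(mode="w+", delete=False)
fcntl.flock(tmp.fileno(), fcntl.LOCK_EX)   # blocks until we hold the lock
tmp.write("one writer at a time\n")
tmp.flush()
fcntl.flock(tmp.fileno(), fcntl.LOCK_UN)   # release before closing
tmp.close()
```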
Re: [lustre-discuss] dealing with maybe dead OST
Hi Andreas,

On Wed, Jun 20, 2018 at 05:39:33PM +, Andreas Dilger wrote:
>On Jun 19, 2018, at 09:33, Robin Humble wrote:
>> is there a way to mv files when their OST is unreachable?
>> ...
>> the only thing I've thought of seems pretty out there... mount the MDT as ldiskfs and mv the affected files into the shadow tree at the ldiskfs level. ie. with lustre running and mounted, create an empty shadow tree of all dirs under eg. /lustre/shadow/, and then at the ldiskfs level on the MDT:
>>   for f in ; do
>>     mv /mnt/mdt0/ROOT/$f /mnt/mdt0/ROOT/shadow/$f
>>   done
>>
>> would that work?
>
>This would work to some degree, but the "link" xattr on each file would not be updated, so "lfs fid2path" would be broken until a full LFSCK is run.

although as you say, it turns out the rename() approach at the client level will work fine, it's still good to know that Lustre is flexible and robust enough for some crazy stuff to work if it had to :)

>> alternatively, should we just unlink all the currently dead files from lustre now, and then if the OST comes back can we reconstruct the paths and filenames from the FID in xattrs on the revived OST? I suspect unlink is final though and this wouldn't work... ?
>
>That would be possible, but overly complex, since the inodes would be removed from the MDT and you'd need to reconstruct them with LFSCK and find the names, as LFSCK would dump them all into $MNT/.lustre/lost+found.
>
>> we can also take an lvm snapshot of the MDT and refer to that later I suppose, but I'm not sure how that might help us.
>
>It should be possible to copy the unlinked files from the backup MDT to the current MDT (via ldiskfs), along with an LFSCK run to rebuild the OI files. It is always a good idea to have an MDT device-level backup before you do anything drastic like this. However, for the meantime I think that renaming the broken files to a root-only directory is the safest.

thanks (as always) for all the detailed explanations. much appreciated.

cheers,
robin
Re: [lustre-discuss] lctl ping node28@o2ib report Input/output error
On Tue, Jun 26, 2018 at 04:05:14PM +0800, yu sun wrote:
>hi all:
> I want to build a lustre storage system, and I found not all of the machines are in the same sub-network, and they can't lctl ping each other. the details are listed below:
>
>root@ml-storage-ser30.nmg01:~$ lctl list_nids
>10.82.145.2@o2ib
>root@ml-storage-ser30.nmg01:~$ lctl ping node28@o2ib
>failed to ping 10.82.143.202@o2ib: Input/output error
>root@ml-storage-ser30.nmg01:~$

what does 'lctl list_nids' say on node28? also disable iptables everywhere.

cheers,
robin
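the checks suggested above amount to verifying the NIDs and connectivity from both ends. a hedged sketch (the NID shown is the one from the mail; substitute your own, and treat the firewall commands as a temporary test only):

```shell
# Run on BOTH nodes: each side must report the NID the other expects,
# and must be able to ping back the other direction.
lctl list_nids                 # what this node thinks its NIDs are
lctl ping 10.82.145.2@o2ib     # e.g. from node28 back to ser30

# rule out packet filtering while testing (re-enable afterwards):
systemctl stop firewalld       # or flush iptables rules on older systems
```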
Re: [lustre-discuss] dealing with maybe dead OST
On Wed, Jun 20, 2018 at 10:20:09AM -0400, Robin Humble wrote:
>On Tue, Jun 19, 2018 at 08:54:53PM +, Cowe, Malcolm J wrote:
>>Would using hard links work, instead of mv?

ah. success! looks like it's just that gnu 'mv' and 'ln' are way too smart for their own good.

you got me thinking... what are 'mv' and 'ln' doing lstat() for anyway? so I wrote a few lines of C and stdio's rename() "just works" on the client, even when the OST is disabled (as it damn well should). too easy...

happily python's os.rename() works too ('cos I am lazy)

whoo! no need to mess with the MDT. that's a relief. thanks :)

cheers,
robin
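the few lines referred to above amount to calling rename(2) directly, skipping the lstat() of the source that makes GNU 'mv' fail against a dead OST. a sketch of the python version, demonstrated here on a local filesystem (on Lustre the same call is a pure MDS operation, so it succeeds even when the file's OST is unreachable):

```python
import os
import tempfile

# Move a file into a "shadow" directory with a bare rename(2) --
# no lstat() of the source first, unlike GNU mv.
root = tempfile.mkdtemp()
shadow = os.path.join(root, "shadow")
os.mkdir(shadow)

victim = os.path.join(root, "some_file")
open(victim, "w").close()

os.rename(victim, os.path.join(shadow, "some_file"))
```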
Re: [lustre-discuss] dealing with maybe dead OST
Hi Malcolm,

thanks for replying.

On Tue, Jun 19, 2018 at 08:54:53PM +, Cowe, Malcolm J wrote:
>Would using hard links work, instead of mv?

hmm, interesting idea, but no:

  # ln some_file /lustre/shadow/some_file
  ln: failed to access 'some_file': Cannot send after transport endpoint shutdown

ln is trying to lstat() which fails. I think almost all client operations are going to fail with a deactivated/down OST. things like 'lfs getstripe' (pure MDS ops) work ok.

or did you mean doing hard links on the MDT? unless there's a purely MDS lustre tool to do a mv/rename operation on the MDT, then I think the only option is to mess around with the low level stuff on the MDT when it's mounted as ldiskfs and hope I don't break too much...

there used to be a 'lfs mv' (now 'lfs migrate') but that isn't quite the mv operation I'm after.

any advice or war stories (especially "this is a waste of your time - it will never work because of X,Y,Z") would be much appreciated :)

time to read more of the lustre manual now...

cheers,
robin

>Malcolm.
>
>On 20/6/18, 1:34 am, "lustre-discuss on behalf of Robin Humble" rjh+lus...@cita.utoronto.ca wrote:
>
>Hi,
>
>so we've maybe lost 1 OST out of a filesystem with 115 OSTs. we may still be able to get the OST back, but it's been a month now so there's pressure to get the cluster back and working and leave the files missing for now...
>
>the complication is that because the OST might come back to life we would like to avoid the users rm'ing their broken files and potentially deleting them forever.
>
>lustre is 2.5.41 ldiskfs centos6.x x86_64.
>
>ideally I think we'd move all the ~2M files on the OST to a root access only "shadow" directory tree in lustre that's populated purely with files from the dead OST. if we manage to revive the OST then these can magically come back to life and we can mv them back into their original locations.
>
>but currently
>  mv: cannot stat 'some_file': Cannot send after transport endpoint shutdown
>the OST is deactivated on the client. the client hangs if the OST isn't deactivated. the OST is still UP & activated on the MDS.
>
>is there a way to mv files when their OST is unreachable?
>
>seems like mv is an MDT operation so it should be possible somehow?
>
>the only thing I've thought of seems pretty out there... mount the MDT as ldiskfs and mv the affected files into the shadow tree at the ldiskfs level. ie. with lustre running and mounted, create an empty shadow tree of all dirs under eg. /lustre/shadow/, and then at the ldiskfs level on the MDT:
>  for f in ; do
>    mv /mnt/mdt0/ROOT/$f /mnt/mdt0/ROOT/shadow/$f
>  done
>
>would that work?
>maybe we'd also have to rebuild OI's and lfsck - something along the lines of the MDT restore procedure in the manual. hopefully that would all work with an OST deactivated.
>
>alternatively, should we just unlink all the currently dead files from lustre now, and then if the OST comes back can we reconstruct the paths and filenames from the FID in xattrs on the revived OST? I suspect unlink is final though and this wouldn't work... ?
>
>we can also take an lvm snapshot of the MDT and refer to that later I suppose, but I'm not sure how that might help us.
>
>as you can probably tell I haven't had to deal with this particular situation before :)
>
>thanks for any help.
>
>cheers,
>robin
[lustre-discuss] dealing with maybe dead OST
Hi,

so we've maybe lost 1 OST out of a filesystem with 115 OSTs. we may still be able to get the OST back, but it's been a month now so there's pressure to get the cluster back and working and leave the files missing for now...

the complication is that because the OST might come back to life we would like to avoid the users rm'ing their broken files and potentially deleting them forever.

lustre is 2.5.41 ldiskfs centos6.x x86_64.

ideally I think we'd move all the ~2M files on the OST to a root access only "shadow" directory tree in lustre that's populated purely with files from the dead OST. if we manage to revive the OST then these can magically come back to life and we can mv them back into their original locations.

but currently

  mv: cannot stat 'some_file': Cannot send after transport endpoint shutdown

the OST is deactivated on the client. the client hangs if the OST isn't deactivated. the OST is still UP & activated on the MDS.

is there a way to mv files when their OST is unreachable?

seems like mv is an MDT operation so it should be possible somehow?

the only thing I've thought of seems pretty out there... mount the MDT as ldiskfs and mv the affected files into the shadow tree at the ldiskfs level. ie. with lustre running and mounted, create an empty shadow tree of all dirs under eg. /lustre/shadow/, and then at the ldiskfs level on the MDT:

  for f in ; do
    mv /mnt/mdt0/ROOT/$f /mnt/mdt0/ROOT/shadow/$f
  done

would that work?
maybe we'd also have to rebuild OI's and lfsck - something along the lines of the MDT restore procedure in the manual. hopefully that would all work with an OST deactivated.

alternatively, should we just unlink all the currently dead files from lustre now, and then if the OST comes back can we reconstruct the paths and filenames from the FID in xattrs on the revived OST? I suspect unlink is final though and this wouldn't work... ?

we can also take an lvm snapshot of the MDT and refer to that later I suppose, but I'm not sure how that might help us.

as you can probably tell I haven't had to deal with this particular situation before :)

thanks for any help.

cheers,
robin
Re: [lustre-discuss] High MDS load, but no activity
Hi Kevin,

On Thu, Jul 27, 2017 at 08:18:04AM -0400, Kevin M. Hildebrand wrote:
>We recently updated to Lustre 2.8 on our cluster, and have started seeing some unusual load issues. Last night our MDS load climbed to well over 100, and client performance dropped to almost zero. Initially this appeared to be related to a number of jobs that were doing large numbers of opens/closes, but even after killing those jobs, the MDS load did not recover.
>
>Looking at stats in /proc/fs/lustre/mdt/scratch-MDT/exports showed little to no activity on the MDS. Looking at iostat showed almost no disk activity to the MDT (or to any device, for that matter), and minimal IO wait. Memory usage (the machine has 128GB) showed over half of that memory free.

sounds like VM spinning to me. check /proc/zoneinfo, /proc/vmstat etc.

do you have zone_reclaim_mode=0? that's an oldie, but important to have set to zero.

  sysctl vm.zone_reclaim_mode

failing that (and assuming you have a 2 or more numa zone server) I would guess it's all the zone affinity stuff in lustre these days. you can turn most of it off with a modprobe option

  options libcfs cpu_npartitions=1

what happens by default is that a bunch of lustre threads are bound to numa zones and preferentially and aggressively allocate kernel ram in those zones. in practice this usually means that the zone where the IB card is physically attached fills up, and then the machine is (essentially) out of ram and spinning hard trying to reclaim, even though the ram in the other zone(s) is almost all unused.

I tried to talk folks out of having affinity on by default in https://jira.hpdd.intel.com/browse/LU-5050 but didn't succeed. even if it wasn't unstable to have affinity on, IMHO having 2x the ram available for caching on the MDS and OSS's is #1, and tiny performance increases from having that ram next to the IB card are a distant #2.

cheers,
robin

>I eventually ended up unmounting the MDT and failing it over to a backup MDS, which promptly recovered and now has a load of near zero.
>
>Has anyone seen this before? Any suggestions for what I should look at if this happens again?
>
>Thanks!
>Kevin
>
>--
>Kevin Hildebrand
>University of Maryland, College Park
>Division of IT
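the VM checks suggested above can be done quickly from the shell (standard Linux procfs paths; what counts as "spinning" is a judgment call, but one nearly-empty zone while others sit mostly free is the classic signature):

```shell
sysctl vm.zone_reclaim_mode            # should print 0

# per-NUMA-zone free pages: look for one starved zone next to idle ones
grep -E '^Node|nr_free_pages' /proc/zoneinfo

# reclaim/scan activity over time: rapidly climbing counters while the
# workload is idle points at the kernel fighting itself
grep -E 'pgscan|pgsteal' /proc/vmstat
```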
Re: [lustre-discuss] seclabel
On Tue, May 23, 2017 at 08:08:54PM +, Dilger, Andreas wrote:
>On May 19, 2017, at 08:47, Robin Humble <rjh+lus...@cita.utoronto.ca> wrote:
>> On Wed, May 17, 2017 at 02:37:31PM +, Sebastien Buisson wrote:
>>> On 17 May 2017 at 16:16, Robin Humble <rjh+lus...@cita.utoronto.ca> wrote:
>>>> I took a gander at the source and noticed that llite/xattr.c deliberately filters out 'security.capability' and returns 0/-ENODATA for setcap/getcap, which is indeed what strace sees. so setcap/getcap is never even sent to the MDS.
>>>>
>>>> if I remove that filter (see patch on lustre-devel) then setcap/getcap works ->
>> ...
>>>> 'b15587' is listed as the reason for the filtering. I don't know what that refers to. is it still relevant?
>>>
>>> b15587 refers to the old Lustre Bugzilla tracking tool: https://projectlava.xyratex.com/show_bug.cgi?id=15587
>>>
>>> Reading the discussion in the ticket, supporting xattr at the time of Lustre 1.8 and 2.0 was causing issues on the MDS side in some situations. So it was decided to discard the security.capability xattr on the Lustre client side. I think Andreas might have some insight, as he apparently participated in b15587.
>>
>> my word that's a long time ago... I don't see much in the way of jira tickets about getxattr issues on the MDS in recent times, and they're much more heavily used these days, so I hope that particular problem has long since been fixed.
>>
>> should I open a jira ticket to track re-enabling of security.capabilities?

LU-9562

thanks for everyone's help!

>I don't recall the details of b=15587 off the top of my head, but the high-level issue is that the security labels added a significant performance overhead, since they were retrieved on every file access, but not cached on the client, even if most systems never used them.
>
>Seagate implemented the client-side xattr cache for Lustre 2.5, so this should work a lot better these days. I'm not 100% positive if we also cache negative xattr lookups or not, so this would need some testing/tracing to see if it generates a large number of RPCs.

fair enough.

cheers,
robin
Re: [lustre-discuss] seclabel
Hi Sebastien,

On Wed, May 17, 2017 at 02:37:31PM +, Sebastien Buisson wrote:
> On 17 May 2017 at 16:16, Robin Humble <rjh+lus...@cita.utoronto.ca> wrote:
>> I took a gander at the source and noticed that llite/xattr.c deliberately filters out 'security.capability' and returns 0/-ENODATA for setcap/getcap, which is indeed what strace sees. so setcap/getcap is never even sent to the MDS.
>>
>> if I remove that filter (see patch on lustre-devel) then setcap/getcap works ->
>> ...
>> 'b15587' is listed as the reason for the filtering. I don't know what that refers to. is it still relevant?
>
>b15587 refers to the old Lustre Bugzilla tracking tool: https://projectlava.xyratex.com/show_bug.cgi?id=15587
>
>Reading the discussion in the ticket, supporting xattr at the time of Lustre 1.8 and 2.0 was causing issues on the MDS side in some situations. So it was decided to discard the security.capability xattr on the Lustre client side. I think Andreas might have some insight, as he apparently participated in b15587.

my word that's a long time ago... I don't see much in the way of jira tickets about getxattr issues on the MDS in recent times, and they're much more heavily used these days, so I hope that particular problem has long since been fixed.

should I open a jira ticket to track re-enabling of security.capabilities?

>In any case, it is important to make clear that file capabilities, the feature you want to use, are completely distinct from SELinux. On the one hand, Capabilities are a Linux mechanism to refine permissions granted to privileged processes, by dividing the privileges traditionally associated with superuser into distinct units (known as capabilities). On the other hand, SELinux is the Linux implementation of Mandatory Access Control. Both Capabilities and SELinux rely on values stored in file extended attributes, but this is the only thing they have in common.

10-4. thanks.

'ls --color' requests the security.capability xattr so this would be heavily accessed. do you think this is handled well enough currently to not affect performance significantly? setxattr would be minimal and not performance critical, unlike with eg. selinux and creat.

cheers,
robin
Re: [lustre-discuss] seclabel
I setup a couple of VMs with 2.9 clients and servers (ldiskfs) and unfortunately setcap/getcap are still unhappy - same as with my previous 2.9 clients with 2.8 servers (ZFS). hmm.

I took a gander at the source and noticed that llite/xattr.c deliberately filters out 'security.capability' and returns 0/-ENODATA for setcap/getcap, which is indeed what strace sees. so setcap/getcap is never even sent to the MDS.

if I remove that filter (see patch on lustre-devel) then setcap/getcap works ->

  # df .
  Filesystem              1K-blocks  Used Available Use% Mounted on
  10.122.1.5@tcp:/test8     4797904 33992   4491480   1% /mnt/test8
  # touch blah
  # setcap cap_net_admin,cap_net_raw+p blah
  # getcap blah
  blah = cap_net_admin,cap_net_raw+p

and I also tested that the 'ping' binary run as an unprivileged user works from lustre. success!

'b15587' is listed as the reason for the filtering. I don't know what that refers to. is it still relevant?

cheers,
robin
Re: [lustre-discuss] seclabel
Hi Eli et al,

>> On 15 May 2017 at 14:39, E.S. Rosenberg wrote:
>> Hi Robin,
>> Did you ever solve this?
>> We are considering trying root-on-lustre but that would be a deal-breaker.

no. instead I started down the track of layering overlayfs on top of lustre. tmpfs (used by overlayfs's upper layer) has a working seclabel mount option. so I just 'copy up' the 3 or 4 exe's that have seclabels, 'setcap' them with the right label, and they work fine. I'm not sure overlayfs is going to work out though, so I'd really like seclabel in lustre.

On Tue, May 16, 2017 at 08:17:48AM +, Sebastien Buisson wrote:
>From Lustre 2.8, we have basic support of SELinux on the Lustre client side. It means Lustre stores the security context of files in extended attributes. In this way Lustre supports seclabel. In Lustre 2.9, an additional enhancement for SELinux support was landed.
>
>Which version are you using?

2.9 clients, 2.8 servers on ZFS. centos7 x86_64 everywhere. sestatus disabled everywhere. zfs has xattr=sa on osts, mdt, mgs

Andreas wrote (a while ago):
>> I try to stay away from that myself, but newer Lustre clients support SELinux and similar things. You probably need to strace and/or collect some kernel debug logs (maybe with debug=-1 set) to see where the error is being generated.

a debug=-1 trace is here -> https://rjh.org/~rjh/lustre/dk.log.-1.txt.gz

command line was ->

  lctl set_param debug=-1 ; usleep 5; lctl clear; usleep 5 ; /usr/sbin/setcap cap_net_admin,cap_net_raw+p /mnt/oneSIS-overlay/lowerdir/usr/bin/ping ; /usr/sbin/getcap /mnt/oneSIS-overlay/lowerdir/usr/bin/ping ; lctl dk /lfs/data0/system/log/dk.log.-1 ; lctl set_param debug='ioctl neterror warning error emerg ha config console lfsck'

/mnt/oneSIS-overlay/lowerdir is the lustre root filesystem image (usually mounted read-only, but read-write for this debugging)

expected output is nothing for setcap. expected output for getcap is

  # getcap /mnt/oneSIS-overlay/lowerdir/usr/bin/ping
  /mnt/oneSIS-overlay/lowerdir/usr/bin/ping = cap_net_admin,cap_net_raw+p

but actual output is nothing ->

  # getcap /mnt/oneSIS-overlay/lowerdir/usr/bin/ping
  #

for the copy of 'ping' on the tmpfs/overlayfs getcap/setcap works fine ->

  # getcap /usr/bin/ping
  /usr/bin/ping = cap_net_admin,cap_net_raw+p

cheers,
robin
[lustre-discuss] seclabel
Hiya,

I'm updating an image for a root-on-lustre cluster from centos6 to 7 and I've hit a little snag. I can't seem to mount lustre so that it understands seclabel. ie. setcap/getcap don't work. the upshot is that root can use ping (and a few other tools), but users can't.

any idea what I'm doing wrong? from what little I understand about it I think seclabel is a form of xattr.

cheers,
robin
Re: [lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)
Hi Mark,

On Tue, Apr 12, 2016 at 04:49:10PM -0400, Mark Hahn wrote:
>One of our MDSs is crashing with the following:
>
>BUG: unable to handle kernel paging request at deadbeef
>IP: [] iam_container_init+0x18/0x70 [osd_ldiskfs]
>PGD 0
>Oops: 0002 [#1] SMP
>
>The MDS is running 2.5.3-RC1--PRISTINE-2.6.32-431.23.3.el6_lustre.x86_64 with about 2k clients ranging from 1.8.8 to 2.6.0

I saw an identical crash in Sep 2014 when the MDS was put under memory pressure.

>to be related to vm.zone_reclaim_mode=1. We also enabled quotas

zone_reclaim_mode should always be 0. 1 is broken. hung processes perpetually 'scanning' in one zone in /proc/zoneinfo whilst plenty of pages are free in another zone is a sure sign of this issue.

however if you have vm.zone_reclaim_mode=0 now and are still seeing the issue, then I would suspect that lustre's overly aggressive memory affinity code is partially to blame. at the very least it is most likely stopping you from making use of half your MDS ram. see https://jira.hpdd.intel.com/browse/LU-5050

set

  options libcfs cpu_npartitions=1

to fix it. that's what I use on OSS and MDS nodes for all my clusters.

cheers,
robin
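to make both of the fixes above persistent across reboots, drop them into config fragments along these lines (the file names are conventional choices, not mandated):

```shell
# /etc/sysctl.d/99-lustre.conf -- never let the kernel spin in one zone
vm.zone_reclaim_mode = 0

# /etc/modprobe.d/libcfs.conf -- one CPU partition, i.e. no per-NUMA-zone
# thread/memory affinity in Lustre; must be set before modules load
options libcfs cpu_npartitions=1
```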
Re: [Lustre-discuss] ll_ost thread soft lockup
On Mon, Mar 19, 2012 at 07:28:22AM -0600, Kevin Van Maren wrote: You are running 1.8.5, which does not have the fix for the known MD raid5/6 rebuild corruption bug. That fix was released in the Oracle Lustre 1.8.7 kernel patches. Unless you already applied that patch, you might want to run a check of your raid arrays and consider an upgrade (at least patch your kernel with that fix). md-avoid-corrupted-ldiskfs-after-rebuild.patch in the 2.6-rhel5.series (note that this bug is NOT specific to rhel5). This fix does NOT appear to have been picked up by whamcloud. as you say, the md rebuild bug is in all kernels 2.6.32 http://marc.info/?l=linux-raid&m=130192650924540&w=2 the Whamcloud fix is LU-824 which landed in git a tad after 1.8.7-wc1. I also asked RedHat nicely, and they added the same patch to RHEL5.8 kernels, which IMHO is the correct place for a fundamental md fix. so once Lustre supports RHEL5.8 servers, then the patch in Lustre isn't needed any more. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
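if you want to check an array for rebuild-era damage before deciding on the patch, md's online scrub is one way to do it (device name is an example; needs root on the OSS):

```shell
# kick off a consistency scrub of md0
echo check > /sys/block/md0/md/sync_action

# watch progress, then look for inconsistencies
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt   # non-zero after the check = trouble
```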
Re: [Lustre-discuss] exceedingly slow lstats
On Fri, Jan 20, 2012 at 02:35:19PM -0800, John White wrote: Well, I was reading the strace wrong anyway: lstat(../403/a323, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 0.134326 getxattr(../403/a323, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported) 0.18 lstat(../403/a330, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 0.158898 getxattr(../403/a330, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported) 0.19 lstat(../403/a331, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 0.239466 getxattr(../403/a331, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported) 0.12 lstat(../403/a332, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 0.130146 getxattr(../403/a332, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported) 0.12 The getxattr takes an incredibly short amount of time, it's the lstat itself that's taking 0.1+s. it used to be that weird slowdowns and high load could be caused by kernel zone_reclaim confusion, so firstly I'd suggest checking that vm.zone_reclaim_mode=0 everywhere (clients and servers). after that see if turning off read and write_through caches on OSS's helps metadata rates. there's a fair chance that streaming i/o to OSS's is filling OSS ram and pushing inodes/dentries out of OSS vfs cache causing big metadata slowdowns - the more streaming i/o the greater the slowdown. if turning off the data caches fixes the problem for you (ie. it's not faulty hardware or an old lustre version or something else) then there are a couple of different methods that could let you get both data caching and good metadata rates, but first things first... cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
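turning the OSS data caches off is a couple of lctl calls. parameter names are as in the 1.8/2.x obdfilter layer, but treat this as a sketch and check `lctl list_param obdfilter.*` on your version first:

```shell
# on each OSS: disable the read cache and the write-through cache
lctl set_param obdfilter.*.read_cache_enable=0
lctl set_param obdfilter.*.writethrough_cache_enable=0

# revert if metadata rates don't improve
lctl set_param obdfilter.*.read_cache_enable=1
lctl set_param obdfilter.*.writethrough_cache_enable=1
```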
[Lustre-discuss] improving metadata performance [was Re: question about size on MDS (MDT) for lustre-1.8]
rejoining this topic after a couple of weeks of experimentation Re: trying to improve metadata performance - we've been running with vfs_cache_pressure=0 on OSS's in production for over a week now and it's improved our metadata performance by a large factor. - filesystem scans that didn't finish in ~30hrs now complete in a little over 3 hours. so ~10x speedup. - a recursive ls -altrR of my home dir (on a random uncached client) now runs at 2000 to 4000 files/s whereas before it could be 100 files/s. so 20 to 40x speedup. of course vfs_cache_pressure=0 can be a DANGEROUS setting because inodes/dentries will never be reclaimed, so OSS's could OOM. however slabtop shows inodes are 0.89K and dentries 0.21K ie. small, so I expect many sites can (like us) easily cache everything. for a given number of inodes per OST it's easily calculable whether there's enough OSS ram to safely set vfs_cache_pressure=0 and cache them all in slab. continued monitoring of the fs inode growth (== OSS slab size) over time is very important as fs's will inevitably accrue more files... sadly a slightly less extreme vfs_cache_pressure=1 wasn't as successful at keeping stat rates high. sustained OSS cache memory pressure through the day dropped enough inodes that nightly scans weren't fast any more. our current residual issue with vfs_cache_pressure=0 is unexpected. the number of OSS dentries appears to slowly grow over time :-/ it appears that some/many dentries for deleted files are not reclaimed without some memory pressure. any idea why that might be? anyway, I've now added a few lines of code to create a different (non-zero) vfs_cache_pressure knob for dentries. we'll see how that goes... an alternate (simpler) workaround would be to occasionally drop OSS inode/dentry caches, or to set vfs_cache_pressure=100 once in a while, and to just live with a day of slow stat's while the inode caches repopulate. 
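the "easily calculable" check above can be a one-liner. per-object sizes are the ones from slabtop in the message (0.89 KiB per ldiskfs inode, 0.21 KiB per dentry); the inode counts are hypothetical, so substitute your own from 'df -i' on the OSTs:

```shell
# rough slab RAM needed to pin every inode+dentry on one OSS
inodes_per_ost=2000000    # hypothetical: used inodes per OST
osts=6                    # hypothetical: OSTs served by this OSS

awk -v n="$((inodes_per_ost * osts))" 'BEGIN {
    gib = n * (0.89 + 0.21) / (1024 * 1024)   # KiB -> GiB
    printf "slab needed: %.1f GiB\n", gib
}'
```

compare the result against the OSS RAM left over after the data caches; if it fits with room to spare, vfs_cache_pressure=0 is plausible.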
hopefully vfs_cache_pressure=0 also has a net small positive impact on regular i/o due to reduced iops to OSTs, but I haven't tried to measure that. slab didn't steal much ram from our read and write_through caches (we have 48g ram on OSS's and slab went up about 1.6g to 3.3g with the additional cached inodes/dentries) so OSS file caching should be almost unaffected. On Fri, Jan 28, 2011 at 09:45:10AM -0800, Jason Rappleye wrote: On Jan 27, 2011, at 11:34 PM, Robin Humble wrote: limiting the total amount of OSS cache used in order to leave room for inodes/dentries might be more useful. the data cache will always fill up and push out inodes otherwise. I disagree with myself now. I think mm/vmscan.c would probably still call shrink_slab, so shrinkers would get called and some cached inodes would get dropped. The inode and dentry objects in the slab cache aren't so much of an issue as having the disk blocks that each are generated from available in the buffer cache. Constructing the in-memory inode and dentry objects is cheap as long as the corresponding disk blocks are available. Doing the disk reads, depending on your hardware and some other factors, is not. on a test cluster (with read and write_through caches still active and synthetic i/o load) I didn't see a big change in stat rate from dropping OSS page/buffer cache - at most a slowdown for a client 'ls -lR' of ~2x, and usually no slowdown at all. I suspect this is because there is almost zero persistent buffer cache due to the OSS buffer and page caches being punished by file i/o. in the same testing, dropping OSS inode/dentry caches was a much larger effect (up to 60x slowdown with synthetic i/o) - which is why the vfs_cache_pressure setting works. the synthetic i/o wasn't crazily intensive, but did have a working set > OSS mem which is likely true of our production machine. however for your setup with OSS caches off, and from doing tests on our MDS, I agree that buffer caches can be a big effect. 
dropping our MDS buffer cache slows down a client 'lfs find' by ~4x, but dropping inode/dentry caches doesn't slow it down at all, so buffers are definitely important there. happily we're not under any memory pressure on our MDS's at the moment. We went the extreme and disabled the OSS read cache (+ writethrough cache). In addition, on the OSSes we pre-read all of the inode blocks that contain at least one used inode, along with all of the directory blocks. The results have been promising so far. Firing off a du on an entire filesystem, 3000-6000 stats/second is typical. I've noted a few causes of slowdowns so far; there may be more. we see about 2k files/s on the nightly sweeps now. that's with one lfs find running and piping to parallel stat's. I think we can do better with more parallelism in the finds, but 2k is so much better than what it used to be we're fairly happy for now. 2k isn't as good as your stat rates, but we still have OSS caches on, so the rest of our i/o should be benefiting from that. When memory runs low on a client, kswapd
Re: [Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On Thu, Jan 13, 2011 at 05:28:23PM -0500, Kit Westneat wrote: It would probably be better to set: lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M or similar, to limit the read cache to files 32MB in size or less (or whatever you consider small files at your site). That allows the read cache for config files and such, while not thrashing the cache while accessing large files. We should probably change this to be the default, but at the time the read cache was introduced, we didn't know what should be considered a small vs. large file, and the amount of RAM and number of OSTs on an OSS, and the uses varies so much that it is difficult to pick a single correct value for this. limiting the total amount of OSS cache used in order to leave room for inodes/dentries might be more useful. the data cache will always fill up and push out inodes otherwise. Nathan's approach of turning off the caches entirely is extreme, but if it gives us back some metadata performance then it might be worth it. or is there a Lustre or VM setting to limit overall OSS cache size? I presume that Lustre's OSS caches are subject to normal Linux VM pagecache tweakables, but I don't think such a knob exists in Linux at the moment... I was looking through the Linux vm settings and saw vfs_cache_pressure - has anyone tested performance with this parameter? Do you know if this would have any effect on file caching vs. ext4 metadata caching? For us, Linux/Lustre would ideally push out data before the metadata, as the performance penalty for doing 4k reads on the s2a far outweighs any benefits of data caching. good idea. if all inodes are always cached on OSS's then the fs should be far more responsive to stat loads... 4k/inode shouldn't use up too much of the OSS's ram (probably more like 1 or 2k/inode really). 
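the suggested cap can be set persistently from the MGS (as quoted above) or tried out live on an OSS; fsname and OST index here are placeholders:

```shell
# persistent: run once on the MGS, per OST
lctl conf_param testfs-OST0000.ost.readcache_max_filesize=32M

# or temporarily on an OSS, for every OST it serves
lctl set_param obdfilter.*.readcache_max_filesize=32M
```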
anyway, following your idea, we tried vfs_cache_pressure=50 on our OSS's a week or so ago, but hit this within a couple of hours https://bugzilla.lustre.org/show_bug.cgi?id=24401 could have been a coincidence I guess. did anyone else give it a try? BTW, we recently had the opposite problem on a client that scans the filesystem - too many inodes were cached leading to low memory problems on the client. we've had vfs_cache_pressure=150 set on that machine for the last month or so and it seems to help. although a more effective setting in this case was limiting ldlm locks. eg. from the Lustre manual lctl set_param ldlm.namespaces.*osc*.lru_size=1 cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Re: [Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
Hi Nathan, On Thu, Jan 06, 2011 at 05:42:24PM -0700, nathan.dau...@noaa.gov wrote: I am looking for more information regarding the size on MDS feature as it exists for lustre-1.8.x. Testing on our system (which started out as 1.6.6 and is now 1.8.x) indicates that there are many files which do not have the size information stored on the MDT. So, my basic question: under what conditions will the size hint attribute be updated? Is there any way to force the MDT to query the OSTs and update its information? atime (and the MDT size hint) wasn't being updated for most of the 1.8 series due to this bug: https://bugzilla.lustre.org/show_bug.cgi?id=23766 the atime fix is now in 1.8.5, but I'm not sure if anyone has verified whether or not the MDT size hint is now behaving as originally intended. actually, it was never clear to me what (if anything?) ever accessed OBD_MD_FLSIZE... does someone have a hacked 'lfs find' or similar tool? your approach of mounting and searching a MDT snapshot should be possible, but it would seem neater just to have a tool on a client send the right rpc's to the MDS and get the information that way. like you, we are finding that the timescales for our filesystem trawling scripts are getting out of hand, mostly (we think) due to retrieving size information from very busy OSTs. a tool that only hit the MDT and found (filename, uid, gid, approx size) should help a lot. so +1 on this topic. BTW, once you have 1.8.5 on the MDS, then a hack to populate the MDT size hints might be to read 4k from every file in the system. that should update atime and the size hint. please let us know if this works. The end goal of this is to facilitate efficient checks of disk usage on a per-directory basis (essentially we want volume based quotas). a possible approach for your situation would be to chgrp every file under a directory to be the same gid, and then enable (un-enforcing) group quotas on your filesystem. 
then you wouldn't have to search any directories. you would still have to find and chgrp some files nightly, but 'lfs find' should make that relatively quick. unfortunately we also need a breakdown of the uid information in each directory, so this approach isn't sufficient for us. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility hoping to run something once a day on the MDS like the following: lvcreate -s -p r -n mdt_snap /dev/mdt mount -t ldiskfs -o ro /dev/mdt_snap /mnt/snap cd /mnt/snap/ROOT du --apparent-size ./* > volume_usage.log cd / umount /mnt/snap lvremove /dev/mdt_snap Since the data is going to be up to one day old anyway, I don't really mind that the file size is approximate, but it does have to be reasonably close. With the MDT LVM snapshot method I can check the whole 300TB file system in about 3 hours, whereas checking from a client takes weeks. Here is why I am relatively certain that the size-on-MDS attributes are not updated (lightly edited): [r...@mds0 ~]# ls -l /mnt/snap/ROOT/test/rollover/user_acct_file -rw-r--r-- 1 9000 0 Mar 23 2010 /mnt/snap/ROOT/test/rollover/user_acct_file [r...@mds0 ~]# du /mnt/snap/ROOT/test/rollover/user_acct_file 0 /mnt/snap/ROOT/test/rollover/user_acct_file [r...@mds0 ~]# du --apparent-size /mnt/snap/ROOT/test/rollover/user_acct_file 0 /mnt/snap/ROOT/test/rollover/user_acct_file [r...@c448 ~]# ls -l /mnt/lfs0/test/rollover/user_acct_file -rw-r--r-- 1 user group 184435207 Mar 23 2010 /mnt/lfs0/test/rollover/user_acct_file [r...@c448 ~]# du /mnt/lfs0/test/rollover/user_acct_file 180120 /mnt/lfs0/test/rollover/user_acct_file [r...@c448 ~]# du --apparent-size /mnt/lfs0/test/rollover/user_acct_file 180113 /mnt/lfs0/test/rollover/user_acct_file Thanks very much for any answers or suggestions you can provide! 
-Nathan
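the chgrp-plus-group-quota idea from the reply above could look something like this. the gid, group name and paths are made up, and 'lfs find' option spellings vary between Lustre versions, so check 'lfs help find' on yours first:

```shell
# tag everything under the project directory with one gid
chgrp -R projA /mnt/lfs0/projects/projA

# nightly: catch new files that came in with the wrong group
lfs find /mnt/lfs0/projects/projA ! --gid 5000 | xargs -r chgrp projA

# per-directory usage is then just a (non-enforcing) group quota lookup
lfs quota -g projA /mnt/lfs0
```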
Re: [Lustre-discuss] Parallel fortran program bug.
On Thu, Dec 23, 2010 at 03:48:50PM +0100, Roy Dragseth wrote: On Thursday, December 23, 2010 15:18:13 Rick Grubin wrote: We have an occasional problem with parallel fortran programs that open files with status old or unknown returns errors on open. This seems Sounds like bug 17545: https://bugzilla.lustre.org/show_bug.cgi?id=17545 The issue is fixed for v1.8.2 and beyond. Thanks a lot for your quick reply! This seems to be it, we will upgrade next week. if you are using Intel Fortran, then I think your open() failures will probably continue even with latest Lustre, but at a lower rate. see https://bugzilla.lustre.org/show_bug.cgi?id=23978 this bug has flown under the radar a bit as it causes fairly cryptic app failures, and only Intel fortran hits it with any frequency. what the user sees usually just looks like a failed open with an oddly corrupted filename string. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Re: [Lustre-discuss] Issues with Lustre Client 1.8.4 and Server 1.8.1.1
Hi Jagga, On Wed, Oct 13, 2010 at 02:33:35PM -0700, Jagga Soorma wrote: .. start seeing this issue. All my clients are setup with SLES11 and the same packages with the exception of a newer kernel in the 1.8.4 environment due to the lustre dependency: reshpc208:~ # uname -a Linux reshpc208 2.6.27.39-0.3-default #1 SMP 2009-11-23 12:57:38 +0100 x86_64 x86_64 x86_64 GNU/Linux ... open(/proc/9598/stat, O_RDONLY) = 6 read(6, 9598 (gsnap) S 9596 9589 9589 0 ..., 1023) = 254 close(6)= 0 open(/proc/9598/status, O_RDONLY) = 6 read(6, Name:\tgsnap\nState:\tS (sleeping)\n..., 1023) = 1023 close(6)= 0 open(/proc/9598/cmdline, O_RDONLY)= 6 read(6, did you get any further with this? we've just seen something similar in that we had D state hung processes and a strace of ps hung at the same place. in the end our hang appeared to be /dev/shm related, and an 'ipcs -ma' magically caused all the D state processes to continue... we don't have a good idea why this might be. looks kinda like a generic kernel shm deadlock, possibly unrelated to Lustre. sys_shmdt features in the hung process tracebacks that the kernel prints out. if you do 'lsof' do you see lots of /dev/shm entries for your app? the app we saw run into trouble was using HPMPI which is common in commercial packages. does gsnap use HPMPI? we are running vanilla 2.6.32.* kernels with Lustre 1.8.4 clients on this cluster. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
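a quick way to poke at the /dev/shm theory on a hung node might be (assuming lsof is installed; in the case above the ipcs call alone unstuck the D-state processes):

```shell
# who has segments open in /dev/shm?
lsof /dev/shm | head

# which processes are in uninterruptible sleep, and where?
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# list all SysV shared memory segments
ipcs -ma
```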
Re: [Lustre-discuss] Issues with Lustre Client 1.8.4 and Server 1.8.1.1
On Wed, Oct 13, 2010 at 02:33:35PM -0700, Jagga Soorma wrote: Doing a ps just hangs on the system and I need to just close and reopen a session to the effected system. The application (gsnap) is running from the lustre filesystem and doing all IO to the lustre fs. Here is a strace of where ps hangs: one possible cause of hung processes (that's not Lustre related) is the VM tying itself in knots. are your clients NUMA machines? is /proc/sys/vm/zone_reclaim_mode = 0? I guess this explanation is a bit unlikely if your only change is the client kernel version, but you don't say what you changed it from and I'm not familiar with SLES, so the possibility is there, and it's an easy fix (or actually a dodgy workaround) if that's the problem. -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Re: [Lustre-discuss] ost's reporting full
Hey Dr Stu, On Sat, Sep 11, 2010 at 04:27:43PM +0800, Stuart Midgley wrote: We are getting jobs that fail due to no space left on device. BUT none of our lustre servers are full (as reported by lfs df -h on a client and by df -h on the oss's). They are all close to being full, but are not actually full (still have ~300gb of space left) sounds like a grant problem. I've tried playing around with tune2fs -m {0,1,2,3} and tune2fs -r 1024 etc and nothing appears to help. Anyone have a similar problem? We are running 1.8.3 there are a couple of grant leaks that are fixed in 1.8.4 eg. https://bugzilla.lustre.org/show_bug.cgi?id=22755 or see the 1.8.4 release notes. however the overall grant revoking problem is still unresolved AFAICT https://bugzilla.lustre.org/show_bug.cgi?id=12069 and you'll hit that issue more frequently with many clients and small OSTs, or when any OST starts getting full. in your case 300g per OST should be enough headroom unless you have ~4k clients now (assuming 32-64m grants per client), so it's probably grant leaks. there's a recipe for adding up client grants and comparing them to server grants to see if they've gone wrong in bz 22755. cheers, robin
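the bz 22755 recipe amounts to summing the client-side grants and comparing with what the OST thinks it has handed out. the numbers below are mocked so the arithmetic is visible; on a real system replace them with the output of `lctl get_param -n osc.*.cur_grant_bytes` (gathered from every client) and `lctl get_param -n obdfilter.*.tot_granted` (on the OSS):

```shell
# mocked cur_grant_bytes from three clients
client_grants="33554432
67108864
33554432"
# mocked tot_granted from the OSS
oss_tot_granted=150994944

sum=$(printf '%s\n' "$client_grants" | awk '{ s += $1 } END { print s }')
echo "clients think: $sum  OSS thinks: $oss_tot_granted"
[ "$sum" -eq "$oss_tot_granted" ] || \
    echo "grant leak of $((oss_tot_granted - sum)) bytes"
```

a persistent gap between the two numbers is the signature of a leak; remounting the clients is the usual blunt way to reclaim it.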
Re: [Lustre-discuss] SSD caching of MDT
On Thu, Aug 19, 2010 at 01:29:37PM +0100, Gregory Matthews wrote: Article by Jeff Layton: http://www.linux-mag.com/id/7839 anyone have views on whether this sort of caching would be useful for the MDT? My feeling is that MDT reads are probably pretty random but writes might benefit...? if you look at the tiny size of inodes in slabtop on an MDS you'll see that all read ops for most fs's are probably 100% cached in ram by a decent sized MDS. ie. once you have traversed all inodes of a fs once, then likely the MDT's are a write-only media, and the ram of the MDS is a faster iop machine than any SSD could ever be. you are then left with a MDT workload of entirely small writes. that is definitely not a SSD sweet spot - many SSDs will fragment badly and slow down horrendously, which eg. JBODs of 15k rpm SAS disks will not do. basically beware of cheap SSDs, possibly any SSD, and certainly any SSD that isn't an Intel x25-e or better. the Marvell controller SSDs we sadly have many of now, I would not inflict upon any MDT. also, having experimented with ramdisk MDT's (not in production obviously), it is clear that even this 'perfect' media doesn't solve all Lustre iops problems. far from it. usually it just means that you hit algorithmic or numa problems in Lustre MDS code, or (more likely) the ops just flow onto the OSTs and those become the bottleneck instead. basically ramdisk MDT speedups weren't big over even just say, 16 fast FC or SAS disks. SSDs would be in-between if they were behaving perfectly, which would require extensive testing to determine. looking at it a different way, Lustre's statahead kinda works ok, create's are (IIRC) batched so also scale ok, so delete's might be the only workload left where the fastest MDT money can buy would get you any significant benefit... probably not worth the spend for most folks. 
assuming for a moment that SSDs worked as they should, then other Lustre related workloads for which SSDs might be suitable are external journals for OSTs, md bitmaps, or (one day) perhaps ZFS intent logs. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
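for the external-journal idea, the usual ldiskfs recipe looks roughly like this. device names, fsname and NID are placeholders, and journal size and options should be checked against the Lustre manual for your version:

```shell
# make a journal device on the SSD (hypothetical LV /dev/ssd/journal0)
mke2fs -O journal_dev -b 4096 /dev/ssd/journal0

# point the OST's ldiskfs at it at format time
mkfs.lustre --ost --fsname=testfs --mgsnode=mgs@o2ib \
    --mkfsoptions="-J device=/dev/ssd/journal0" /dev/md0
```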
Re: [Lustre-discuss] best practice for lustre clustre startup
On Thu, Jul 01, 2010 at 11:17:31AM -0600, Kevin Van Maren wrote: My (personal) opinion: Lustre clients should always start (mount) automatically. yup Lustre servers should have their services started through heartbeat (or other HA package), if failover is possible (be sure to configure stonith). IMHO that's a bad idea. servers should not start automatically. my objections to automated mount/failover are not Lustre related, but to all layers underneath - as Kevin well knows, mptsas drivers can and do and have screwed up majorly and I'm sure other drivers have too. md is far from smart, and disks are broken in such an infinite amount of weird and wonderful ways that no driver or OS can reasonably be expected to deal with them all :-/ if you have the simple setup of singly-attached storage and a Lustre server just crashed, then why wouldn't it just crash again? we have had that happen. automated startup seems silly in this case - especially if you don't know what the problem was to start with. worst case is if the hardware started corrupting data and crashed the machine, is it really a good idea to reboot, remount, continue corrupting data more, and then keep rebooting until dawn? if you have a more elaborate Lustre setup with HA failover pairs then the above applies, and additionally there are inherent races in both nodes in a pair trying to mount a set of disks if you do not have a third impartial member participating in a failover quorum - not a common HA setup for Lustre, although it probably should be. if a sw raid is assembled on both machines at the same time because of a HA race, then it's likely data will be lost. Lustre mmp should save you from multi-mounting the OST, but obviously not from corruption if the underlying raid is pre-trashed. overall without diagnosing why a machine crashed I fail to see how an automated reboot or failover can possibly be a safe course of action. 
cheers, robin If heartbeat starts automatically, do ensure auto-failback is NOT enabled: fail the resources back manually after you verify the rebooted server is healthy. Whether heartbeat starts automatically seems to be a preference issue. While unlikely, it is possible for an issue to cause Lustre to not start successfully, resulting in a node crash or other issue preventing a login. So if it does start automatically you'll want to be prepared to reboot w/o Lustre (eg, single-user mode). Kevin
Re: [Lustre-discuss] soft lockups on NFS server/Lustre client
On Mon, Oct 12, 2009 at 05:06:28PM +0100, Frederik Ferner wrote: Hi List, on our NFS server exporting our Lustre file system to a number of NFS clients, we've recently started to see kernel: BUG: soft lockup messages. As the locked processes include nfsd, our users are obviously not happy. Around the time when the soft lockup occurs we also see a lot of kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags() messages, but I don't know if this is related. probably not related. we were seeing this too (no NFS involved at all) https://bugzilla.lustre.org/show_bug.cgi?id=20904 and the upshot is that I'm pretty sure it's harmless and a RHEL bug. I filed https://bugzilla.redhat.com/show_bug.cgi?id=526853 but it's probably being ignored. if you have a rhel support contract maybe you can kick it along a bit... dunno about your soft lockups. as I understand it soft lockups themselves aren't harmful as long as they progress eventually. Lustre 1.6.6 isn't exactly recent. have you tried 1.6.7.2 on your NFS exporter? presumably soft lockups could also be saying your re-exporter or OSS's are overloaded or that you have a slow disk or 3 in a RAID... without NFS involved are all your OSTs up to speed? do you still get problems after echo 60 > /proc/sys/kernel/softlockup_thresh cheers, robin We are using Lustre 1.6.6 on all machines, (MDS, OSS, clients). The NFS server/Lustre client with the lockups is running RHEL5.4 with an unpatched RedHat kernel (kernel-2.6.18-92.1.10.el5) with the Lustre modules from Sun. See below for sample logs from the Lustre client/NFS server. I can provide more logs if required. I'm not sure if this a Lustre issue but would appreciate if someone could help. We've not seen it on any other NFS server so far and there seems to be at least some lustre related stuff in the stack trace. Is this a known issue and how can we avoid this? I have not found anything using google and the search on bugzilla.lustre.org. 
At least the BUG warning seems to be a known issue on this kernel. I hope the logs below are readable enough, I tried to find entries where the stack traces don't overlap but this seems to be the best I can find. Oct 9 15:21:27 cs04r-sc-serv-07 kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags() (Tainted: G ) Oct 9 15:21:27 cs04r-sc-serv-07 kernel: Oct 9 15:21:27 cs04r-sc-serv-07 kernel: Call Trace: Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed7d1] set_dentry_child_flags+0xef/0x14d Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed867] remove_watch_no_event+0x38/0x47 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed88e] inotify_remove_watch_locked+0x18/0x3b Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed97c] inotify_rm_wd+0x7e/0xa1 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ede6e] sys_inotify_rm_watch+0x46/0x63 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [8005d28d] tracesys+0xd5/0xe0 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: Oct 9 15:21:27 cs04r-sc-serv-07 kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags() (Tainted: G ) Oct 9 15:21:27 cs04r-sc-serv-07 kernel: Oct 9 15:21:27 cs04r-sc-serv-07 kernel: Call Trace: Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed7d1] set_dentry_child_flags+0xef/0x14d Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed867] remove_watch_no_event+0x38/0x47 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed88e] inotify_remove_watch_locked+0x18/0x3b Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed97c] inotify_rm_wd+0x7e/0xa1 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ede6e] sys_inotify_rm_watch+0x46/0x63 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: BUG: soft lockup - CPU#5 stuck for 10s! 
[nfsd:1] Oct 9 15:21:28 cs04r-sc-serv-07 kernel: CPU 5: Oct 9 15:21:28 cs04r-sc-serv-07 kernel: Modules linked in: vfat fat usb_storage dell_rbu mptctl ipmi_devintf ipmi_si ipmi_msghandler nfs fscache nfsd exportfs lockd nfs_acl auth_rpcgss autofs4 hidp mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) ob dclass(U) lnet(U) lvfs(U) libcfs(U) rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api mlx4_en(U) dm_multipath video sbs backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sr_mod cdrom mlx4_core(U) bnx2 serio_raw pcsp kr sg dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Oct 9 15:21:28 cs04r-sc-serv-07 kernel: Pid: 1, comm: nfsd Tainted: G 2.6.18-92.1.10.el5 #1 Oct 9 15:21:28 cs04r-sc-serv-07 kernel: RIP: 0010:[80064ba7] [80064ba7] .text.lock.spinlock+0x5/0x30 Oct 9 15:21:28 cs04r-sc-serv-07 kernel: RSP: 0018:810044241ac8 EFLAGS: 0286 Oct 9 15:21:28 cs04r-sc-serv-07 kernel: RAX: 81006cb6a1a8 RBX: 81006cb6a178 RCX: 810044241b50 Oct 9 15:21:28 cs04r-sc-serv-07
Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases
On Thu, Sep 10, 2009 at 12:35:54PM +0200, Johann Lombardi wrote: We have attached a new patch to bug 20560 which should address your problem which may happen in rare cases with partial truncates. as we are about to throw users onto the new system, can I ask for a quick update pointing us to the current best guess at a workaround/fix for the 1.8.1 read cache problems please? to me it looks like https://bugzilla.lustre.org/show_bug.cgi?id=20560 is still evolving, but it looks like writethrough_cache=0 should now work (and not crash the OSS) with attachment: https://bugzilla.lustre.org/attachment.cgi?id=25833 so if I patched our OSS's with just this one liner, then would that be enough to run with until the situation has had some time to bed in? or would we be better off with all 4 patches from 20560 applied (and both read cache's still off)? cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Re: [Lustre-discuss] hacking max_sectors
On Wed, Aug 26, 2009 at 04:11:12AM -0600, Andreas Dilger wrote: On Aug 26, 2009 00:46 -0400, Robin Humble wrote: with the patch, 1M i/o's are being fed to md (according to brw_stats), and performance is a little better for RAID6 8+2 with 128k chunks, and a bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed half 512k and half 1M i/o's by Lustre). This was the other question I'd asked internally. If the array is formatted with 64kB chunks then 512k IOs shouldn't cause any read-modify-write operations and (in theory) give the same performance as 1M IOs on a 128kB chunksize array. What is the relative performance of the 64kB and 128kB configurations? on these 1TB SATA RAID6 8+2's and external journals, with 1 client writing to 1 OST, with 2.6.18-128.1.14.el5 + lustre1.8.1 + blkdev/md patches from https://bugzilla.lustre.org/show_bug.cgi?id=20533 so that 128k chunk md gets 1M i/o's and 64k chunk md gets 512k i/o's then -
client max_rpcs_in_flight 8
md chunk   write (MB/s)   read (MB/s)
64k        185            345
128k       235            390
so 128k chunks are 10-30% quicker than 64k in this particular setup on big streaming i/o tests (1G of 1M lmdd's). having said that, 1.6.7.2 servers do better than 1.8.1 on some configs (I haven't had time to figure out why) but the trend of 128k chunks being faster than 64k chunks remains. also if the i/o load was messier and involved smaller i/o's then 64k chunks might claw something back - probably not enough though. BTW, whilst we're on the topic - what does this part of brw_stats mean?
                         read      |      write
disk fragmented I/Os   ios % cum % |  ios % cum %
1:                5742 100 100     |  103186 100 100
this is for the 128k chunk case, where the rest of brw_stats says I'm seeing 1M rpc's and 1M i/o's, but I'm not sure what '1' disk fragmented i/o's means - should it be 0? or does '1' mean unfragmented? sorry for packing too many questions into one email, but these slowish SATA disks seem to need lots of rpc's in flight for good performance. 
32 max_dirty_mb (the default) and 32 max_rpcs_in_flight seems a good magic combo. with that I get:

client max_rpcs_in_flight 32

  md chunk    write (MB/s)    read (MB/s)
  64k         275             450
  128k        395             480

which is a lot faster... with a heavier load of 20 clients hammering 4 OSS's each with 4 R6 8+2 OSTs I still see about a 10% advantage for clients with 32 rpcs.

is there a down side to running clients with max_rpcs_in_flight 32? the initial production machine will be ~1500 clients and ~25 OSS's.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
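[for reference, a sketch of how the client-side tunables discussed above are set with lctl - 1.6/1.8-era parameter names, and the values are just the ones from the tests above, not a recommendation:]

```shell
# set the client tunables discussed above (run as root on each client;
# osc.* matches the OSC for every OST)
lctl set_param osc.*.max_rpcs_in_flight=32
lctl set_param osc.*.max_dirty_mb=32

# read them back to confirm
lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb
```

note these don't persist across remounts - re-apply them from a boot script, or (on versions that support it) set them permanently with lctl conf_param on the MGS.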
[Lustre-discuss] hacking max_sectors
Hiya,

I've had another go at fixing the problem I was seeing a few months ago:
  http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010315.html
and which we are seeing again now as we are setting up a new machine with 128k chunk software raid (md) RAID6 8+2, eg.

  Lustre: test-OST000d: underlying device md5 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=2560

I came up with the attached simple core kernel change which fixes the problem, and seems stable enough under initial stress testing, but a core scsi tweak seems a little drastic to me - is there a better way to do it?

without this patch, and despite raising all disks to a ridiculously huge max_sectors_kb, all Lustre 1M rpc's are still fragmented into two 512k chunks before being sent to md :-/ likely md then aggregates them again 'cos performance isn't totally dismal, which it would be if it was 100% read-modify-writes for each stripe write.

with the patch, 1M i/o's are being fed to md (according to brw_stats), and performance is a little better for RAID6 8+2 with 128k chunks, and a bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed half 512k and half 1M i/o's by Lustre).

the one-liner is a core kernel change, so perhaps some Lustre/kernel block device/md people can look at it and see if it's acceptable for inclusion in standard Lustre OSS kernels, or whether it breaks assumptions in the core scsi layer somehow.

IMHO the best solution would be to apply the patch, and then have a /sys/block/md*/queue/ for md devices so that max_sectors_kb and max_hw_sectors_kb can be tuned without recompiling the kernel... is that possible?
the patch is against 2.6.18-128.1.14.el5-lustre1.8.1

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

--- linux-2.6.18.x86_64.lustre/include/linux/blkdev.h	2009-08-18 17:40:51.0 +1000
+++ linux-2.6.18.x86_64.lustre.hackBlock/include/linux/blkdev.h	2009-08-21 13:47:55.0 +1000
@@ -778,7 +778,7 @@
 #define MAX_PHYS_SEGMENTS 128
 #define MAX_HW_SEGMENTS 128
 #define SAFE_MAX_SECTORS 255
-#define BLK_DEF_MAX_SECTORS 1024
+#define BLK_DEF_MAX_SECTORS 2048
 #define MAX_SEGMENT_SIZE 65536

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Lustre and kernel vulnerability CVE-2009-2692
On Fri, Aug 21, 2009 at 06:41:01PM +0200, Thomas Roth wrote:
>Hi all,
>while trying to fix the recent kernel vulnerability (CVE-2009-2692) we
>found that in most cases, our Lustre 1.6.5.1, 1.6.6 and 1.6.7.2 clients
>seemed to be quite well protected, at least against the published
>exploit: wunderbar_emporium seems to work, but then the root shell
>never appears. Instead, the client freezes, requiring a reset. Anybody
>else with such experiences?

no freezes here. wunderbar_emporium didn't work against rhel/centos 2.6.18-128.4.1.el5 with patchless Lustre 1.6.7.2 after it was patched with the upstream one-liner:
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=e694958388c50148389b0e9b9e9e8945cf0f1b98
no idea if it was exploitable before or not - didn't try.

RedHat's view on this vulnerability is, err, interesting... :-/
  http://kbase.redhat.com/faq/docs/DOC-18065
  https://bugzilla.redhat.com/show_bug.cgi?id=516949

>Employing the recommended workaround by setting vm.mmap_min_addr to 4096

where did you see that recommended? the RHEL based machines I've looked at have this set to 64k, but if they are also running SELinux (which I presume few Lustre machines are?) then they still might be vulnerable I guess.

cheers,
robin

>blew up in our face: in particular machines with older kernels not
>knowing about mmap_min_addr reacted quite irrationally, such as
>segfaulting about every process running on the machine. Crazy things
>that should not be possible.
>Regards, Thomas
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
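[for anyone wanting to check their own boxes, the knob in question can be inspected like this - the 65536 value is the RHEL default mentioned above, and the sysctl.conf line is only a sketch:]

```shell
# read the current low-memory mmap protection threshold (in bytes;
# 0 means NULL-page mappings are allowed, which the CVE-2009-2692
# exploits rely on)
cat /proc/sys/vm/mmap_min_addr

# to raise it at runtime (as root) - 65536 matches the RHEL default above:
#   sysctl -w vm.mmap_min_addr=65536
# and to persist across reboots, add to /etc/sysctl.conf:
#   vm.mmap_min_addr = 65536
```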
Re: [Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
I added this to bugzilla. https://bugzilla.lustre.org/show_bug.cgi?id=20227 cheers, robin On Wed, Jul 15, 2009 at 01:09:33PM -0400, Robin Humble wrote: On Wed, Jul 15, 2009 at 08:46:12AM -0400, Robin Humble wrote: I get a ferocious set of error messages when I mount a 1.6.7.2 filesystem on a b_release_1_8_1 client. is this expected? just to annotate the below a bit in case it's not clear... sorry - should have done that in the first email :-/ 10.8.30.244 is MGS and one MDS, 10.8.30.245 is the other MDS in the failover pair. 10.8.30.201 - 208 are OSS's (one OST per OSS), and the fs is mounted in the usual failover way eg. mount -t lustre 10.8.30@o2ib:10.8.30@o2ib:/system /system from the below (and other similar logs) it kinda looks like the client fails and then renegotiates with all the servers. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: mgc10.8.30@o2ib: Reactivating import Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: Client system-client has started Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 ... last message repeated 17 times ... 
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 looks like it succeeds in the end, but only after a struggle. I don't have any problems with 1.8.1 - 1.8.1 or 1.6.7.2 - 1.6.7.2. servers are rhel5 x86_64 2.6.18-92.1.26.el5 1.6.7.2 + bz18793 (group quota fix). client is rhel5 x86_64 patched 2.6.18-128.1.16.el5-b_release_1_8_1 from cvs 20090712131220 + bz18793 again. BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1? 
I'm confused about which is closest to the final 1.8.1 :-/ cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] mds adjust qunit failed
On Tue, Jul 21, 2009 at 01:50:43PM +0800, Lu Wang wrote:
>Dear list,
>I have gotten over 19000 quota-related errors on one MDS since 18:00
>yesterday like:
>Jul 20 18:24:04 * kernel: LustreError: 10999:0:(quota_master.c:507:mds_quota_adjust()) mds adjust qunit failed! (opc:4 rc:-122)

if you look through the Linux errno header files, you'll find -122 is

  EDQUOT  /* Quota exceeded */

so someone or some group is over quota - either inodes or diskspace.

it would be really good if this message said which uid/gid was over quota, and from which client, and on which filesystem. as you have found, the current message is not very informative and overly verbose.

I was looking at the quota code around this message a few days ago, and it looks like it'd be really easy to add some extra info to the message, but I have yet to test a toy patch I wrote...

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

>Jul 20 18:29:27 * kernel: LustreError: 11007:0:(quota_master.c:507:mds_quota_adjust()) mds adjust qunit failed! (opc:4 rc:-122)
>Jul 21 13:44:27 * kernel: LustreError: 10999:0:(quota_master.c:507:mds_quota_adjust()) mds adjust qunit failed! (opc:4 rc:-122)
># grep master /var/log/messages |wc
>  19628  255058 2665136
>Does anyone know what this means? The MDS is running on
>2.6.9-67.0.22.EL_lustre.1.6.6smp.
>Best Regards
>Lu Wang
>--
>Computing Center, IHEP
>Office: Computing Center, 123 19B Yuquan Road   Tel: (+86) 10 88236012-607
>P.O. Box 918-7                                  Fax: (+86) 10 8823 6839
>Beijing 100049, China                           Email: lu.w...@ihep.ac.cn
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
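[as an aside, a quick way to decode these negative rc values without digging through the headers by hand - the grep path is the usual kernel header location, and the python one-liner is just a convenience fallback:]

```shell
# decode rc:-122 from the LustreError message into an errno name
rc=-122
err=${rc#-}
grep -hw "$err" /usr/include/asm-generic/errno*.h 2>/dev/null \
    || python3 -c "import errno, os; print(errno.errorcode[$err], '-', os.strerror($err))"
```

either way the output names EDQUOT.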
[Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
I get a ferocious set of error messages when I mount a 1.6.7.2 filesystem on a b_release_1_8_1 client. is this expected?

Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: mgc10.8.30@o2ib: Reactivating import
Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: Client system-client has started
Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
... last message repeated 17 times ...
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5

looks like it succeeds in the end, but only after a struggle. I don't have any problems with 1.8.1 clients against 1.8.1 servers, or 1.6.7.2 against 1.6.7.2.

servers are rhel5 x86_64 2.6.18-92.1.26.el5 1.6.7.2 + bz18793 (group quota fix). client is rhel5 x86_64 patched 2.6.18-128.1.16.el5-b_release_1_8_1 from cvs 20090712131220 + bz18793 again.

BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1? I'm confused about which is closest to the final 1.8.1 :-/

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] recreate metada - possible?
On Wed, Jul 15, 2009 at 08:32:27AM -0400, Brian J. Murrell wrote:
>On Wed, 2009-07-15 at 10:53 +0200, Tom Woezel wrote:
>>Now the partition table on the raiddevice got deleted and cannot be
>>recovered.
>
>Ouch. How did it get deleted? How come it cannot be recovered? A
>partition table is nothing more than a small area at the start of a
>disk that contains pointers (i.e. offsets on the disk) to where
>partitions start and end.

if a kernel is still up and looking at the device then /proc/partitions and /sys/block/<disk>/* might well still contain enough valid data from which the previous partition table can be reconstructed. been there, dd'd over that. (almost) all good in the end :-) thankfully not to a Lustre fs, just my home server :-/

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

>Even if it was completely wiped, the process of scanning the entire
>disk looking for signatures that can help identify a likely partition
>beginning and then recreate the partition table is usually quite
>successful. You might want to look into the gpart tool for this.
>
>>The OSTs are ok and the data on those should be fine.
>
>Yes. But all you have is file contents, nothing else.
>
>>Now here is my question: is it possible to create a new MGS and new
>>MDTs and somehow connect the old OSTs to them?
>
>No. There is nothing on the OSTs that indicates what file an object
>belongs to. This is why we are adamant about MDT storage being reliable
>and backed up.
>
>>Is there a way to recreate the metadata with the data which is held on
>>the OSTs?
>
>No.
>
>>I'm deeply grateful for any help or hint on this issue.
>
>Without knowing the whole story of your MDT/RAID saga, I'd say gpart is
>your best bet.
>
>b.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
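[the /proc and /sys trick mentioned above looks roughly like this - purely a diagnostic sketch, and the sfdisk line at the end is illustrative only, with a hypothetical device name:]

```shell
# if the kernel that last read the old partition table is still running,
# the partition offsets survive in procfs/sysfs:
cat /proc/partitions
for part in /sys/block/*/*[0-9]; do
    [ -r "$part/start" ] || continue
    printf '%s: start=%s sectors=%s\n' \
        "${part##*/}" "$(cat "$part/start")" "$(cat "$part/size")"
done
# those start/size pairs (in 512-byte sectors) can then be fed back to
# sfdisk to recreate the table, e.g. (hypothetical device /dev/sdX):
#   echo '<start>,<size>' | sfdisk --force /dev/sdX -N1
```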
Re: [Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
On Wed, Jul 15, 2009 at 10:10:06AM -0400, Brian J. Murrell wrote: On Wed, 2009-07-15 at 08:46 -0400, Robin Humble wrote: Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: mgc10.8.30@o2ib: Reactivating import Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: Client system-client has started Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 ... last message repeated 17 times ... Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version 
negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 These are all LND errors. What versions of OFED are you using on each end? all kernels all compiled with the rhel5 kernel tree's standard OFED. I think 1.3.2 is what's in rhel5.3/centos5.3? looks like it succeeds in the end, but only after a struggle. Is it completely stable and performant after the struggle? Do the error messages stop? the fs's appear to be fine. the error messages are just on the initial mount of the first lustre fs. subsequent mounts of other lustre fs's don't get any messages, so it seems like it's just an extremely noisy protocol/version negotiation the first time the 1.8.1 lnet fires up and tries to talk to 1.6.7.2 servers?? another data point is that the above errors don't happen with 2.6.18-128.1.14.el5 patched with 1.8.0.1 and using the same in-kernel OFED, so it's probably something that's happened between 1.8.0.1 and 1.8.1-pre. or I guess it could be a rhel change between 2.6.18-128.1.14.el5 and 2.6.18-128.1.16.el5, but that seems less likely. I can spin up a 2.6.18-128.1.14.el5 with b_release_1_8_1 if you like... BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1? I'm confused about which is closest to the final 1.8.1 :-/ b_release_1_8_1 is the branch and v1_8_1_RC1 is the tag (i.e. snapshot in time from the branch) which is getting tested from that branch which has the potential to become 1.8.1 if the testing pans out. 
It is entirely possible that even when v1_8_1_RCn becomes the final release, there will be patches dangling on the tip of b_release_1_8_1 that are not release blockers but there in case we need a 1.8.1.1. So the choice is yours. If you want to be using exactly what could potentially be the GA release, you should stick to using the most recent tags. If you want to test ahead of what could be the GA, use the branch tip. cool. thanks for the explanation. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
On Wed, Jul 15, 2009 at 11:59:54AM -0400, Brian J. Murrell wrote:
>On Wed, 2009-07-15 at 11:22 -0400, Robin Humble wrote:
>>another data point is that the above errors don't happen with
>>2.6.18-128.1.14.el5 patched with 1.8.0.1 and using the same in-kernel
>>OFED, so it's probably something that's happened between 1.8.0.1 and
>>1.8.1-pre. or I guess it could be a rhel change between
>>2.6.18-128.1.14.el5 and 2.6.18-128.1.16.el5, but that seems less
>>likely. I can spin up a 2.6.18-128.1.14.el5 with b_release_1_8_1 if
>>you like...
>
>Yeah, it would be a great troubleshooting addition to see if the same
>kernel on the clients and servers with the different lustre versions
>has the same problem. This would isolate the problem either to or away
>from a problem with the difference in OFED stacks.

ok - I made a 2.6.18-128.1.14.el5 with b_release_1_8_1 and it behaves the same as 2.6.18-128.1.16.el5 with b_release_1_8_1. ie. spits out a bunch of errors on the first lustre mount.

the only changes between those rhel .14 and .16 versions look pretty unrelated to IB/lnet, so I guess that was to be expected:

* Sat Jun 27 2009 Jiri Pirko jpi...@redhat.com [2.6.18-128.1.16.el5]
- [mm] prevent panic in copy_hugetlb_page_range (Larry Woodman) [508030 507860]
* Tue Jun 23 2009 Jiri Pirko jpi...@redhat.com [2.6.18-128.1.15.el5]
- [mm] fix swap race condition in fork-gup-race patch (Andrea Arcangeli) [507297 506684]

so I guess the change is between Lustre 1.8.0.1 and b_release_1_8_1-20090712131220 somewhere. if only we had git bisect, and if only I knew how to use it, and if only I had the time to try it... :-)

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
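[for what it's worth, git bisect is less scary than it looks. here's a tiny self-contained toy - a synthetic throwaway repo, nothing to do with the Lustre tree - that finds the first commit introducing a "bug":]

```shell
set -e
# build a throwaway repo: commits 1-3 are good, commits 4-6 contain "bug"
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email you@example.com && git config user.name you
for i in 1 2 3 4 5 6; do
    if [ "$i" -lt 4 ]; then echo ok > f; else echo bug > f; fi
    echo "$i" >> log
    git add f log && git commit -qm "commit $i"
done
# bisect between bad HEAD and good HEAD~5; the test command exits 0 on good
git bisect start HEAD HEAD~5
git bisect run sh -c '! grep -q bug f'   # identifies "commit 4" as first bad
git bisect reset
```

with a real regression the "test command" would be a build-and-mount script, and the good/bad endpoints would be release tags.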
Re: [Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
On Wed, Jul 15, 2009 at 08:46:12AM -0400, Robin Humble wrote: I get a ferocious set of error messages when I mount a 1.6.7.2 filesystem on a b_release_1_8_1 client. is this expected? just to annotate the below a bit in case it's not clear... sorry - should have done that in the first email :-/ 10.8.30.244 is MGS and one MDS, 10.8.30.245 is the other MDS in the failover pair. 10.8.30.201 - 208 are OSS's (one OST per OSS), and the fs is mounted in the usual failover way eg. mount -t lustre 10.8.30@o2ib:10.8.30@o2ib:/system /system from the below (and other similar logs) it kinda looks like the client fails and then renegotiates with all the servers. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: mgc10.8.30@o2ib: Reactivating import Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: Client system-client has started Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 ... last message repeated 17 times ... 
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 looks like it succeeds in the end, but only after a struggle. I don't have any problems with 1.8.1 - 1.8.1 or 1.6.7.2 - 1.6.7.2. servers are rhel5 x86_64 2.6.18-92.1.26.el5 1.6.7.2 + bz18793 (group quota fix). client is rhel5 x86_64 patched 2.6.18-128.1.16.el5-b_release_1_8_1 from cvs 20090712131220 + bz18793 again. BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1? 
I'm confused about which is closest to the final 1.8.1 :-/ cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Announce: Lustre 1.8.0.1 is available!
On Mon, Jun 22, 2009 at 08:30:56PM -0700, Terry Rutledge wrote: Hi all, Lustre 1.8.0.1 is available on the Sun Download Center Site. http://www.sun.com/software/products/lustre/get.jsp the 1.8.0.1 download link on that page looks to be wrong... it should point to 1801, but it points to 180. so currently the 1.8.0.1 page is identical to the 1.8.0 page. cheers, robin The change log and release notes can be read here: http://wiki.lustre.org/index.php/Use:Change_Log_1.8 Thank you for your assistance; as always, you can report issues via Bugzilla (https://bugzilla.lustre.org/) Happy downloading! -- The Lustre Team -- ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] external journal raid1 vs. single disk ext journal + hot spare on raid6
Robin Humble, HPC Systems Analyst, NCI National Facility ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] no handle for file close
On Thu, May 07, 2009 at 10:45:31AM -0500, Nirmal Seenu wrote:
>I am getting quite a few errors similar to the following error on the
>MDS server which is running the latest 1.6.7.1 patched kernel. The
>clients are running 1.6.7 patchless client on 2.6.18-128.1.6.el5 kernel
>and this cluster has 130 nodes/Lustre clients and uses GigE network.
>
>May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(mds_open.c:1567:mds_close()) @@@ no handle for file close ino 772769: cookie 0xcfe66441310829d4 r...@8101ca8a3800 x2681218/t0 o35-fedc91f9-4de7-c789-6bdd-1de1f5e3d...@net_0x2c0a8f109_uuid:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc 0/0
>May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-116) r...@8101ca8a3800 x2681218/t0 o35-fedc91f9-4de7-c789-6bdd-1de1f5e3d...@net_0x2c0a8f109_uuid:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc -116/0
>
>I don't see the same errors on another cluster/Lustre installation with
>2000 Lustre clients which uses Infiniband network.

we see this sometimes when a job that is using a shared library that lives on Lustre is killed - presumably the un-memorymapping of the .so from a bunch of nodes at once confuses Lustre a bit.

what is your inode 772769? eg.

  find /some/lustre/fs/ -inum 772769

if the file is a .so then that would be similar to what we are seeing.

we have this listed in the "probably harmless" section of the errors that we get from Lustre, so if it's not harmless then we'd very much like to know about it :)

this cluster is IB, rhel5, x86_64, 1.6.6 on servers and patchless 1.6.4.3 on clients w/ 2.6.23.17 kernels.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

>I looked at the following bugs 19328, 18946, 18192 and 19085 but I am
>not sure if any of those bugs apply to this error. I would appreciate it
>if someone could help me understand these errors and possibly suggest
>the solution.
>
>TIA
>Nirmal
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] root on lustre and timeouts
On Thu, Apr 30, 2009 at 12:51:00PM -0400, Brian J. Murrell wrote: On Thu, 2009-04-30 at 11:48 -0400, Robin Humble wrote: BTW, as was pointed out in one talk of this years LUG, Lustre 1.8's OSS read cache should help things like root-on-Lustre because small commonly used files will likely be cached in the OSS's and won't result in disk accesses. Yes, imagine what the ROSS cache can do for 150 clients all booting (and executing the same scripts/binaries) at the same time. Imagine what the OSS disk did/does before the cache. :-) hopefully most of the frequently used parts of the OS are in page cache on clients after the first read or two, but if there are new parts accessed (or if everything boots at once) then yes, the OSS read cache should definitely help lots. currently the only load we notice from root-on-Lustre is on the MDS, but I can't say we've been actively monitoring and categorising all the traffic - we really haven't felt the need because there haven't been slowdowns to speak of - that's a good thing :) actually, just thinking about it, it'd be good if you could tell Lustre (llite) to be lazy about re-stat'ing files in what is mostly an un-changing read-only image. is it possible to do this? Certainly, I am not without bias, but the feature set of 1.8 looks compelling enough to make me want to upgrade my own little dogfood cluster here to 1.8. :-) yes, the features are shiny :) cheers, robin ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] root on lustre and timeouts
we are (happily) using read-only root-on-Lustre in production with oneSIS, but have noticed something odd...

if a root-on-Lustre client node has been up for more than 10 or 12 hours then it survives an MDS failure/failover/reboot event(*), but if the client is newly rebooted and has been up for less than this time, then it doesn't successfully reconnect after an MDS event and the node is ~dead.

by trial and error I've also found that if I rsync /lib64, /bin, and /sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches', and symlink the rest of the dirs to Lustre, then the node sails through MDS events. leaving out any one of the dirs/steps leads to a dead node. so it looks like the Lustre kernel's recovery process is somehow tied to userspace via apps in /bin and /sbin?

I can reproduce the weird 10-12hr behaviour at will by changing the clock on nodes in a toy Lustre test setup. ie.
- servers and client all have the correct time
- reboot client node
- stop ntpd everywhere
- use 'date --set ...' to set all clocks to be X hours in the future
- cause an MDS event(*)
- wait for recovery to complete
- if X = ~10 to 12 then the client will be dead

it's no big deal to put those 3 dirs into ramdisk as they're really small (and the part-on-ramdisk model is nice and flexible too), so we'll probably move to running in this way anyway, but I'm still curious as to why a kernel-only system like Lustre
a) cares about userspace at all during recovery
b) has a 10-12hr timescale :-)

changing the contents of /proc/sys/lnet/upcall to some path stat'able without Lustre being up doesn't change anything.

BTW, OSS reboot/failover is handled fine with root on Lustre, as are regular (non-root-on-Lustre) clients - this behaviour seems to be limited to the MDS/MGS failure case when all/almost-all of the OS is on Lustre.

our setup is patchless 1.6.4.3 clients, 1.6.6 servers, rhel5.2/5.3 x86_64, but the behaviour seems the same with much newer Lustre too, eg. patched b_release_1_8_0.
cheers, robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

(*) umount mdt and mgs, lustre_rmmod, wait 10 mins, mount them again

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
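The rsync-to-ramdisk workaround described above amounts to something like the sketch below. It is illustrative only, not taken from oneSIS: `copy_to_ramdisk` and both path arguments are hypothetical names, `cp -a` stands in for the rsync used in the mail, and whether the drop_caches step is really required is exactly the open question of the post.

```shell
# sketch of the part-on-ramdisk workaround described above (hypothetical
# helper; in practice ram_root would be a tmpfs mount, and /bin, /sbin and
# /lib64 would then be symlinked/bind-mounted at the copies).
copy_to_ramdisk() {
    src_root=$1    # normally "/", the Lustre-backed root
    ram_root=$2    # e.g. a tmpfs mounted at /ram
    for d in bin sbin lib64; do
        mkdir -p "$ram_root/$d"
        cp -a "$src_root/$d/." "$ram_root/$d/"
    done
    # drop cached pages so future reads come from the copies, not Lustre
    # (silently skipped when not root or not on Linux)
    { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true
}
```

after the copy, the remaining directories stay symlinked to Lustre, as in the mail.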
Re: [Lustre-discuss] tuning max_sectors
On Fri, Apr 17, 2009 at 07:25:30AM -0400, Brian J. Murrell wrote:
>On Fri, 2009-04-17 at 13:08 +0200, Götz Waschk wrote:
>>Lustre: zn_atlas-OST: underlying device cciss/c1d0p1 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=2048

we have a similar problem.
  Lustre: short-OST0001: underlying device md0 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=1280

>>What can I do?
>IIRC, that's in reference to /sys/block/$device/queue/max_sectors_kb. If you inspect that it should report 1024. You can simply echo a new value into that the way you can with /proc variables.

sadly, that sys entry doesn't exist:
  cat: /sys/block/md0/queue/max_sectors_kb: No such file or directory
do you have any other suggestions?

perhaps the devices below md need looking at? they all report /sys/block/sd*/queue/max_sectors_kb == 512. we have an md raid6 8+2.

uname -a
  Linux sox2 2.6.18-92.1.10.el5_lustre.1.6.6.fixR5 #2 SMP Wed Feb 4 16:58:30 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
(which is 1.6.6 + the patch from bz 15428 which is (I think) now in 1.6.7.1)

cat /proc/mdstat
...
md0 : active raid6 sdc[0] sdl[9] sdk[8] sdj[7] sdi[6] sdh[5] sdg[4] sdf[3] sde[2] sdd[1]
      5860595712 blocks level 6, 64k chunk, algorithm 2 [10/10] [UU]
      in: 64205147 reads, 97489370 writes; out: 3730773413 reads, 3281459807 writes
      983790 in raid5d, 498868 out of stripes, 4280451425 handle called
      reads: 0 for rmw, 709671189 for rcw. zcopy writes: 1573400576, copied writes: 20983045
      0 delayed, 0 bit delayed, 0 active, queues: 0 in, 0 out
      0 expanding overlap

cheers, robin
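If the fix is indeed to tune the sd devices underneath md, the loop is simple. A sketch - the helper name is made up, the 1024 value is just the figure the Lustre message suggests, and the sysfs root is a parameter purely so the function can be exercised against a fake directory tree rather than a live system:

```shell
# raise max_sectors_kb on every sd* disk under the given sysfs root
# (default /sys). run as root; silently skips entries it cannot write.
raise_max_sectors() {
    sysroot=${1:-/sys}
    val=${2:-1024}
    for f in "$sysroot"/block/sd*/queue/max_sectors_kb; do
        [ -w "$f" ] || continue
        echo "$val" > "$f"
    done
}
# e.g. as root: raise_max_sectors /sys 1024
```

whether md (and Lustre above it) then picks the new value up is a separate question, as the thread notes.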
Re: [Lustre-discuss] e2fsck
On Sat, Feb 21, 2009 at 04:13:49PM -0700, Andreas Dilger wrote:
>On Feb 21, 2009 01:09 -0500, Robin Humble wrote:
>>On Fri, Feb 20, 2009 at 02:10:50PM -0700, Andreas Dilger wrote:
>>>On Feb 19, 2009 20:42 -0500, Robin Humble wrote:
>>>>in 5 out of 6 e2fsck's I do after an OSS crash, I get one free blocks count wrong and often a bitmap in a group that wants to be corrected. is this normal? or is it an ldiskfs or an e2fsck bug?
>>>Do you have the MMP feature enabled?
>>no, MMP is off. there is a small chance that this is the first time the partitions have been fsck'd since MMP was turned off though - I can't be sure about that.
>That would probably be the cause - the MMP function uses a single block, and it needs to be freed by e2fsck when the feature is disabled. We should probably fix tune2fs to do this at the time MMP is turned off.

awesome diagnosis!

# e2fsck -f /dev/md0
e2fsck 1.40.11.sun1 (17-June-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
short-OST: 13/366190592 files (7.7% non-contiguous), 22998875/1464758400 blocks
# tune2fs -O ^mmp /dev/md0
tune2fs 1.40.11.sun1 (17-June-2008)
# e2fsck -f /dev/md0
e2fsck 1.40.11.sun1 (17-June-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #0 (31222, counted=31223). Fix<y>? yes
Free blocks count wrong (1441759525, counted=1441759526). Fix<y>? yes

short-OST: ***** FILE SYSTEM WAS MODIFIED *****
short-OST: 13/366190592 files (7.7% non-contiguous), 22998874/1464758400 blocks

cheers, robin
Re: [Lustre-discuss] e2fsck
On Fri, Feb 20, 2009 at 02:10:50PM -0700, Andreas Dilger wrote:
>On Feb 19, 2009 20:42 -0500, Robin Humble wrote:
>>in 5 out of 6 e2fsck's I do after an OSS crash, I get one free blocks count wrong and often a bitmap in a group that wants to be corrected. is this normal? or is it an ldiskfs or an e2fsck bug?
>Do you have the MMP feature enabled?

no, MMP is off. there is a small chance that this is the first time the partitions have been fsck'd since MMP was turned off though - I can't be sure about that.

we have MMP off because when e2fsck or tune2fs crashes (eg. out of memory, or when tune2fs goes recursively looking for journal devices that don't exist) then it makes the MMP'd partition unusable.

cheers, robin

>>rhel5 x86_64
>>e2fsprogs-1.40.11.sun1-0redhat
>>kernel-lustre-smp-2.6.18-92.1.10.el5_lustre.1.6.6
>>
>>cheers, robin
>>
>>[r...@sox2 ~]# e2fsck -f /dev/md5
>>e2fsck 1.40.11.sun1 (17-June-2008)
>>Pass 1: Checking inodes, blocks, and sizes
>>Pass 2: Checking directory structure
>>Pass 3: Checking directory connectivity
>>Pass 4: Checking reference counts
>>Pass 5: Checking group summary information
>>Block bitmap differences: -107639 Fix<y>? yes
>>Free blocks count wrong for group #3 (19179, counted=19180). Fix<y>? yes
>>Free blocks count wrong (8199819, counted=8199820). Fix<y>? yes
>>
>>system-OST0001: ***** FILE SYSTEM WAS MODIFIED *****
>>system-OST0001: 133986/3055616 files (1.2% non-contiguous), 4007188/12207008 blocks
>>
>>[r...@sox2 ~]# e2fsck -f /dev/md6
>>e2fsck 1.40.11.sun1 (17-June-2008)
>>home-OST0001: recovering journal
>>Pass 1: Checking inodes, blocks, and sizes
>>Pass 2: Checking directory structure
>>Pass 3: Checking directory connectivity
>>Pass 4: Checking reference counts
>>Pass 5: Checking group summary information
>>Free blocks count wrong for group #3 (23432, counted=23433). Fix<y>? yes
>>Free blocks count wrong (131098913, counted=131098914). Fix<y>? yes
>>
>>home-OST0001: ***** FILE SYSTEM WAS MODIFIED *****
>>home-OST0001: 26848/33513472 files (2.4% non-contiguous), 2934270/134033184 blocks
>>
>>[r...@sox2 ~]# e2fsck -f /dev/md7
>>e2fsck 1.40.11.sun1 (17-June-2008)
>>apps-OST0001: recovering journal
>>Pass 1: Checking inodes, blocks, and sizes
>>Pass 2: Checking directory structure
>>Pass 3: Checking directory connectivity
>>Pass 4: Checking reference counts
>>Pass 5: Checking group summary information
>>Free blocks count wrong for group #3 (23432, counted=23433). Fix<y>? yes
>>Free blocks count wrong (34865220, counted=34865221). Fix<y>? yes
>>
>>apps-OST0001: ***** FILE SYSTEM WAS MODIFIED *****
>>apps-OST0001: 45904/9166848 files (3.9% non-contiguous), 1794027/36659248 blocks
>>
>>[r...@sox2 ~]#
>>[r...@sox2 ~]# e2fsck -f /dev/md15
>>e2fsck 1.40.11.sun1 (17-June-2008)
>>system-OST: recovering journal
>>Pass 1: Checking inodes, blocks, and sizes
>>Pass 2: Checking directory structure
>>Pass 3: Checking directory connectivity
>>Pass 4: Checking reference counts
>>Pass 5: Checking group summary information
>>Free blocks count wrong for group #3 (20647, counted=20648). Fix<y>? yes
>>Free blocks count wrong (8115827, counted=8115828). Fix<y>? yes
>>
>>system-OST: ***** FILE SYSTEM WAS MODIFIED *****
>>system-OST: 134002/3055616 files (1.2% non-contiguous), 4091180/12207008 blocks
>>
>>[r...@sox2 ~]# e2fsck -f /dev/md16
>>e2fsck 1.40.11.sun1 (17-June-2008)
>>home-OST: recovering journal
>>Pass 1: Checking inodes, blocks, and sizes
>>Pass 2: Checking directory structure
>>Pass 3: Checking directory connectivity
>>Pass 4: Checking reference counts
>>Pass 5: Checking group summary information
>>
>>home-OST: ***** FILE SYSTEM WAS MODIFIED *****
>>home-OST: 26831/33513472 files (2.1% non-contiguous), 2951394/134033184 blocks
>>
>>[r...@sox2 ~]# e2fsck -f /dev/md17
>>e2fsck 1.40.11.sun1 (17-June-2008)
>>apps-OST: recovering journal
>>Pass 1: Checking inodes, blocks, and sizes
>>Pass 2: Checking directory structure
>>Pass 3: Checking directory connectivity
>>Pass 4: Checking reference counts
>>Pass 5: Checking group summary information
>>Free blocks count wrong for group #3 (3046, counted=3047). Fix<y>? yes
>>Free blocks count wrong (34976431, counted=34976432). Fix<y>? yes
>>
>>apps-OST: ***** FILE SYSTEM WAS MODIFIED *****
>>apps-OST: 45798/9166848 files (3.7% non-contiguous), 1682816/36659248 blocks

>Cheers, Andreas
>--
>Andreas Dilger
>Sr. Staff Engineer, Lustre Group
>Sun Microsystems of Canada, Inc.
[Lustre-discuss] e2fsck
in 5 out of 6 e2fsck's I do after an OSS crash, I get one free blocks count wrong and often a bitmap in a group that wants to be corrected. is this normal? or is it an ldiskfs or an e2fsck bug?

rhel5 x86_64
e2fsprogs-1.40.11.sun1-0redhat
kernel-lustre-smp-2.6.18-92.1.10.el5_lustre.1.6.6

cheers, robin

[r...@sox2 ~]# e2fsck -f /dev/md5
e2fsck 1.40.11.sun1 (17-June-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -107639 Fix<y>? yes
Free blocks count wrong for group #3 (19179, counted=19180). Fix<y>? yes
Free blocks count wrong (8199819, counted=8199820). Fix<y>? yes

system-OST0001: ***** FILE SYSTEM WAS MODIFIED *****
system-OST0001: 133986/3055616 files (1.2% non-contiguous), 4007188/12207008 blocks

[r...@sox2 ~]# e2fsck -f /dev/md6
e2fsck 1.40.11.sun1 (17-June-2008)
home-OST0001: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #3 (23432, counted=23433). Fix<y>? yes
Free blocks count wrong (131098913, counted=131098914). Fix<y>? yes

home-OST0001: ***** FILE SYSTEM WAS MODIFIED *****
home-OST0001: 26848/33513472 files (2.4% non-contiguous), 2934270/134033184 blocks

[r...@sox2 ~]# e2fsck -f /dev/md7
e2fsck 1.40.11.sun1 (17-June-2008)
apps-OST0001: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #3 (23432, counted=23433). Fix<y>? yes
Free blocks count wrong (34865220, counted=34865221). Fix<y>? yes

apps-OST0001: ***** FILE SYSTEM WAS MODIFIED *****
apps-OST0001: 45904/9166848 files (3.9% non-contiguous), 1794027/36659248 blocks

[r...@sox2 ~]#
[r...@sox2 ~]# e2fsck -f /dev/md15
e2fsck 1.40.11.sun1 (17-June-2008)
system-OST: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #3 (20647, counted=20648). Fix<y>? yes
Free blocks count wrong (8115827, counted=8115828). Fix<y>? yes

system-OST: ***** FILE SYSTEM WAS MODIFIED *****
system-OST: 134002/3055616 files (1.2% non-contiguous), 4091180/12207008 blocks

[r...@sox2 ~]# e2fsck -f /dev/md16
e2fsck 1.40.11.sun1 (17-June-2008)
home-OST: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

home-OST: ***** FILE SYSTEM WAS MODIFIED *****
home-OST: 26831/33513472 files (2.1% non-contiguous), 2951394/134033184 blocks

[r...@sox2 ~]# e2fsck -f /dev/md17
e2fsck 1.40.11.sun1 (17-June-2008)
apps-OST: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #3 (3046, counted=3047). Fix<y>? yes
Free blocks count wrong (34976431, counted=34976432). Fix<y>? yes

apps-OST: ***** FILE SYSTEM WAS MODIFIED *****
apps-OST: 45798/9166848 files (3.7% non-contiguous), 1682816/36659248 blocks
[Lustre-discuss] speedy server shutdown
Hi,

when shutting down our OSS's and then MDS's we often wait 330s for each set of umount's to finish eg.
  Feb 2 03:20:06 xemds2 kernel: Lustre: Mount still busy with 68 refs, waiting for 330 secs...
  Feb 2 03:20:11 xemds2 kernel: Lustre: Mount still busy with 68 refs, waiting for 325 secs...
  ...
is there a way to speed this up?

we're interested in the (perhaps unusual) case where all clients are gone because the power has failed, and the Lustre servers are running on UPS and need to be shut down ASAP. the tangible reward for a quick shutdown is that we can buy a lower capacity (cheaper) UPS if we can reliably and cleanly shutdown all the Lustre servers in 10 mins, and preferably 3 minutes. if we're tweaking timeouts to do this then hopefully we can tweak them just before the shutdown and avoid running short timeouts in normal operation.

I'm probably missing something obvious, but I have looked through a bunch of /proc/{fs/lustre,sys/lnet,sys/lustre} entries and the Operations Manual and I can't actually see where the default 330s comes from... ??? it seems to be quite repeatable for both OSS's and MDS's.

we're using Lustre 1.6.6 or 1.6.5.1 on servers and patchless 1.6.4.3 on clients with x86_64 RHEL 5.2 everywhere.

thanks for any help!

cheers, robin
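One small win, independent of where the 330s wait itself comes from: umount the targets on each server concurrently rather than serially, so the per-target waits overlap instead of adding up. A sketch - the helper name and mount points are hypothetical, and this does nothing about the length of any single wait:

```shell
# umount all given Lustre targets concurrently and wait for all of them;
# the worst case is then one 330s wait rather than one per target.
parallel_umount() {
    for m in "$@"; do
        umount "$m" &
    done
    wait    # returns once every background umount has finished
}
# e.g. on an OSS: parallel_umount /mnt/ost0 /mnt/ost1 /mnt/ost2
```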
Re: [Lustre-discuss] open() ENOENT bug
On Thu, Oct 30, 2008 at 02:05:57PM +0100, Peter Kjellstrom wrote:
>On Thursday 30 October 2008, Brian J. Murrell wrote:
>>On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
>>>we have a user with simultaneously starting fortran runs that fail about 10% of the time because Lustre sometimes returns ENOENT instead of EACCES to an open() request on a read-only file.
>>I can reproduce this on 1.6.6 as well using your reproducer.
>We have also seen this bug on our systems (reported by a user running a Fortran code). We have servers with both 1.4 (2.6.9-55.0.9.EL_lustre.1.4.11.1smp) and 1.6 (2.6.18-8.1.14.el5_lustre.1.6.4.2smp) lustre. The error is seen towards both server versions from a cluster with patchless 1.6.5.1 clients running centos-5.2.x86_64 (2.6.18-92.1.13.el5). However the error is not seen from another cluster running _patched_ 1.6.5.1 on centos-4.x86_64 (2.6.9-67.0.7.EL_lustre.1.6.5.1smp).

I dug up an old 2.6.9-67.0.7.EL_lustre.1.6.5.1 + IB kernel (who'd have thought it'd boot with a RHEL5 userland!? :-) and you are right - my openFileMinimal test case runs without problem. ie. 2.6.9 seems a lot more robust than 2.6.18 and onwards.

however, when running ~10 copies of the below fortran code with the above RHEL4 + 1.6.5.1 kernel, several of the copies of the code always die with:
  Fortran runtime error: Stale NFS file handle

      program blah
      implicit none
      integer i
      do i=1,1000
         open(3,file='file',status='old')
         close(3)
      enddo
      stop
      end

so although my cut-down C code reproducer doesn't trigger anything, it seems Lustre still has issues with the real fortran code. the user's jobs would probably run ok in this RHEL4 environment though as they don't run 10 copies at once. it's a slightly different variant of the bug as well (different error code), or maybe it's just a totally different bug.

cheers, robin

>/Peter
>>Can you file a bug in our bugzilla about it? Please include your reproducer program.
>>
>>b.
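For anyone wanting to poke at this without a fortran compiler, a rough shell analogue of the loop above: open a read-only file from several concurrent processes and count failures. The helper name and defaults are made up. On a local filesystem every open should succeed; reproducing the actual failures needs Lustre, many concurrent copies, and some luck.

```shell
# open a read-only file repeatedly from several concurrent processes and
# report how many opens failed - a crude analogue of the fortran loop above.
open_stress() {
    file=$1; procs=${2:-10}; iters=${3:-100}
    fails=$(mktemp)
    p=0
    while [ "$p" -lt "$procs" ]; do
        (
            i=0
            while [ "$i" -lt "$iters" ]; do
                # each cat is an open()+read() of the shared read-only file
                cat "$file" > /dev/null 2>/dev/null || echo fail >> "$fails"
                i=$((i+1))
            done
        ) &
        p=$((p+1))
    done
    wait
    wc -l < "$fails"
    rm -f "$fails"
}
```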
Re: [Lustre-discuss] open() ENOENT bug
On Thu, Oct 30, 2008 at 08:28:05AM -0400, Brian J. Murrell wrote:
>On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
>>we have a user with simultaneously starting fortran runs that fail about 10% of the time because Lustre sometimes returns ENOENT instead of EACCES to an open() request on a read-only file.
>I can reproduce this on 1.6.6 as well using your reproducer.

thanks for looking into it so quickly.

>Can you file a bug in our bugzilla about it? Please include your reproducer program.

https://bugzilla.lustre.org/show_bug.cgi?id=17545

cheers, robin
Re: [Lustre-discuss] raid5 patches for rhel5
On Fri, Aug 01, 2008 at 01:51:36PM -0600, Andreas Dilger wrote:
>On Aug 01, 2008 09:38 -0400, Robin Humble wrote:
>>done, and yes, performance is largely the same as RHEL4. cool!
>>
>>Version 1.03      --Sequential Output-- --Sequential Input- --Random-
>>                  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>Machine  Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>>rhel4 oss 16G:256k 84624  99 842138  92 310044  91 77675  99 491239  96 285.8  10
>>rhel5 oss 16G:256k 86085  99 827731  95 327007  97 79639 100 495487  98 456.2  18
>>
>>streaming writes are down marginally on rhel5, but seeks/s are up 50%.
>Good to know, thanks.
>>BTW - the above is with 1.6.4.3 clients.
>Is this with 1.6.5 servers or 1.6.4.3 servers?

that's with 1.6.5.1 RHEL5 servers.

>>1.6.5.1 clients still perform badly for us. eg.
>Have you tried disabling the checksums?
>  lctl set_param osc.*.checksums=0

yes, checksums were disabled.

>Note that 1.6.5 clients <-> 1.6.5 servers with checksums enabled will perform better than mixed client/server because 1.6.5 has a more efficient checksum algorithm.

>>Version 1.03      --Sequential Output-- --Sequential Input- --Random-
>>                  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>Machine  Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>>          16G:256k 77216  99 462659 100 296050  96 68100  81 648350  93 422.2  13
>>
>>which shows better streaming writes, but ~1/2 the streaming read speed :-(
>You are getting that backward... 55% of the previous write speed, 90% of the previous overwrite speed, and 130% of the previous read speed.

doh! yes, backwards... that was patchless 2.6.23 clients BTW.

>>>Note that there are also similar performance improvements for RAID-6.
>>I can't see the RAID6 patches in the tree for RHEL5... am I missing something?
>Sigh, RAID6 patches were ported to RHEL4, but not RHEL5... I've filed bug 16587 about that, but have no idea when it will be completed.

cool - thanks!

cheers, robin
Re: [Lustre-discuss] raid5 patches for rhel5
On Sun, Jul 20, 2008 at 11:08:41PM -0600, Andreas Dilger wrote:
>On Jul 18, 2008 08:39 -0400, Robin Humble wrote:
>>I notice that Lustre 1.6.5 brings with it the md layer RAID5 patches for RHEL5 kernels. thanks! :-)
>>are all the RHEL4 optimisations there, so we should get the same performance if we now move our OSS's to RHEL5?
>That is my understanding, yes.

done, and yes, performance is largely the same as RHEL4. cool!

Version 1.03      --Sequential Output-- --Sequential Input- --Random-
                  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
rhel4 oss 16G:256k 84624  99 842138  92 310044  91 77675  99 491239  96 285.8  10
rhel5 oss 16G:256k 86085  99 827731  95 327007  97 79639 100 495487  98 456.2  18

streaming writes are down marginally on rhel5, but seeks/s are up 50%.

BTW - the above is with 1.6.4.3 clients. 1.6.5.1 clients still perform badly for us. eg.

Version 1.03      --Sequential Output-- --Sequential Input- --Random-
                  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
          16G:256k 77216  99 462659 100 296050  96 68100  81 648350  93 422.2  13

which shows better streaming writes, but ~1/2 the streaming read speed :-(

>Note that there are also similar performance improvements for RAID-6.

I can't see the RAID6 patches in the tree for RHEL5... am I missing something?

cheers, robin
Re: [Lustre-discuss] 1.6.5.1 OSS crashes
On Sun, Jul 20, 2008 at 08:40:19AM -0400, Mag Gam wrote:
>I am trying to understand. What was the problem? How does SD_IOSTATS affect the crash? How did you disable this?

the comments describe the bug:
  https://bugzilla.lustre.org/show_bug.cgi?id=16404#c22
which from a quick look seems like an SMP locking issue around the statistics collection that presumably, under some circumstances, can cause an overflow and a crash.

the way to disable it is to rebuild the patched-by-Lustre RHEL kernel with the CONFIG_SD_IOSTATS option turned off.

>Sorry for a newbie question

no probs. let me know if you need a recipe for patching and rebuilding this kernel. I should really write it all down before I forget anyway... there are most likely descriptions for patching and building kernels on the Lustre wiki too.

cheers, robin

>On Sun, Jul 20, 2008 at 4:54 AM, Robin Humble [EMAIL PROTECTED] wrote:
>>On Fri, Jul 18, 2008 at 09:02:36AM -0400, Brian J. Murrell wrote:
>>>On Fri, 2008-07-18 at 05:52 -0400, Robin Humble wrote:
>>>>Hi, I'm seeing coordinated OSS crashes with Lustre 1.6.5.1. our RHEL4 OSS have been stable for ~months with these kernels:
>>>>  kernel-lustre-smp-2.6.9-67.0.4.EL_lustre.1.6.4.3
>>>>  kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.4.2
>>>>but have crashed hard, twice, about 10hrs apart as soon as we started using this kernel:
>>>>  kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1
>>>Can you try rebuilding the kernel, disabling SD_IOSTATS?
>>done. I rebuilt using the stock kernel's InfiniBand stack and
>>  # CONFIG_SD_IOSTATS is not set
>>
>>% cexec -p oss: uptime
>>oss x17: 18:45:07 up 1 day, 30 min, 1 user, load average: 4.97, 7.00, 6.27
>>oss x18: 18:45:07 up 1 day, 23 min, 1 user, load average: 4.18, 5.78, 5.71
>>oss x19: 18:45:07 up 1 day, 23 min, 1 user, load average: 5.18, 5.66, 4.60
>>
>>which is > the 10hrs it was crashing at before. good guess about the cause of the problem! :-)
>>
>>maybe that rhel4 1.6.5.1 kernel rpm needs a respin then? seems like a fairly critical issue... :-/
>>
>>cheers, robin
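The config change itself is one line in the kernel .config before rebuilding; the rpm/srpm rebuild steps around it are distro-specific and not shown. A sketch - the helper name is made up, and the edit mimics how `make oldconfig` records a disabled option:

```shell
# turn a kernel .config option off, e.g. CONFIG_SD_IOSTATS, writing the
# "# CONFIG_FOO is not set" form that kbuild expects.
disable_kconfig() {
    opt=$1; cfg=$2
    sed -i "s/^${opt}=.*/# ${opt} is not set/" "$cfg"
}
# e.g.: disable_kconfig CONFIG_SD_IOSTATS .config   # then: make oldconfig && make
```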
Re: [Lustre-discuss] 1.6.5.1 OSS crashes
On Fri, Jul 18, 2008 at 09:02:36AM -0400, Brian J. Murrell wrote:
>On Fri, 2008-07-18 at 05:52 -0400, Robin Humble wrote:
>>Hi, I'm seeing coordinated OSS crashes with Lustre 1.6.5.1. our RHEL4 OSS have been stable for ~months with these kernels:
>>  kernel-lustre-smp-2.6.9-67.0.4.EL_lustre.1.6.4.3
>>  kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.4.2
>>but have crashed hard, twice, about 10hrs apart as soon as we started using this kernel:
>>  kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1
>Can you try rebuilding the kernel, disabling SD_IOSTATS?

done. I rebuilt using the stock kernel's InfiniBand stack and
  # CONFIG_SD_IOSTATS is not set

% cexec -p oss: uptime
oss x17: 18:45:07 up 1 day, 30 min, 1 user, load average: 4.97, 7.00, 6.27
oss x18: 18:45:07 up 1 day, 23 min, 1 user, load average: 4.18, 5.78, 5.71
oss x19: 18:45:07 up 1 day, 23 min, 1 user, load average: 5.18, 5.66, 4.60

which is > the 10hrs it was crashing at before. good guess about the cause of the problem! :-)

maybe that rhel4 1.6.5.1 kernel rpm needs a respin then? seems like a fairly critical issue... :-/

cheers, robin
Re: [Lustre-discuss] Looping in __d_lookup
On Tue, May 20, 2008 at 10:24:25PM +0200, Jakob Goldbach wrote:
>>Hm, so you actually have a circular loop?
>Yes - I've asked for help on the OpenVZ list as well - Pavel Emelyanov provided me with a debug patch. This patch has now confirmed the loop in __d_lookup.

we're also seeing __d_lookup soft lockups. 4 are attached. one was a cached_lookup but the first line of that fn is a __d_lookup, which I suspect is where the real soft lockup occurred. after a while the node is toast and has to be rebooted.

kernel is 2.6.23.17 with patchless lustre 1.6.4.3, modified with patches from bz 14250 (attachment 14109) and 13378 (attachment 12276). using o2ib, x86_64, rhel5.1

cheers, robin

ps. I'm glad to see some Lustre support for modern linux kernels and would like to see a lot more! the VMs of distro kernels sometimes behave erratically enough that we can't use them on Lustre client nodes in production.

May 6 14:36:53 x6 kernel: BUG: soft lockup - CPU#2 stuck for 11s! [rsync:14200]
May 6 14:36:53 x6 kernel: CPU 2:
May 6 14:36:53 x6 kernel: Modules linked in: ext3 jbd loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery ac sg sd_mod i2c_i801 i2c_core ahci ata_piix libata ib_mthca scsi_mod i5000_edac ib_mad edac_core ib_core ehci_hcd shpchp uhci_hcd rng_core button nfs nfs_acl lockd sunrpc e1000
May 6 14:36:53 x6 kernel: Pid: 14200, comm: rsync Not tainted 2.6.23.17 #1
May 6 14:36:53 x6 kernel: RIP: 0010:[80282cef] [80282cef] __d_lookup+0xed/0x110
May 6 14:36:53 x6 kernel: RSP: 0018:81011b7ebbe8 EFLAGS: 0286
May 6 14:36:53 x6 kernel: RAX: 81011d71aea0 RBX: 81011d71aea0 RCX: 0014
May 6 14:36:53 x6 kernel: RDX: 000cb941 RSI: 81011b7ebcb8 RDI: 810120522cb8
May 6 14:36:53 x6 kernel: RBP: 8101d2707408 R08: 0007 R09: 0007
May 6 14:36:53 x6 kernel: R10: 7fffb750 R11: 802c7d3e R12: 8853dc60
May 6 14:36:53 x6 kernel: R13: 8100992a5470 R14: 81024adb3af8 R15: 810219defe80
May 6 14:36:53 x6 kernel: FS: 2b6296e0() GS:81025fc6d840() knlGS:
May 6 14:36:53 x6 kernel: CS: 0010 DS: ES: CR0: 8005003b
May 6 14:36:53 x6 kernel: CR2: 0079f018 CR3: 00018c1d CR4: 06e0
May 6 14:36:53 x6 kernel: DR0: DR1: DR2:
May 6 14:36:54 x6 kernel: DR3: DR6: 0ff0 DR7: 0400
May 6 14:36:54 x6 kernel:
May 6 14:36:54 x6 kernel: Call Trace:
May 6 14:36:54 x6 kernel: [80282cd5] __d_lookup+0xd3/0x110
May 6 14:36:54 x6 kernel: [80279b1e] do_lookup+0x2a/0x1ae
May 6 14:36:54 x6 kernel: [8027bc4d] __link_path_walk+0x924/0xde9
May 6 14:36:54 x6 kernel: [8027c16a] link_path_walk+0x58/0xe0
May 6 14:36:54 x6 kernel: [8027c536] do_path_lookup+0x1ab/0x1cf
May 6 14:36:54 x6 kernel: [8027afe4] getname+0x14c/0x191
May 6 14:36:54 x6 kernel: [8027cd67] __user_walk_fd+0x37/0x53
May 6 14:36:54 x6 kernel: [80275a7b] vfs_lstat_fd+0x18/0x47
May 6 14:36:54 x6 kernel: [80275c6d] sys_newlstat+0x19/0x31
May 6 14:36:54 x6 kernel: [8020b3ae] system_call+0x7e/0x83

May 8 09:32:36 x10 kernel: BUG: soft lockup - CPU#1 stuck for 11s! [bonnie++.mpi:31720]
May 8 09:32:36 x10 kernel: CPU 1:
May 8 09:32:36 x10 kernel: Modules linked in: loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery ac sg sd_mod i5000_edac edac_core ehci_hcd ahci rng_core ata_piix libata i2c_i801 scsi_mod i2c_core ib_mthca ib_mad shpchp uhci_hcd ib_core button nfs nfs_acl lockd sunrpc e1000
May 8 09:32:36 x10 kernel: Pid: 31720, comm: bonnie++.mpi Not tainted 2.6.23.17 #1
May 8 09:32:36 x10 kernel: RIP: 0010:[80282cef] [80282cef] __d_lookup+0xed/0x110
May 8 09:32:36 x10 kernel: RSP: 0018:81014248fd78 EFLAGS: 0286
May 8 09:32:36 x10 kernel: RAX: 8101a79cc500 RBX: 8101a79cc500 RCX: 0014
May 8 09:32:36 x10 kernel: RDX: 000ac16f RSI: 81014248feb8 RDI: 8101ff617898
May 8 09:32:36 x10 kernel: RBP: 8101ff617898 R08: R09: 81025fd3d5c0
May 8 09:32:37 x10 kernel: R10: 0001 R11: 802c7d3e R12: 8027c1e0
May 8 09:32:37 x10 kernel: R13: 81014248fea8 R14: R15: 81025fd3d5c0
May 8 09:32:37 x10 kernel: FS: 2c95a990() GS:81025fc6de40() knlGS:
May 8 09:32:37 x10 kernel: CS: 0010 DS: ES: CR0: 8005003b
May 8 09:32:37 x10 kernel: CR2:
Re: [Lustre-discuss] Looping in __d_lookup
On Wed, May 21, 2008 at 08:04:27PM -0600, Andreas Dilger wrote:
>On May 21, 2008 21:05 +0200, Jakob Goldbach wrote:
>>I'm running 1.6.4.3 patchless as well against a 2.6.18 vanilla kernel. Or at least that is what I thought. The OpenVZ patch effectively makes the kernel a 2.6.18++ kernel because they add features from newer kernels in their maintained 2.6.18 based kernel. So the lockup in __d_lookup may just relate to newer patchless clients. I got a debug patch from the OpenVZ community which indicates dcache chain corruption in a lustre code path.
>Do you have the fixes for the statahead patches, disable statahead via
>  echo 0 > /proc/fs/lustre/llite/*/statahead_max
>or can you try out the v1_6_5_RC4 tag from CVS (which also contains those patches)?

all of our __d_lookup soft lockups have been when running with 0 in /proc/fs/lustre/llite/*/statahead_max

I'll try v1_6_5_RC4 now - should be fun :-)

cheers, robin
Re: [Lustre-discuss] client randomly evicted
On Fri, May 02, 2008 at 03:16:31PM -0700, Andreas Dilger wrote:
>On Apr 30, 2008 11:40 -0400, Aaron Knister wrote:
>>Some more information that might be helpful. There is a particular code that one of our users runs. Personally after the trouble this code has caused us we'd like to hand him a calculator and disable his accounts but sadly that's not an option. Since the time of the hang, there is what seems to be one process associated with lustre that is running as the userid of the problem user - ll_sa_15530. A trace of this process in its current state shows this -
>>Is this a problem with the lustre readahead code? If so would this fix it?
>>  echo 0 > /proc/fs/lustre/llite/*/statahead_count
>Yes, this appears to be a statahead problem. There were fixes added to 1.6.5 that should resolve the problems seen with statahead. In the meantime I'd recommend disabling it as you suggest above.

we're seeing the same problem. I think the workaround should be:
  echo 0 > /proc/fs/lustre/llite/*/statahead_max
?? /proc/fs/lustre/llite/*/statahead_count is -r--r--r--

cheers, robin

ps. sorry I've been too busy this week to look at the llite_lloop stuff.
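Applying the statahead_max workaround on a client boils down to the loop below (and across a cluster it could be pushed out with pdsh or similar). The helper name is made up, and the proc root is a parameter only so the snippet can be tried against a fake directory tree instead of a live client:

```shell
# write 0 into statahead_max for every mounted Lustre filesystem on this
# client, disabling statahead. run as root; skips unwritable entries.
disable_statahead() {
    procroot=${1:-/proc/fs/lustre/llite}
    for f in "$procroot"/*/statahead_max; do
        [ -w "$f" ] || continue
        echo 0 > "$f"
    done
}
# as root on a client: disable_statahead
```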
Re: [Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)
On Fri, May 02, 2008 at 11:25:03AM -0700, Andreas Dilger wrote:
>On Apr 29, 2008 09:05 -0700, Kilian CAVALOTTI wrote:
>>/scratch # swapon -a ./swapfile
>>swapon: ./swapfile: Invalid argument
>Note that in 1.6.4+ there is an interface to export a block device more directly from Lustre instead of using the loopback driver on top of the client filesystem. This is the llite_loop module and is configured like:
>  lctl blockdev_attach {loopback_filename} {blockdev_filename}
>where {loopback_filename} is the file that should be turned into a block device (it can be sparse if desired) and {blockdev_filename} is the full filename that the new block device should be created at.

cool! I was wondering what that module was for.

I'm trying to use it like:
  lctl blockdev_attach /dev/lloop0 /some/file/on/lustre
but then dd to /dev/lloop0 seems to go at ~kB/s whereas dd to the file goes at >= 100MB/s.
am I doing something wrong?

cheers, robin

>To clean up the device use:
>  lctl blockdev_detach {blockdev_filename}
>Note that using this block device for swap hasn't been very successful in our testing so far, but we also haven't done a great deal of real world testing - only allocate a ton of RAM and dirty it all as fast as possible, which isn't a very realistic usage. Feedback would be welcome.
>
>Cheers, Andreas
>--
>Andreas Dilger
>Sr. Staff Engineer, Lustre Group
>Sun Microsystems of Canada, Inc.
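Putting Andreas's description together, the ordering for swap on a Lustre-backed device would be attach, then mkswap, then swapon, with detach after swapoff. The helper below only prints the command sequence rather than running it - nothing here has been tested against a real llite_loop setup, and the helper name and example paths are hypothetical:

```shell
# print (not run) the command sequence for swap on a Lustre-backed block
# device, per the lctl blockdev_attach/blockdev_detach syntax quoted above.
lloop_swap_cmds() {
    lustre_file=$1   # file on Lustre to back the device (may be sparse)
    blockdev=$2      # block device node to create, e.g. /dev/lloop0
    echo "lctl blockdev_attach $lustre_file $blockdev"
    echo "mkswap $blockdev"
    echo "swapon $blockdev"
    echo "# later: swapoff $blockdev && lctl blockdev_detach $blockdev"
}
```

note the argument order: the Lustre file comes first, the device node second, per the `{loopback_filename} {blockdev_filename}` syntax above.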