Re: [lustre-discuss] Migrating files doesn't free space on the OST
On Wed, Jan 16, 2019 at 04:25:25PM +, Jason Williams wrote:
>I am trying to migrate files I know are not in use off of the full OST that I have using lfs migrate. I have verified up and down that the files I am moving are on that OST and that after the migrate lfs getstripe indeed shows they are no longer on that OST since it's disabled in the MDS.
>
>The problem is, the used space on the OST is not going down.
>
>I see one of at least two issues:
>
>- the OST is just not freeing the space for some reason or another (I don't know)

if you are using an older Lustre version (eg. IEEL) then you may have to re-enable the OST on the MDS to allow deletes to occur on the OST. then check no new files went there while it was enabled, and possibly loop and repeat. the newer ways of disabling file creation on OSTs in recent Lustre versions don't have this problem.

>- Or someone is writing to existing files just as fast as I am clearing the data (possible, but kind of hard to find)
>
>Is there possibly something else I am missing? Also, does anyone know a good way to see if some client is writing to that OST and determine who it is if it's more probable that that is what is going on?

perhaps check 'lsof' on every client. if a client has a file open then it can't be deleted.

cheers,
robin
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
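the enable/check/repeat loop described above might look something like this on the MDS. this is a hedged sketch: the device index (11) and target name (testfs-OST0007) are placeholders for your own, and `max_create_count=0` is the newer mechanism that stops object creation without blocking destroys.

```shell
# On the MDS. Device index and fsname are hypothetical; adjust to your site.
# Older releases: a deactivated OSC on the MDS also blocks object destroys,
# so briefly re-activate it to let pending unlinks drain:
lctl dl | grep osc            # find the OSC device index for the full OST
lctl --device 11 activate     # allow deletes to reach the OST again
sleep 60                      # let destroys drain
lctl --device 11 deactivate   # stop new files landing there
# then re-check with 'lfs getstripe' that nothing new arrived, and repeat

# Newer releases: stop creates only; deletes keep working throughout:
lctl set_param osp.testfs-OST0007-osc-MDT0000.max_create_count=0
```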
Re: [lustre-discuss] Lustre Sizing
On Tue, Jan 01, 2019 at 01:05:22PM +0530, ANS wrote:
>So what could be the reason for this variation of the size.

with our ZFS 0.7.9 + Lustre 2.10.6 the "lfs df" numbers seem to be the same as those from "zfs list" (not "zpool list"). so I think your question is more about ZFS than Lustre.

the number of devices in each ZFS vdev, raid level, what size files you write, ashift, recordsize, ... all will affect the total space available. see attached for an example. ZFS's space estimates are also pessimistic as it doesn't know what size files are going to be written.

if you want more accurate numbers then perhaps create a small but realistic zpool and zfs filesystem (using say, 1G files as devices) and then fill it up with files representative of your workload and see how many fit on. I just filled them up with large dd's to make the above graph, so YMMV.

cheers,
robin
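the file-backed test pool suggested above can be built roughly like this (a sketch, not a recipe: needs root and ZFS installed, and the pool name, geometry, ashift and recordsize are placeholders you should match to your real OSTs):

```shell
# Build a throwaway raidz2 pool out of 1G sparse files and see how much
# usable space ZFS reports for this geometry.
for i in $(seq 0 9); do truncate -s 1G /var/tmp/vdev$i; done
zpool create -o ashift=12 testpool raidz2 /var/tmp/vdev?
zfs set recordsize=1M testpool
zfs list testpool                 # the "lfs df"-style numbers come from here

# fill with files representative of your workload, then compare USED/AVAIL:
dd if=/dev/zero of=/testpool/bigfile bs=1M count=512
zfs list testpool

# clean up
zpool destroy testpool && rm -f /var/tmp/vdev?
```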
Re: [lustre-discuss] Lustre traffic slow on OPA fabric network
Hi Kurt,

On Thu, Jul 12, 2018 at 02:36:49PM -0400, Kurt Strosahl wrote:
> That's really helpful. The version on the servers is IEEL 2.5.42, while the routers and OPA nodes are all running 2.10.4... We've been looking at upgrading our old system to 2.10 or 2.11.

just an update on this. we moved our old 2.5 IEEL lustre to 2.10.4 (still rhel6.x) but sadly it didn't solve our lnet routing problem. sorry for the bad advice.

> I checked the opa clients and the lnet routers, they all use the same parameters that you do except for the map_on_demand (which our system defaults to 256).

we eventually realised that with the "new" ways of setting ko2iblnd and lnet options we could configure each card (qib/mlnx, opa) separately and have them "optimal", but it still doesn't work without errors so far. haven't 100% ruled out shonky FINSTAR opa optical cables yet, but it seems quite unlikely.

did you make any progress?

cheers,
robin

>>w/r,
>>Kurt
>>
>>- Original Message -
>>From: "Robin Humble"
>>To: "Kurt Strosahl"
>>Cc: lustre-discuss@lists.lustre.org
>>Sent: Tuesday, July 10, 2018 5:03:30 AM
>>Subject: Re: [lustre-discuss] Lustre traffic slow on OPA fabric network
>>
>>Hi Kurt,
>>
>>On Tue, Jul 03, 2018 at 02:59:22PM -0400, Kurt Strosahl wrote:
>>> I've been seeing a great deal of slowness from clients on an OPA network accessing lustre through lnet routers. The nodes take very long to complete things like lfs df, and show lots of dropped / reestablished connections. The OSS systems show this as well, and occasionally will report that all routes are down to a host on the omnipath fabric. They also show large numbers of bulk callback errors. The lnet routers show large numbers of PUT_NACK messages, as well as Abort reconnection messages for nodes on the OPA fabric.
>>
>>I don't suppose you're talking to a super-old Lustre version via the lnet routers?
>>
>>we see excellent performance OPA to IB via lnet routers with 2.10.x clients and 2.9 servers, but when we try to talk to IEEL 2.5.41 servers then we see pretty much exactly the symptoms you describe.
>>
>>strangely direct mounts of old lustre on new clients on IB work ok, but not via lnet routers to OPA. old lustre to new clients on tcp networks are ok. lnet self tests OPA to IB also work fine, it's just when we do the actual mounts... anyway, we are going to try and resolve the problem by updating the IEEL to 2.9 or 2.10
>>
>>hmm, now that I think of it, we did have to tweak the ko2iblnd options a lot on the lnet router to get it this stable. I forget the symptoms we were seeing though, sorry. we found the minimum common denominator settings between the IB network and the OPA, and tuned ko2iblnd on the lnet routers down to that. if it finds one OPA card then Lustre imposes an aggressive OPA config on all IB networks which made our mlx4 cards on a ipath/qib fabric unhappy.
>>
>>FWIW, for our hardware combo, ko2iblnd options are
>>
>>  options ko2iblnd-opa peer_credits=8 peer_credits_hiw=0 credits=256 concurrent_sends=0 ntx=512 map_on_demand=0 fmr_pool_size=512 fmr_flush_trigger=384 fmr_cache=1 conns_per_peer=1
>>
>>I don't know what most of these do, so please take with a grain of salt.
>>
>>cheers,
>>robin
Re: [lustre-discuss] Lustre traffic slow on OPA fabric network
Hi Kurt,

On Tue, Jul 03, 2018 at 02:59:22PM -0400, Kurt Strosahl wrote:
> I've been seeing a great deal of slowness from clients on an OPA network accessing lustre through lnet routers. The nodes take very long to complete things like lfs df, and show lots of dropped / reestablished connections. The OSS systems show this as well, and occasionally will report that all routes are down to a host on the omnipath fabric. They also show large numbers of bulk callback errors. The lnet routers show large numbers of PUT_NACK messages, as well as Abort reconnection messages for nodes on the OPA fabric.

I don't suppose you're talking to a super-old Lustre version via the lnet routers?

we see excellent performance OPA to IB via lnet routers with 2.10.x clients and 2.9 servers, but when we try to talk to IEEL 2.5.41 servers then we see pretty much exactly the symptoms you describe.

strangely direct mounts of old lustre on new clients on IB work ok, but not via lnet routers to OPA. old lustre to new clients on tcp networks are ok. lnet self tests OPA to IB also work fine, it's just when we do the actual mounts... anyway, we are going to try and resolve the problem by updating the IEEL to 2.9 or 2.10

hmm, now that I think of it, we did have to tweak the ko2iblnd options a lot on the lnet router to get it this stable. I forget the symptoms we were seeing though, sorry. we found the minimum common denominator settings between the IB network and the OPA, and tuned ko2iblnd on the lnet routers down to that. if it finds one OPA card then Lustre imposes an aggressive OPA config on all IB networks which made our mlx4 cards on a ipath/qib fabric unhappy.

FWIW, for our hardware combo, ko2iblnd options are

  options ko2iblnd-opa peer_credits=8 peer_credits_hiw=0 credits=256 concurrent_sends=0 ntx=512 map_on_demand=0 fmr_pool_size=512 fmr_flush_trigger=384 fmr_cache=1 conns_per_peer=1

I don't know what most of these do, so please take with a grain of salt.

cheers,
robin
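for reference, options like those quoted above normally live in a modprobe.d fragment on the lnet routers; the `ko2iblnd-opa` alias is the one Lustre's module probe helper applies to OPA ports, while plain `ko2iblnd` options cover the IB side. a sketch only — the file name is arbitrary and the values are the ones from the mail, which were tuned for that site's hardware:

```shell
# /etc/modprobe.d/ko2iblnd.conf (hypothetical path) on the lnet routers
options ko2iblnd-opa peer_credits=8 peer_credits_hiw=0 credits=256 \
    concurrent_sends=0 ntx=512 map_on_demand=0 fmr_pool_size=512 \
    fmr_flush_trigger=384 fmr_cache=1 conns_per_peer=1
```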
Re: [lustre-discuss] flock vs localflock
Hi Darby,

On Thu, Jul 05, 2018 at 09:26:36PM +, Vicker, Darby (JSC-EG311) wrote:
>Also, the ldlm processes lead us to looking at flock vs localflock. On previous generations of our LFS's, we used localflock. But on the current LFS, we decided to try flock instead. This LFS has been in production for a couple years with no obvious problems due to flock but we decided to drop back to localflock as a precaution for now. We need to do a more controlled test but this does seem to help. What are other sites using for locking parameters?

we use flock for /home and the large scratch filesystem. have done for probably 10 years. localflock for the read-only software installs in /apps, and no locking for the OS image (overlayfs with ramdisk upper, read-only Lustre lower). we are all ZFS and 2.10.4 too.

I don't think we have much in the way of flock user codes, so I can't actually recall any issues along those lines.

the most common MDS abusing load we see is jobs across multiple nodes appending to the same (by definition rubbish) output file. the write lock bounces between nodes and causes high MDS load, poor performance for those client nodes, and makes things a bit slower for everyone. I look for these simply with 'lsof' and correlate across nodes.

HTH

cheers,
robin
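for context, the kind of user code that cares about the flock/localflock choice just takes advisory locks around writes. with "-o localflock" the lock below only excludes other processes on the same client node; with "-o flock" it is coherent cluster-wide (at some DLM cost). a minimal sketch, demonstrated here on a local temporary file rather than Lustre:

```python
import fcntl
import tempfile

# Take an exclusive advisory lock around an append, as an flock-using
# application would on a shared file.
tmp = tempfile.NamedTemporaryFile(mode="w+", delete=False)
fcntl.flock(tmp.fileno(), fcntl.LOCK_EX)   # blocks until we hold the lock
tmp.write("one writer at a time\n")
tmp.flush()
fcntl.flock(tmp.fileno(), fcntl.LOCK_UN)   # release before closing
tmp.close()
```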
Re: [lustre-discuss] dealing with maybe dead OST
Hi Andreas,

On Wed, Jun 20, 2018 at 05:39:33PM +, Andreas Dilger wrote:
>On Jun 19, 2018, at 09:33, Robin Humble wrote:
>> is there a way to mv files when their OST is unreachable?
>> ...
>> the only thing I've thought of seems pretty out there... mount the MDT as ldiskfs and mv the affected files into the shadow tree at the ldiskfs level. ie. with lustre running and mounted, create an empty shadow tree of all dirs under eg. /lustre/shadow/, and then at the ldiskfs level on the MDT:
>>   for f in ; do
>>     mv /mnt/mdt0/ROOT/$f /mnt/mdt0/ROOT/shadow/$f
>>   done
>>
>> would that work?
>
>This would work to some degree, but the "link" xattr on each file would not be updated, so "lfs fid2path" would be broken until a full LFSCK is run.

although as you say, it turns out the rename() approach at the client level will work fine, it's still good to know that Lustre is flexible and robust enough for some crazy stuff to work if it had to :)

>> alternatively, should we just unlink all the currently dead files from lustre now, and then if the OST comes back can we reconstruct the paths and filenames from the FID in xattrs on the revived OST? I suspect unlink is final though and this wouldn't work... ?
>
>That would be possible, but overly complex, since the inodes would be removed from the MDT and you'd need to reconstruct them with LFSCK and find the names, as LFSCK would dump them all into $MNT/.lustre/lost+found.
>
>> we can also take an lvm snapshot of the MDT and refer to that later I suppose, but I'm not sure how that might help us.
>
>It should be possible to copy the unlinked files from the backup MDT to the current MDT (via ldiskfs), along with an LFSCK run to rebuild the OI files. It is always a good idea to have an MDT device-level backup before you do anything drastic like this. However, for the meantime I think that renaming the broken files to a root-only directory is the safest.

thanks (as always) for all the detailed explanations. much appreciated.

cheers,
robin
Re: [lustre-discuss] lctl ping node28@o2ib report Input/output error
On Tue, Jun 26, 2018 at 04:05:14PM +0800, yu sun wrote:
>hi all:
> I want to build a lustre storage system, and I found not all of the machines are in the same sub-network, and they can't lctl ping each other. the details are listed below:
>
>root@ml-storage-ser30.nmg01:~$ lctl list_nids
>10.82.145.2@o2ib
>root@ml-storage-ser30.nmg01:~$ lctl ping node28@o2ib
>failed to ping 10.82.143.202@o2ib: Input/output error
>root@ml-storage-ser30.nmg01:~$

what does 'lctl list_nids' say on node28? also disable iptables everywhere.

cheers,
robin
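the checks suggested above amount to verifying the NIDs and connectivity from both ends. a hedged sketch (the NID shown is the one from the mail; substitute your own, and treat the firewall commands as a temporary test only):

```shell
# Run on BOTH nodes: each side must report the NID the other expects,
# and must be able to ping back the other direction.
lctl list_nids                 # what this node thinks its NIDs are
lctl ping 10.82.145.2@o2ib     # e.g. from node28 back to ser30

# rule out packet filtering while testing (re-enable afterwards):
systemctl stop firewalld       # or flush iptables rules on older systems
```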
Re: [lustre-discuss] dealing with maybe dead OST
On Wed, Jun 20, 2018 at 10:20:09AM -0400, Robin Humble wrote:
>On Tue, Jun 19, 2018 at 08:54:53PM +, Cowe, Malcolm J wrote:
>>Would using hard links work, instead of mv?

ah. success! looks like it's just that gnu 'mv' and 'ln' are way too smart for their own good.

you got me thinking... what are 'mv' and 'ln' doing lstat() for anyway? so I wrote a few lines of C and stdio's rename() "just works" on the client, even when the OST is disabled (as it damn well should). too easy...

happily python's os.rename() works too ('cos I am lazy)

whoo! no need to mess with the MDT. that's a relief. thanks :)

cheers,
robin
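the few lines referred to above amount to calling rename(2) directly, skipping the lstat() of the source that makes GNU 'mv' fail against a dead OST. a sketch of the python version, demonstrated here on a local filesystem (on Lustre the same call is a pure MDS operation, so it succeeds even when the file's OST is unreachable):

```python
import os
import tempfile

# Move a file into a "shadow" directory with a bare rename(2) --
# no lstat() of the source first, unlike GNU mv.
root = tempfile.mkdtemp()
shadow = os.path.join(root, "shadow")
os.mkdir(shadow)

victim = os.path.join(root, "some_file")
open(victim, "w").close()

os.rename(victim, os.path.join(shadow, "some_file"))
```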
Re: [lustre-discuss] dealing with maybe dead OST
Hi Malcolm,

thanks for replying.

On Tue, Jun 19, 2018 at 08:54:53PM +, Cowe, Malcolm J wrote:
>Would using hard links work, instead of mv?

hmm, interesting idea, but no:

  # ln some_file /lustre/shadow/some_file
  ln: failed to access 'some_file': Cannot send after transport endpoint shutdown

ln is trying to lstat() which fails. I think almost all client operations are going to fail with a deactivated/down OST. things like 'lfs getstripe' (pure MDS ops) work ok.

or did you mean doing hard links on the MDT? unless there's a purely MDS lustre tool to do a mv/rename operation on the MDT, then I think the only option is to mess around with the low level stuff on the MDT when it's mounted as ldiskfs and hope I don't break too much...

there used to be a 'lfs mv' (now 'lfs migrate') but that isn't quite the mv operation I'm after.

any advice or war stories (especially "this is a waste of your time - it will never work because of X,Y,Z") would be much appreciated :)

time to read more of the lustre manual now...

cheers,
robin

>Malcolm.
>
>On 20/6/18, 1:34 am, "lustre-discuss on behalf of Robin Humble" rjh+lus...@cita.utoronto.ca wrote:
>
>Hi,
>
>so we've maybe lost 1 OST out of a filesystem with 115 OSTs. we may still be able to get the OST back, but it's been a month now so there's pressure to get the cluster back and working and leave the files missing for now...
>
>the complication is that because the OST might come back to life we would like to avoid the users rm'ing their broken files and potentially deleting them forever.
>
>lustre is 2.5.41 ldiskfs centos6.x x86_64.
>
>ideally I think we'd move all the ~2M files on the OST to a root access only "shadow" directory tree in lustre that's populated purely with files from the dead OST. if we manage to revive the OST then these can magically come back to life and we can mv them back into their original locations.
>
>but currently
>  mv: cannot stat 'some_file': Cannot send after transport endpoint shutdown
>the OST is deactivated on the client. the client hangs if the OST isn't deactivated. the OST is still UP & activated on the MDS.
>
>is there a way to mv files when their OST is unreachable?
>
>seems like mv is an MDT operation so it should be possible somehow?
>
>the only thing I've thought of seems pretty out there... mount the MDT as ldiskfs and mv the affected files into the shadow tree at the ldiskfs level. ie. with lustre running and mounted, create an empty shadow tree of all dirs under eg. /lustre/shadow/, and then at the ldiskfs level on the MDT:
>  for f in ; do
>    mv /mnt/mdt0/ROOT/$f /mnt/mdt0/ROOT/shadow/$f
>  done
>
>would that work?
>maybe we'd also have to rebuild OI's and lfsck - something along the lines of the MDT restore procedure in the manual. hopefully that would all work with an OST deactivated.
>
>alternatively, should we just unlink all the currently dead files from lustre now, and then if the OST comes back can we reconstruct the paths and filenames from the FID in xattrs on the revived OST? I suspect unlink is final though and this wouldn't work... ?
>
>we can also take an lvm snapshot of the MDT and refer to that later I suppose, but I'm not sure how that might help us.
>
>as you can probably tell I haven't had to deal with this particular situation before :)
>
>thanks for any help.
>
>cheers,
>robin
[lustre-discuss] dealing with maybe dead OST
Hi,

so we've maybe lost 1 OST out of a filesystem with 115 OSTs. we may still be able to get the OST back, but it's been a month now so there's pressure to get the cluster back and working and leave the files missing for now...

the complication is that because the OST might come back to life we would like to avoid the users rm'ing their broken files and potentially deleting them forever.

lustre is 2.5.41 ldiskfs centos6.x x86_64.

ideally I think we'd move all the ~2M files on the OST to a root access only "shadow" directory tree in lustre that's populated purely with files from the dead OST. if we manage to revive the OST then these can magically come back to life and we can mv them back into their original locations.

but currently

  mv: cannot stat 'some_file': Cannot send after transport endpoint shutdown

the OST is deactivated on the client. the client hangs if the OST isn't deactivated. the OST is still UP & activated on the MDS.

is there a way to mv files when their OST is unreachable?

seems like mv is an MDT operation so it should be possible somehow?

the only thing I've thought of seems pretty out there... mount the MDT as ldiskfs and mv the affected files into the shadow tree at the ldiskfs level. ie. with lustre running and mounted, create an empty shadow tree of all dirs under eg. /lustre/shadow/, and then at the ldiskfs level on the MDT:

  for f in ; do
    mv /mnt/mdt0/ROOT/$f /mnt/mdt0/ROOT/shadow/$f
  done

would that work?
maybe we'd also have to rebuild OI's and lfsck - something along the lines of the MDT restore procedure in the manual. hopefully that would all work with an OST deactivated.

alternatively, should we just unlink all the currently dead files from lustre now, and then if the OST comes back can we reconstruct the paths and filenames from the FID in xattrs on the revived OST? I suspect unlink is final though and this wouldn't work... ?

we can also take an lvm snapshot of the MDT and refer to that later I suppose, but I'm not sure how that might help us.

as you can probably tell I haven't had to deal with this particular situation before :)

thanks for any help.

cheers,
robin
Re: [lustre-discuss] High MDS load, but no activity
Hi Kevin,

On Thu, Jul 27, 2017 at 08:18:04AM -0400, Kevin M. Hildebrand wrote:
>We recently updated to Lustre 2.8 on our cluster, and have started seeing some unusual load issues. Last night our MDS load climbed to well over 100, and client performance dropped to almost zero. Initially this appeared to be related to a number of jobs that were doing large numbers of opens/closes, but even after killing those jobs, the MDS load did not recover.
>
>Looking at stats in /proc/fs/lustre/mdt/scratch-MDT/exports showed little to no activity on the MDS. Looking at iostat showed almost no disk activity to the MDT (or to any device, for that matter), and minimal IO wait. Memory usage (the machine has 128GB) showed over half of that memory free.

sounds like VM spinning to me. check /proc/zoneinfo, /proc/vmstat etc.

do you have zone_reclaim_mode=0? that's an oldie, but important to have set to zero.

  sysctl vm.zone_reclaim_mode

failing that (and assuming you have a 2 or more numa zone server) I would guess it's all the zone affinity stuff in lustre these days. you can turn most of it off with a modprobe option

  options libcfs cpu_npartitions=1

what happens by default is that a bunch of lustre threads are bound to numa zones and preferentially and aggressively allocate kernel ram in those zones. in practice this usually means that the zone where the IB card is physically attached fills up, and then the machine is (essentially) out of ram and spinning hard trying to reclaim, even though the ram in the other zone(s) is almost all unused.

I tried to talk folks out of having affinity on by default in https://jira.hpdd.intel.com/browse/LU-5050 but didn't succeed. even if it wasn't unstable to have affinity on, IMHO having 2x the ram available for caching on the MDS and OSS's is #1, and tiny performance increases from having that ram next to the IB card are a distant #2.

cheers,
robin

>I eventually ended up unmounting the MDT and failing it over to a backup MDS, which promptly recovered and now has a load of near zero.
>
>Has anyone seen this before? Any suggestions for what I should look at if this happens again?
>
>Thanks!
>Kevin
>
>--
>Kevin Hildebrand
>University of Maryland, College Park
>Division of IT
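the VM checks suggested above can be done quickly from the shell (standard Linux procfs paths; what counts as "spinning" is a judgment call, but one nearly-empty zone while others sit mostly free is the classic signature):

```shell
sysctl vm.zone_reclaim_mode            # should print 0

# per-NUMA-zone free pages: look for one starved zone next to idle ones
grep -E '^Node|nr_free_pages' /proc/zoneinfo

# reclaim/scan activity over time: rapidly climbing counters while the
# workload is idle points at the kernel fighting itself
grep -E 'pgscan|pgsteal' /proc/vmstat
```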
Re: [lustre-discuss] seclabel
On Tue, May 23, 2017 at 08:08:54PM +, Dilger, Andreas wrote:
>On May 19, 2017, at 08:47, Robin Humble <rjh+lus...@cita.utoronto.ca> wrote:
>> On Wed, May 17, 2017 at 02:37:31PM +, Sebastien Buisson wrote:
>>> On 17 May 2017 at 16:16, Robin Humble <rjh+lus...@cita.utoronto.ca> wrote:
>>>> I took a gander at the source and noticed that llite/xattr.c deliberately filters out 'security.capability' and returns 0/-ENODATA for setcap/getcap, which is indeed what strace sees. so setcap/getcap is never even sent to the MDS.
>>>>
>>>> if I remove that filter (see patch on lustre-devel) then setcap/getcap works ->
>> ...
>>>> 'b15587' is listed as the reason for the filtering. I don't know what that refers to. is it still relevant?
>>>
>>> b15587 refers to the old Lustre Bugzilla tracking tool: https://projectlava.xyratex.com/show_bug.cgi?id=15587
>>>
>>> Reading the discussion in the ticket, supporting xattr at the time of Lustre 1.8 and 2.0 was causing issues on the MDS side in some situations. So it was decided to discard the security.capability xattr on the Lustre client side. I think Andreas might have some insight, as he apparently participated in b15587.
>>
>> my word that's a long time ago... I don't see much in the way of jira tickets about getxattr issues on the MDS in recent times, and they're much more heavily used these days, so I hope that particular problem has long since been fixed.
>>
>> should I open a jira ticket to track re-enabling of security.capabilities?

LU-9562

thanks for everyone's help!

>I don't recall the details of b=15587 off the top of my head, but the high-level issue is that the security labels added a significant performance overhead, since they were retrieved on every file access, but not cached on the client, even if most systems never used them.
>
>Seagate implemented the client-side xattr cache for Lustre 2.5, so this should work a lot better these days. I'm not 100% positive if we also cache negative xattr lookups or not, so this would need some testing/tracing to see if it generates a large number of RPCs.

fair enough.

cheers,
robin
Re: [lustre-discuss] seclabel
Hi Sebastien,

On Wed, May 17, 2017 at 02:37:31PM +, Sebastien Buisson wrote:
> On 17 May 2017 at 16:16, Robin Humble <rjh+lus...@cita.utoronto.ca> wrote:
>> I took a gander at the source and noticed that llite/xattr.c deliberately filters out 'security.capability' and returns 0/-ENODATA for setcap/getcap, which is indeed what strace sees. so setcap/getcap is never even sent to the MDS.
>>
>> if I remove that filter (see patch on lustre-devel) then setcap/getcap works ->
>> ...
>> 'b15587' is listed as the reason for the filtering. I don't know what that refers to. is it still relevant?
>
>b15587 refers to the old Lustre Bugzilla tracking tool: https://projectlava.xyratex.com/show_bug.cgi?id=15587
>
>Reading the discussion in the ticket, supporting xattr at the time of Lustre 1.8 and 2.0 was causing issues on the MDS side in some situations. So it was decided to discard the security.capability xattr on the Lustre client side. I think Andreas might have some insight, as he apparently participated in b15587.

my word that's a long time ago... I don't see much in the way of jira tickets about getxattr issues on the MDS in recent times, and they're much more heavily used these days, so I hope that particular problem has long since been fixed.

should I open a jira ticket to track re-enabling of security.capabilities?

>In any case, it is important to make clear that file capabilities, the feature you want to use, are completely distinct from SELinux. On the one hand, Capabilities are a Linux mechanism to refine permissions granted to privileged processes, by dividing the privileges traditionally associated with superuser into distinct units (known as capabilities). On the other hand, SELinux is the Linux implementation of Mandatory Access Control. Both Capabilities and SELinux rely on values stored in file extended attributes, but this is the only thing they have in common.

10-4. thanks.

'ls --color' requests the security.capability xattr so this would be heavily accessed. do you think this is handled well enough currently to not affect performance significantly? setxattr would be minimal and not performance critical, unlike with eg. selinux and creat.

cheers,
robin
Re: [lustre-discuss] seclabel
I setup a couple of VMs with 2.9 clients and servers (ldiskfs) and unfortunately setcap/getcap are still unhappy - same as with my previous 2.9 clients with 2.8 servers (ZFS). hmm.

I took a gander at the source and noticed that llite/xattr.c deliberately filters out 'security.capability' and returns 0/-ENODATA for setcap/getcap, which is indeed what strace sees. so setcap/getcap is never even sent to the MDS.

if I remove that filter (see patch on lustre-devel) then setcap/getcap works ->

  # df .
  Filesystem              1K-blocks  Used Available Use% Mounted on
  10.122.1.5@tcp:/test8     4797904 33992   4491480   1% /mnt/test8
  # touch blah
  # setcap cap_net_admin,cap_net_raw+p blah
  # getcap blah
  blah = cap_net_admin,cap_net_raw+p

and I also tested that the 'ping' binary run as an unprivileged user works from lustre. success!

'b15587' is listed as the reason for the filtering. I don't know what that refers to. is it still relevant?

cheers,
robin
Re: [lustre-discuss] seclabel
Hi Eli et al,

>> On 15 May 2017 at 14:39, E.S. Rosenberg wrote:
>> Hi Robin,
>> Did you ever solve this?
>> We are considering trying root-on-lustre but that would be a deal-breaker.

no. instead I started down the track of layering overlayfs on top of lustre. tmpfs (used by overlayfs's upper layer) has a working seclabel mount option. so I just 'copy up' the 3 or 4 exe's that have seclabels, 'setcap' them with the right label, and they work fine. I'm not sure overlayfs is going to work out though, so I'd really like seclabel in lustre.

On Tue, May 16, 2017 at 08:17:48AM +, Sebastien Buisson wrote:
>From Lustre 2.8, we have basic support of SELinux on the Lustre client side. It means Lustre stores the security context of files in extended attributes. In this way Lustre supports seclabel. In Lustre 2.9, an additional enhancement for SELinux support was landed.
>
>Which version are you using?

2.9 clients, 2.8 servers on ZFS. centos7 x86_64 everywhere. sestatus disabled everywhere. zfs has xattr=sa on osts, mdt, mgs

Andreas wrote (a while ago):
>> I try to stay away from that myself, but newer Lustre clients support SELinux and similar things. You probably need to strace and/or collect some kernel debug logs (maybe with debug=-1 set) to see where the error is being generated.

a debug=-1 trace is here -> https://rjh.org/~rjh/lustre/dk.log.-1.txt.gz

command line was ->

  lctl set_param debug=-1 ; usleep 5; lctl clear; usleep 5 ; /usr/sbin/setcap cap_net_admin,cap_net_raw+p /mnt/oneSIS-overlay/lowerdir/usr/bin/ping ; /usr/sbin/getcap /mnt/oneSIS-overlay/lowerdir/usr/bin/ping ; lctl dk /lfs/data0/system/log/dk.log.-1 ; lctl set_param debug='ioctl neterror warning error emerg ha config console lfsck'

/mnt/oneSIS-overlay/lowerdir is the lustre root filesystem image (usually mounted read-only, but read-write for this debugging)

expected output is nothing for setcap. expected output for getcap is

  # getcap /mnt/oneSIS-overlay/lowerdir/usr/bin/ping
  /mnt/oneSIS-overlay/lowerdir/usr/bin/ping = cap_net_admin,cap_net_raw+p

but actual output is nothing ->

  # getcap /mnt/oneSIS-overlay/lowerdir/usr/bin/ping
  #

for the copy of 'ping' on the tmpfs/overlayfs getcap/setcap works fine ->

  # getcap /usr/bin/ping
  /usr/bin/ping = cap_net_admin,cap_net_raw+p

cheers,
robin
[lustre-discuss] seclabel
Hiya,

I'm updating an image for a root-on-lustre cluster from centos6 to 7 and I've hit a little snag. I can't seem to mount lustre so that it understands seclabel. ie. setcap/getcap don't work. the upshot is that root can use ping (and a few other tools), but users can't.

any idea what I'm doing wrong? from what little I understand about it I think seclabel is a form of xattr.

cheers,
robin
Re: [lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)
Hi Mark,

On Tue, Apr 12, 2016 at 04:49:10PM -0400, Mark Hahn wrote:
>One of our MDSs is crashing with the following:
>
>BUG: unable to handle kernel paging request at deadbeef
>IP: [] iam_container_init+0x18/0x70 [osd_ldiskfs]
>PGD 0
>Oops: 0002 [#1] SMP
>
>The MDS is running 2.5.3-RC1--PRISTINE-2.6.32-431.23.3.el6_lustre.x86_64 with about 2k clients ranging from 1.8.8 to 2.6.0

I saw an identical crash in Sep 2014 when the MDS was put under memory pressure.

>to be related to vm.zone_reclaim_mode=1. We also enabled quotas

zone_reclaim_mode should always be 0. 1 is broken. hung processes perpetually 'scanning' in one zone in /proc/zoneinfo whilst plenty of pages are free in another zone is a sure sign of this issue.

however if you have vm.zone_reclaim_mode=0 now and are still seeing the issue, then I would suspect that lustre's overly aggressive memory affinity code is partially to blame. at the very least it is most likely stopping you from making use of half your MDS ram. see https://jira.hpdd.intel.com/browse/LU-5050

set

  options libcfs cpu_npartitions=1

to fix it. that's what I use on OSS and MDS nodes for all my clusters.

cheers,
robin
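to make both of the fixes above persistent across reboots, drop them into config fragments along these lines (the file names are conventional choices, not mandated):

```shell
# /etc/sysctl.d/99-lustre.conf -- never let the kernel spin in one zone
vm.zone_reclaim_mode = 0

# /etc/modprobe.d/libcfs.conf -- one CPU partition, i.e. no per-NUMA-zone
# thread/memory affinity in Lustre; must be set before modules load
options libcfs cpu_npartitions=1
```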
Re: [Lustre-discuss] ll_ost thread soft lockup
On Mon, Mar 19, 2012 at 07:28:22AM -0600, Kevin Van Maren wrote: You are running 1.8.5, which does not have the fix for the known MD raid5/6 rebuild corruption bug. That fix was released in the Oracle Lustre 1.8.7 kernel patches. Unless you already applied that patch, you might want to run a check of your raid arrays and consider an upgrade (at least patch your kernel with that fix). md-avoid-corrupted-ldiskfs-after-rebuild.patch in the 2.6-rhel5.series (note that this bug is NOT specific to rhel5). This fix does NOT appear to have been picked up by whamcloud. as you say, the md rebuild bug is in all kernels 2.6.32 http://marc.info/?l=linux-raid&m=130192650924540&w=2 the Whamcloud fix is LU-824 which landed in git a tad after 1.8.7-wc1. I also asked RedHat nicely, and they added the same patch to RHEL5.8 kernels, which IMHO is the correct place for a fundamental md fix. so once Lustre supports RHEL5.8 servers, then the patch in Lustre isn't needed any more. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
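if you want to check an array for rebuild-era damage before deciding on the patch, md's online scrub is one way to do it (device name is an example; needs root on the OSS):

```shell
# kick off a consistency scrub of md0
echo check > /sys/block/md0/md/sync_action

# watch progress, then look for inconsistencies
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt   # non-zero after the check = trouble
```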
Re: [Lustre-discuss] exceedingly slow lstats
On Fri, Jan 20, 2012 at 02:35:19PM -0800, John White wrote: Well, I was reading the strace wrong anyway: lstat(../403/a323, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 0.134326 getxattr(../403/a323, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported) 0.18 lstat(../403/a330, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 0.158898 getxattr(../403/a330, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported) 0.19 lstat(../403/a331, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 0.239466 getxattr(../403/a331, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported) 0.12 lstat(../403/a332, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 0.130146 getxattr(../403/a332, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported) 0.12 The getxattr takes an incredibly short amount of time, it's the lstat itself that's taking 0.1+s. it used to be that weird slowdowns and high load could be caused by kernel zone_reclaim confusion, so firstly I'd suggest checking that vm.zone_reclaim_mode=0 everywhere (clients and servers). after that see if turning off read and write_through caches on OSS's helps metadata rates. there's a fair chance that streaming i/o to OSS's is filling OSS ram and pushing inodes/dentries out of OSS vfs cache causing big metadata slowdowns - the more streaming i/o the greater the slowdown. if turning off the data caches fixes the problem for you (ie. it's not faulty hardware or an old lustre version or something else) then there are a couple of different methods that could let you get both data caching and good metadata rates, but first things first... cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
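turning the OSS data caches off is a couple of lctl calls. parameter names are as in the 1.8/2.x obdfilter layer, but treat this as a sketch and check `lctl list_param obdfilter.*` on your version first:

```shell
# on each OSS: disable the read cache and the write-through cache
lctl set_param obdfilter.*.read_cache_enable=0
lctl set_param obdfilter.*.writethrough_cache_enable=0

# revert if metadata rates don't improve
lctl set_param obdfilter.*.read_cache_enable=1
lctl set_param obdfilter.*.writethrough_cache_enable=1
```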
[Lustre-discuss] improving metadata performance [was Re: question about size on MDS (MDT) for lustre-1.8]
rejoining this topic after a couple of weeks of experimentation Re: trying to improve metadata performance - we've been running with vfs_cache_pressure=0 on OSS's in production for over a week now and it's improved our metadata performance by a large factor. - filesystem scans that didn't finish in ~30hrs now complete in a little over 3 hours. so ~10x speedup. - a recursive ls -altrR of my home dir (on a random uncached client) now runs at 2000 to 4000 files/s whereas before it could be 100 files/s. so 20 to 40x speedup. of course vfs_cache_pressure=0 can be a DANGEROUS setting because inodes/dentries will never be reclaimed, so OSS's could OOM. however slabtop shows inodes are 0.89K and dentries 0.21K ie. small, so I expect many sites can (like us) easily cache everything. for a given number of inodes per OST it's easily calculable whether there's enough OSS ram to safely set vfs_cache_pressure=0 and cache them all in slab. continued monitoring of the fs inode growth (== OSS slab size) over time is very important as fs's will inevitably accrue more files... sadly a slightly less extreme vfs_cache_pressure=1 wasn't as successful at keeping stat rates high. sustained OSS cache memory pressure through the day dropped enough inodes that nightly scans weren't fast any more. our current residual issue with vfs_cache_pressure=0 is unexpected. the number of OSS dentries appears to slowly grow over time :-/ it appears that some/many dentries for deleted files are not reclaimed without some memory pressure. any idea why that might be? anyway, I've now added a few lines of code to create a different (non-zero) vfs_cache_pressure knob for dentries. we'll see how that goes... an alternate (simpler) workaround would be to occasionally drop OSS inode/dentry caches, or to set vfs_cache_pressure=100 once in a while, and to just live with a day of slow stat's while the inode caches repopulate. 
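the "easily calculable" check above can be a one-liner. per-object sizes are the ones from slabtop in the message (0.89 KiB per ldiskfs inode, 0.21 KiB per dentry); the inode counts are hypothetical, so substitute your own from 'df -i' on the OSTs:

```shell
# rough slab RAM needed to pin every inode+dentry on one OSS
inodes_per_ost=2000000    # hypothetical: used inodes per OST
osts=6                    # hypothetical: OSTs served by this OSS

awk -v n="$((inodes_per_ost * osts))" 'BEGIN {
    gib = n * (0.89 + 0.21) / (1024 * 1024)   # KiB -> GiB
    printf "slab needed: %.1f GiB\n", gib
}'
```

compare the result against the OSS RAM left over after the data caches; if it fits with room to spare, vfs_cache_pressure=0 is plausible.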
hopefully vfs_cache_pressure=0 also has a net small positive impact on regular i/o due to reduced iops to OSTs, but I haven't tried to measure that. slab didn't steal much ram from our read and write_through caches (we have 48g ram on OSS's and slab went up about 1.6g to 3.3g with the additional cached inodes/dentries) so OSS file caching should be almost unaffected. On Fri, Jan 28, 2011 at 09:45:10AM -0800, Jason Rappleye wrote: On Jan 27, 2011, at 11:34 PM, Robin Humble wrote: limiting the total amount of OSS cache used in order to leave room for inodes/dentries might be more useful. the data cache will always fill up and push out inodes otherwise. I disagree with myself now. I think mm/vmscan.c would probably still call shrink_slab, so shrinkers would get called and some cached inodes would get dropped. The inode and dentry objects in the slab cache aren't so much of an issue as having the disk blocks that each are generated from available in the buffer cache. Constructing the in-memory inode and dentry objects is cheap as long as the corresponding disk blocks are available. Doing the disk reads, depending on your hardware and some other factors, is not. on a test cluster (with read and write_through caches still active and synthetic i/o load) I didn't see a big change in stat rate from dropping OSS page/buffer cache - at most a slowdown for a client 'ls -lR' of ~2x, and usually no slowdown at all. I suspect this is because there is almost zero persistent buffer cache due to the OSS buffer and page caches being punished by file i/o. in the same testing, dropping OSS inode/dentry caches was a much larger effect (up to 60x slowdown with synthetic i/o) - which is why the vfs_cache_pressure setting works. the synthetic i/o wasn't crazily intensive, but did have a working set > OSS mem which is likely true of our production machine. however for your setup with OSS caches off, and from doing tests on our MDS, I agree that buffer caches can be a big effect. 
dropping our MDS buffer cache slows down a client 'lfs find' by ~4x, but dropping inode/dentry caches doesn't slow it down at all, so buffers are definitely important there. happily we're not under any memory pressure on our MDS's at the moment. We went the extreme and disabled the OSS read cache (+ writethrough cache). In addition, on the OSSes we pre-read all of the inode blocks that contain at least one used inode, along with all of the directory blocks. The results have been promising so far. Firing off a du on an entire filesystem, 3000-6000 stats/second is typical. I've noted a few causes of slowdowns so far; there may be more. we see about 2k files/s on the nightly sweeps now. that's with one lfs find running and piping to parallel stat's. I think we can do better with more parallelism in the finds, but 2k is so much better than what it used to be we're fairly happy for now. 2k isn't as good as your stat rates, but we still have OSS caches on, so the rest of our i/o should be benefiting from that. When memory runs low on a client, kswapd
Re: [Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On Thu, Jan 13, 2011 at 05:28:23PM -0500, Kit Westneat wrote: It would probably be better to set: lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M or similar, to limit the read cache to files 32MB in size or less (or whatever you consider small files at your site). That allows the read cache for config files and such, while not thrashing the cache while accessing large files. We should probably change this to be the default, but at the time the read cache was introduced, we didn't know what should be considered a small vs. large file, and the amount of RAM and number of OSTs on an OSS, and the uses varies so much that it is difficult to pick a single correct value for this. limiting the total amount of OSS cache used in order to leave room for inodes/dentries might be more useful. the data cache will always fill up and push out inodes otherwise. Nathan's approach of turning off the caches entirely is extreme, but if it gives us back some metadata performance then it might be worth it. or is there a Lustre or VM setting to limit overall OSS cache size? I presume that Lustre's OSS caches are subject to normal Linux VM pagecache tweakables, but I don't think such a knob exists in Linux at the moment... I was looking through the Linux vm settings and saw vfs_cache_pressure - has anyone tested performance with this parameter? Do you know if this would have any effect on file caching vs. ext4 metadata caching? For us, Linux/Lustre would ideally push out data before the metadata, as the performance penalty for doing 4k reads on the s2a far outweighs any benefits of data caching. good idea. if all inodes are always cached on OSS's then the fs should be far more responsive to stat loads... 4k/inode shouldn't use up too much of the OSS's ram (probably more like 1 or 2k/inode really). 
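the suggested cap can be set persistently from the MGS (as quoted above) or tried out live on an OSS; fsname and OST index here are placeholders:

```shell
# persistent: run once on the MGS, per OST
lctl conf_param testfs-OST0000.ost.readcache_max_filesize=32M

# or temporarily on an OSS, for every OST it serves
lctl set_param obdfilter.*.readcache_max_filesize=32M
```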
anyway, following your idea, we tried vfs_cache_pressure=50 on our OSS's a week or so ago, but hit this within a couple of hours https://bugzilla.lustre.org/show_bug.cgi?id=24401 could have been a coincidence I guess. did anyone else give it a try? BTW, we recently had the opposite problem on a client that scans the filesystem - too many inodes were cached leading to low memory problems on the client. we've had vfs_cache_pressure=150 set on that machine for the last month or so and it seems to help. although a more effective setting in this case was limiting ldlm locks. eg. from the Lustre manual lctl set_param ldlm.namespaces.*osc*.lru_size=1 cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Re: [Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
Hi Nathan, On Thu, Jan 06, 2011 at 05:42:24PM -0700, nathan.dau...@noaa.gov wrote: I am looking for more information regarding the size on MDS feature as it exists for lustre-1.8.x. Testing on our system (which started out as 1.6.6 and is now 1.8.x) indicates that there are many files which do not have the size information stored on the MDT. So, my basic question: under what conditions will the size hint attribute be updated? Is there any way to force the MDT to query the OSTs and update its information? atime (and the MDT size hint) wasn't being updated for most of the 1.8 series due to this bug: https://bugzilla.lustre.org/show_bug.cgi?id=23766 the atime fix is now in 1.8.5, but I'm not sure if anyone has verified whether or not the MDT size hint is now behaving as originally intended. actually, it was never clear to me what (if anything?) ever accessed OBD_MD_FLSIZE... does someone have a hacked 'lfs find' or similar tool? your approach of mounting and searching a MDT snapshot should be possible, but it would seem neater just to have a tool on a client send the right rpc's to the MDS and get the information that way. like you, we are finding that the timescales for our filesystem trawling scripts are getting out of hand, mostly (we think) due to retrieving size information from very busy OSTs. a tool that only hit the MDT and found (filename, uid, gid, approx size) should help a lot. so +1 on this topic. BTW, once you have 1.8.5 on the MDS, then a hack to populate the MDT size hints might be to read 4k from every file in the system. that should update atime and the size hint. please let us know if this works. The end goal of this is to facilitate efficient checks of disk usage on a per-directory basis (essentially we want volume based quotas). a possible approach for your situation would be to chgrp every file under a directory to be the same gid, and then enable (un-enforcing) group quotas on your filesystem. 
then you wouldn't have to search any directories. you would still have to find and chgrp some files nightly, but 'lfs find' should make that relatively quick. unfortunately we also need a breakdown of the uid information in each directory, so this approach isn't sufficient for us. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility hoping to run something once a day on the MDS like the following: lvcreate -s -p r -n mdt_snap /dev/mdt mount -t ldiskfs -o ro /dev/mdt_snap /mnt/snap cd /mnt/snap/ROOT du --apparent-size ./* > volume_usage.log cd / umount /mnt/snap lvremove /dev/mdt_snap Since the data is going to be up to one day old anyway, I don't really mind that the file size is approximate, but it does have to be reasonably close. With the MDT LVM snapshot method I can check the whole 300TB file system in about 3 hours, whereas checking from a client takes weeks. Here is why I am relatively certain that the size-on-MDS attributes are not updated (lightly edited): [r...@mds0 ~]# ls -l /mnt/snap/ROOT/test/rollover/user_acct_file -rw-r--r-- 1 9000 0 Mar 23 2010 /mnt/snap/ROOT/test/rollover/user_acct_file [r...@mds0 ~]# du /mnt/snap/ROOT/test/rollover/user_acct_file 0 /mnt/snap/ROOT/test/rollover/user_acct_file [r...@mds0 ~]# du --apparent-size /mnt/snap/ROOT/test/rollover/user_acct_file 0 /mnt/snap/ROOT/test/rollover/user_acct_file [r...@c448 ~]# ls -l /mnt/lfs0/test/rollover/user_acct_file -rw-r--r-- 1 user group 184435207 Mar 23 2010 /mnt/lfs0/test/rollover/user_acct_file [r...@c448 ~]# du /mnt/lfs0/test/rollover/user_acct_file 180120 /mnt/lfs0/test/rollover/user_acct_file [r...@c448 ~]# du --apparent-size /mnt/lfs0/test/rollover/user_acct_file 180113 /mnt/lfs0/test/rollover/user_acct_file Thanks very much for any answers or suggestions you can provide! 
-Nathan
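the chgrp-plus-group-quota idea from the reply above could look something like this. the gid, group name and paths are made up, and 'lfs find' option spellings vary between Lustre versions, so check 'lfs help find' on yours first:

```shell
# tag everything under the project directory with one gid
chgrp -R projA /mnt/lfs0/projects/projA

# nightly: catch new files that came in with the wrong group
lfs find /mnt/lfs0/projects/projA ! --gid 5000 | xargs -r chgrp projA

# per-directory usage is then just a (non-enforcing) group quota lookup
lfs quota -g projA /mnt/lfs0
```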
Re: [Lustre-discuss] Parallel fortran program bug.
On Thu, Dec 23, 2010 at 03:48:50PM +0100, Roy Dragseth wrote: On Thursday, December 23, 2010 15:18:13 Rick Grubin wrote: We have an occasional problem with parallel fortran programs that open files with status old or unknown returns errors on open. This seems Sounds like bug 17545: https://bugzilla.lustre.org/show_bug.cgi?id=17545 The issue is fixed for v1.8.2 and beyond. Thanks a lot for your quick reply! This seems to be it, we will upgrade next week. if you are using Intel Fortran, then I think your open() failures will probably continue even with latest Lustre, but at a lower rate. see https://bugzilla.lustre.org/show_bug.cgi?id=23978 this bug has flown under the radar a bit as it causes fairly cryptic app failures, and only Intel fortran hits it with any frequency. what the user sees usually just looks like a failed open with an oddly corrupted filename string. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Re: [Lustre-discuss] Issues with Lustre Client 1.8.4 and Server 1.8.1.1
Hi Jagga, On Wed, Oct 13, 2010 at 02:33:35PM -0700, Jagga Soorma wrote: .. start seeing this issue. All my clients are setup with SLES11 and the same packages with the exception of a newer kernel in the 1.8.4 environment due to the lustre dependency: reshpc208:~ # uname -a Linux reshpc208 2.6.27.39-0.3-default #1 SMP 2009-11-23 12:57:38 +0100 x86_64 x86_64 x86_64 GNU/Linux ... open(/proc/9598/stat, O_RDONLY) = 6 read(6, 9598 (gsnap) S 9596 9589 9589 0 ..., 1023) = 254 close(6)= 0 open(/proc/9598/status, O_RDONLY) = 6 read(6, Name:\tgsnap\nState:\tS (sleeping)\n..., 1023) = 1023 close(6)= 0 open(/proc/9598/cmdline, O_RDONLY)= 6 read(6, did you get any further with this? we've just seen something similar in that we had D state hung processes and a strace of ps hung at the same place. in the end our hang appeared to be /dev/shm related, and an 'ipcs -ma' magically caused all the D state processes to continue... we don't have a good idea why this might be. looks kinda like a generic kernel shm deadlock, possibly unrelated to Lustre. sys_shmdt features in the hung process tracebacks that the kernel prints out. if you do 'lsof' do you see lots of /dev/shm entries for your app? the app we saw run into trouble was using HPMPI which is common in commercial packages. does gsnap use HPMPI? we are running vanilla 2.6.32.* kernels with Lustre 1.8.4 clients on this cluster. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
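a quick way to poke at the /dev/shm theory on a hung node might be (assuming lsof is installed; in the case above the ipcs call alone unstuck the D-state processes):

```shell
# who has segments open in /dev/shm?
lsof /dev/shm | head

# which processes are in uninterruptible sleep, and where?
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# list all SysV shared memory segments
ipcs -ma
```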
Re: [Lustre-discuss] Issues with Lustre Client 1.8.4 and Server 1.8.1.1
On Wed, Oct 13, 2010 at 02:33:35PM -0700, Jagga Soorma wrote: Doing a ps just hangs on the system and I need to just close and reopen a session to the effected system. The application (gsnap) is running from the lustre filesystem and doing all IO to the lustre fs. Here is a strace of where ps hangs: one possible cause of hung processes (that's not Lustre related) is the VM tying itself in knots. are your clients NUMA machines? is /proc/sys/vm/zone_reclaim_mode = 0? I guess this explanation is a bit unlikely if your only change is the client kernel version, but you don't say what you changed it from and I'm not familiar with SLES, so the possibility is there, and it's an easy fix (or actually a dodgy workaround) if that's the problem. -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Re: [Lustre-discuss] ost's reporting full
Hey Dr Stu, On Sat, Sep 11, 2010 at 04:27:43PM +0800, Stuart Midgley wrote: We are getting jobs that fail due to no space left on device. BUT none of our lustre servers are full (as reported by lfs df -h on a client and by df -h on the oss's). They are all close to being full, but are not actually full (still have ~300gb of space left) sounds like a grant problem. I've tried playing around with tune2fs -m {0,1,2,3} and tune2fs -r 1024 etc and nothing appears to help. Anyone have a similar problem? We are running 1.8.3 there are a couple of grant leaks that are fixed in 1.8.4 eg. https://bugzilla.lustre.org/show_bug.cgi?id=22755 or see the 1.8.4 release notes. however the overall grant revoking problem is still unresolved AFAICT https://bugzilla.lustre.org/show_bug.cgi?id=12069 and you'll hit that issue more frequently with many clients and small OSTs, or when any OST starts getting full. in your case 300g per OST should be enough headroom unless you have ~4k clients now (assuming 32-64m grants per client), so it's probably grant leaks. there's a recipe for adding up client grants and comparing them to server grants to see if they've gone wrong in bz 22755. cheers, robin
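the bz 22755 recipe amounts to summing the client-side grants and comparing with what the OST thinks it has handed out. the numbers below are mocked so the arithmetic is visible; on a real system replace them with the output of `lctl get_param -n osc.*.cur_grant_bytes` (gathered from every client) and `lctl get_param -n obdfilter.*.tot_granted` (on the OSS):

```shell
# mocked cur_grant_bytes from three clients
client_grants="33554432
67108864
33554432"
# mocked tot_granted from the OSS
oss_tot_granted=150994944

sum=$(printf '%s\n' "$client_grants" | awk '{ s += $1 } END { print s }')
echo "clients think: $sum  OSS thinks: $oss_tot_granted"
[ "$sum" -eq "$oss_tot_granted" ] || \
    echo "grant leak of $((oss_tot_granted - sum)) bytes"
```

a persistent gap between the two numbers is the signature of a leak; remounting the clients is the usual blunt way to reclaim it.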
Re: [Lustre-discuss] SSD caching of MDT
On Thu, Aug 19, 2010 at 01:29:37PM +0100, Gregory Matthews wrote: Article by Jeff Layton: http://www.linux-mag.com/id/7839 anyone have views on whether this sort of caching would be useful for the MDT? My feeling is that MDT reads are probably pretty random but writes might benefit...? if you look at the tiny size of inodes in slabtop on an MDS you'll see that all read ops for most fs's are probably 100% cached in ram by a decent sized MDS. ie. once you have traversed all inodes of a fs once, then likely the MDT's are a write-only media, and the ram of the MDS is a faster iop machine than any SSD could ever be. you are then left with a MDT workload of entirely small writes. that is definitely not a SSD sweet spot - many SSDs will fragment badly and slow down horrendously, which eg. JBODs of 15k rpm SAS disks will not do. basically beware of cheap SSDs, possibly any SSD, and certainly any SSD that isn't an Intel x25-e or better. the Marvell controller SSDs we sadly have many of now, I would not inflict upon any MDT. also, having experimented with ramdisk MDT's (not in production obviously), it is clear that even this 'perfect' media doesn't solve all Lustre iops problems. far from it. usually it just means that you hit algorithmic or numa problems in Lustre MDS code, or (more likely) the ops just flow onto the OSTs and those become the bottleneck instead. basically ramdisk MDT speedups weren't big over even just say, 16 fast FC or SAS disks. SSDs would be in-between if they were behaving perfectly, which would require extensive testing to determine. looking at it a different way, Lustre's statahead kinda works ok, create's are (IIRC) batched so also scale ok, so delete's might be the only workload left where the fastest MDT money can buy would get you any significant benefit... probably not worth the spend for most folks. 
assuming for a moment that SSDs worked as they should, then other Lustre related workloads for which SSDs might be suitable are external journals for OSTs, md bitmaps, or (one day) perhaps ZFS intent logs. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
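for the external-journal idea, the usual ldiskfs recipe looks roughly like this. device names, fsname and NID are placeholders, and journal size and options should be checked against the Lustre manual for your version:

```shell
# make a journal device on the SSD (hypothetical LV /dev/ssd/journal0)
mke2fs -O journal_dev -b 4096 /dev/ssd/journal0

# point the OST's ldiskfs at it at format time
mkfs.lustre --ost --fsname=testfs --mgsnode=mgs@o2ib \
    --mkfsoptions="-J device=/dev/ssd/journal0" /dev/md0
```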
Re: [Lustre-discuss] best practice for lustre clustre startup
On Thu, Jul 01, 2010 at 11:17:31AM -0600, Kevin Van Maren wrote: My (personal) opinion: Lustre clients should always start (mount) automatically. yup Lustre servers should have their services started through heartbeat (or other HA package), if failover is possible (be sure to configure stonith). IMHO that's a bad idea. servers should not start automatically. my objections to automated mount/failover are not Lustre related, but to all layers underneath - as Kevin well knows, mptsas drivers can and do and have screwed up majorly and I'm sure other drivers have too. md is far from smart, and disks are broken in such an infinite amount of weird and wonderful ways that no driver or OS can reasonably be expected to deal with them all :-/ if you have the simple setup of singly-attached storage and a Lustre server just crashed, then why wouldn't it just crash again? we have had that happen. automated startup seems silly in this case - especially if you don't know what the problem was to start with. worst case is if the hardware started corrupting data and crashed the machine, is it really a good idea to reboot, remount, continue corrupting data more, and then keep rebooting until dawn? if you have a more elaborate Lustre setup with HA failover pairs then the above applies, and additionally there are inherent races in both nodes in a pair trying to mount a set of disks if you do not have a third impartial member participating in a failover quorum - not a common HA setup for Lustre, although it probably should be. if a sw raid is assembled on both machines at the same time because of a HA race, then it's likely data will be lost. Lustre mmp should save you from multi-mounting the OST, but obviously not from corruption if the underlying raid is pre-trashed. overall without diagnosing why a machine crashed I fail to see how an automated reboot or failover can possibly be a safe course of action. 
cheers, robin If heartbeat starts automatically, do ensure auto-failback is NOT enabled: fail the resources back manually after you verify the rebooted server is healthy. Whether heartbeat starts automatically seems to be a preference issue. While unlikely, it is possible for an issue to cause Lustre to not start successfully, resulting in a node crash or other issue preventing a login. So if it does start automatically you'll want to be prepared to reboot w/o Lustre (eg, single-user mode). Kevin
Re: [Lustre-discuss] soft lockups on NFS server/Lustre client
On Mon, Oct 12, 2009 at 05:06:28PM +0100, Frederik Ferner wrote: Hi List, on our NFS server exporting our Lustre file system to a number of NFS clients, we've recently started to see kernel: BUG: soft lockup messages. As the locked processes include nfsd, our users are obviously not happy. Around the time when the soft lockup occurs we also see a lot of kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags() messages, but I don't know if this is related. probably not related. we were seeing this too (no NFS involved at all) https://bugzilla.lustre.org/show_bug.cgi?id=20904 and the upshot is that I'm pretty sure it's harmless and a RHEL bug. I filed https://bugzilla.redhat.com/show_bug.cgi?id=526853 but it's probably being ignored. if you have a rhel support contract maybe you can kick it along a bit... dunno about your soft lockups. as I understand it soft lockups themselves aren't harmful as long as they progress eventually. Lustre 1.6.6 isn't exactly recent. have you tried 1.6.7.2 on your NFS exporter? presumably soft lockups could also be saying your re-exporter or OSS's are overloaded or that you have a slow disk or 3 in a RAID... without NFS involved are all your OSTs up to speed? do you still get problems after echo 60 > /proc/sys/kernel/softlockup_thresh cheers, robin We are using Lustre 1.6.6 on all machines, (MDS, OSS, clients). The NFS server/Lustre client with the lockups is running RHEL5.4 with an unpatched RedHat kernel (kernel-2.6.18-92.1.10.el5) with the Lustre modules from Sun. See below for sample logs from the Lustre client/NFS server. I can provide more logs if required. I'm not sure if this a Lustre issue but would appreciate if someone could help. We've not seen it on any other NFS server so far and there seems to be at least some lustre related stuff in the stack trace. Is this a known issue and how can we avoid this? I have not found anything using google and the search on bugzilla.lustre.org. 
At least the BUG warning seems to be a known issue on this kernel. I hope the logs below are readable enough, I tried to find entries where the stack traces don't overlap but this seems to be the best I can find. Oct 9 15:21:27 cs04r-sc-serv-07 kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags() (Tainted: G ) Oct 9 15:21:27 cs04r-sc-serv-07 kernel: Oct 9 15:21:27 cs04r-sc-serv-07 kernel: Call Trace: Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed7d1] set_dentry_child_flags+0xef/0x14d Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed867] remove_watch_no_event+0x38/0x47 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed88e] inotify_remove_watch_locked+0x18/0x3b Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed97c] inotify_rm_wd+0x7e/0xa1 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ede6e] sys_inotify_rm_watch+0x46/0x63 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [8005d28d] tracesys+0xd5/0xe0 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: Oct 9 15:21:27 cs04r-sc-serv-07 kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags() (Tainted: G ) Oct 9 15:21:27 cs04r-sc-serv-07 kernel: Oct 9 15:21:27 cs04r-sc-serv-07 kernel: Call Trace: Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed7d1] set_dentry_child_flags+0xef/0x14d Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed867] remove_watch_no_event+0x38/0x47 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed88e] inotify_remove_watch_locked+0x18/0x3b Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ed97c] inotify_rm_wd+0x7e/0xa1 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: [800ede6e] sys_inotify_rm_watch+0x46/0x63 Oct 9 15:21:27 cs04r-sc-serv-07 kernel: BUG: soft lockup - CPU#5 stuck for 10s! 
[nfsd:1] Oct 9 15:21:28 cs04r-sc-serv-07 kernel: CPU 5: Oct 9 15:21:28 cs04r-sc-serv-07 kernel: Modules linked in: vfat fat usb_storage dell_rbu mptctl ipmi_devintf ipmi_si ipmi_msghandler nfs fscache nfsd exportfs lockd nfs_acl auth_rpcgss autofs4 hidp mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) ob dclass(U) lnet(U) lvfs(U) libcfs(U) rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api mlx4_en(U) dm_multipath video sbs backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sr_mod cdrom mlx4_core(U) bnx2 serio_raw pcsp kr sg dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Oct 9 15:21:28 cs04r-sc-serv-07 kernel: Pid: 1, comm: nfsd Tainted: G 2.6.18-92.1.10.el5 #1 Oct 9 15:21:28 cs04r-sc-serv-07 kernel: RIP: 0010:[80064ba7] [80064ba7] .text.lock.spinlock+0x5/0x30 Oct 9 15:21:28 cs04r-sc-serv-07 kernel: RSP: 0018:810044241ac8 EFLAGS: 0286 Oct 9 15:21:28 cs04r-sc-serv-07 kernel: RAX: 81006cb6a1a8 RBX: 81006cb6a178 RCX: 810044241b50 Oct 9 15:21:28 cs04r-sc-serv-07
Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases
On Thu, Sep 10, 2009 at 12:35:54PM +0200, Johann Lombardi wrote: We have attached a new patch to bug 20560 which should address your problem which may happen in rare cases with partial truncates. as we are about to throw users onto the new system, can I ask for a quick update pointing us to the current best guess at a workaround/fix for the 1.8.1 read cache problems please? to me it looks like https://bugzilla.lustre.org/show_bug.cgi?id=20560 is still evolving, but it looks like writethrough_cache=0 should now work (and not crash the OSS) with attachment: https://bugzilla.lustre.org/attachment.cgi?id=25833 so if I patched our OSS's with just this one liner, then would that be enough to run with until the situation has had some time to bed in? or would we be better off with all 4 patches from 20560 applied (and both read cache's still off)? cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Re: [Lustre-discuss] hacking max_sectors
On Wed, Aug 26, 2009 at 04:11:12AM -0600, Andreas Dilger wrote: On Aug 26, 2009 00:46 -0400, Robin Humble wrote: with the patch, 1M i/o's are being fed to md (according to brw_stats), and performance is a little better for RAID6 8+2 with 128k chunks, and a bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed half 512k and half 1M i/o's by Lustre). This was the other question I'd asked internally. If the array is formatted with 64kB chunks then 512k IOs shouldn't cause any read-modify-write operations and (in theory) give the same performance as 1M IOs on a 128kB chunksize array. What is the relative performance of the 64kB and 128kB configurations? on these 1TB SATA RAID6 8+2's and external journals, with 1 client writing to 1 OST, with 2.6.18-128.1.14.el5 + lustre1.8.1 + blkdev/md patches from https://bugzilla.lustre.org/show_bug.cgi?id=20533 so that 128k chunk md gets 1M i/o's and 64k chunk md gets 512k i/o's then -
client max_rpcs_in_flight 8
md chunk   write (MB/s)   read (MB/s)
64k        185            345
128k       235            390
so 128k chunks are 10-30% quicker than 64k in this particular setup on big streaming i/o tests (1G of 1M lmdd's). having said that, 1.6.7.2 servers do better than 1.8.1 on some configs (I haven't had time to figure out why) but the trend of 128k chunks being faster than 64k chunks remains. also if the i/o load was messier and involved smaller i/o's then 64k chunks might claw something back - probably not enough though. BTW, whilst we're on the topic - what does this part of brw_stats mean?
                         read      |      write
disk fragmented I/Os   ios % cum % |  ios % cum %
1:                5742 100 100     |  103186 100 100
this is for the 128k chunk case, where the rest of brw_stats says I'm seeing 1M rpc's and 1M i/o's, but I'm not sure what '1' disk fragmented i/o's means - should it be 0? or does '1' mean unfragmented? sorry for packing too many questions into one email, but these slowish SATA disks seem to need lots of rpc's in flight for good performance. 
32 max_dirty_mb (the default) and 32 max_rpcs_in_flight seems a good magic combo. with that I get:

client max_rpcs_in_flight 32

  md chunk    write (MB/s)    read (MB/s)
  64k         275             450
  128k        395             480

which is a lot faster... with a heavier load of 20 clients hammering 4 OSS's each with 4 R6 8+2 OSTs I still see about a 10% advantage for clients with 32 rpcs.

is there a down side to running clients with max_rpcs_in_flight 32? the initial production machine will be ~1500 clients and ~25 OSS's.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
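[for reference, a sketch of how the client-side tunables discussed above are set with lctl - 1.6/1.8-era parameter names, and the values are just the ones from the tests above, not a recommendation:]

```shell
# set the client tunables discussed above (run as root on each client;
# osc.* matches the OSC for every OST)
lctl set_param osc.*.max_rpcs_in_flight=32
lctl set_param osc.*.max_dirty_mb=32

# read them back to confirm
lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb
```

note these don't persist across remounts - re-apply them from a boot script, or (on versions that support it) set them permanently with lctl conf_param on the MGS.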
[Lustre-discuss] hacking max_sectors
Hiya,

I've had another go at fixing the problem I was seeing a few months ago:
  http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010315.html
and which we are seeing again now as we are setting up a new machine with 128k chunk software raid (md) RAID6 8+2, eg.

  Lustre: test-OST000d: underlying device md5 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=2560

I came up with the attached simple core kernel change which fixes the problem, and seems stable enough under initial stress testing, but a core scsi tweak seems a little drastic to me - is there a better way to do it?

without this patch, and despite raising all disks to a ridiculously huge max_sectors_kb, all Lustre 1M rpc's are still fragmented into two 512k chunks before being sent to md :-/ likely md then aggregates them again 'cos performance isn't totally dismal, which it would be if it was 100% read-modify-writes for each stripe write.

with the patch, 1M i/o's are being fed to md (according to brw_stats), and performance is a little better for RAID6 8+2 with 128k chunks, and a bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed half 512k and half 1M i/o's by Lustre).

the one-liner is a core kernel change, so perhaps some Lustre/kernel block device/md people can look at it and see if it's acceptable for inclusion in standard Lustre OSS kernels, or whether it breaks assumptions in the core scsi layer somehow.

IMHO the best solution would be to apply the patch, and then have a /sys/block/md*/queue/ for md devices so that max_sectors_kb and max_hw_sectors_kb can be tuned without recompiling the kernel... is that possible?
the patch is against 2.6.18-128.1.14.el5-lustre1.8.1

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

--- linux-2.6.18.x86_64.lustre/include/linux/blkdev.h	2009-08-18 17:40:51.0 +1000
+++ linux-2.6.18.x86_64.lustre.hackBlock/include/linux/blkdev.h	2009-08-21 13:47:55.0 +1000
@@ -778,7 +778,7 @@
 #define MAX_PHYS_SEGMENTS 128
 #define MAX_HW_SEGMENTS 128
 #define SAFE_MAX_SECTORS 255
-#define BLK_DEF_MAX_SECTORS 1024
+#define BLK_DEF_MAX_SECTORS 2048
 #define MAX_SEGMENT_SIZE 65536

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Lustre and kernel vulnerability CVE-2009-2692
On Fri, Aug 21, 2009 at 06:41:01PM +0200, Thomas Roth wrote:
>Hi all,
>while trying to fix the recent kernel vulnerability (CVE-2009-2692) we
>found that in most cases, our Lustre 1.6.5.1, 1.6.6 and 1.6.7.2 clients
>seemed to be quite well protected, at least against the published
>exploit: wunderbar_emporium seems to work, but then the root shell
>never appears. Instead, the client freezes, requiring a reset. Anybody
>else with such experiences?

no freezes here. wunderbar_emporium didn't work against rhel/centos 2.6.18-128.4.1.el5 with patchless Lustre 1.6.7.2 after it was patched with the upstream one-liner:
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=e694958388c50148389b0e9b9e9e8945cf0f1b98
no idea if it was exploitable before or not - didn't try.

RedHat's view on this vulnerability is, err, interesting... :-/
  http://kbase.redhat.com/faq/docs/DOC-18065
  https://bugzilla.redhat.com/show_bug.cgi?id=516949

>Employing the recommended workaround by setting vm.mmap_min_addr to 4096

where did you see that recommended? the RHEL based machines I've looked at have this set to 64k, but if they are also running SELinux (which I presume few Lustre machines are?) then they still might be vulnerable I guess.

cheers,
robin

>blew up in our face: in particular machines with older kernels not
>knowing about mmap_min_addr reacted quite irrationally, such as
>segfaulting about every process running on the machine. Crazy things
>that should not be possible.
>Regards, Thomas
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
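[for anyone wanting to check their own boxes, the knob in question can be inspected like this - the 65536 value is the RHEL default mentioned above, and the sysctl.conf line is only a sketch:]

```shell
# read the current low-memory mmap protection threshold (in bytes;
# 0 means NULL-page mappings are allowed, which the CVE-2009-2692
# exploits rely on)
cat /proc/sys/vm/mmap_min_addr

# to raise it at runtime (as root) - 65536 matches the RHEL default above:
#   sysctl -w vm.mmap_min_addr=65536
# and to persist across reboots, add to /etc/sysctl.conf:
#   vm.mmap_min_addr = 65536
```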
Re: [Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
I added this to bugzilla. https://bugzilla.lustre.org/show_bug.cgi?id=20227 cheers, robin On Wed, Jul 15, 2009 at 01:09:33PM -0400, Robin Humble wrote: On Wed, Jul 15, 2009 at 08:46:12AM -0400, Robin Humble wrote: I get a ferocious set of error messages when I mount a 1.6.7.2 filesystem on a b_release_1_8_1 client. is this expected? just to annotate the below a bit in case it's not clear... sorry - should have done that in the first email :-/ 10.8.30.244 is MGS and one MDS, 10.8.30.245 is the other MDS in the failover pair. 10.8.30.201 - 208 are OSS's (one OST per OSS), and the fs is mounted in the usual failover way eg. mount -t lustre 10.8.30@o2ib:10.8.30@o2ib:/system /system from the below (and other similar logs) it kinda looks like the client fails and then renegotiates with all the servers. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: mgc10.8.30@o2ib: Reactivating import Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: Client system-client has started Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 ... last message repeated 17 times ... 
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 looks like it succeeds in the end, but only after a struggle. I don't have any problems with 1.8.1 - 1.8.1 or 1.6.7.2 - 1.6.7.2. servers are rhel5 x86_64 2.6.18-92.1.26.el5 1.6.7.2 + bz18793 (group quota fix). client is rhel5 x86_64 patched 2.6.18-128.1.16.el5-b_release_1_8_1 from cvs 20090712131220 + bz18793 again. BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1? 
I'm confused about which is closest to the final 1.8.1 :-/ cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] mds adjust qunit failed
On Tue, Jul 21, 2009 at 01:50:43PM +0800, Lu Wang wrote:
>Dear list,
>I have gotten over 19000 quota-related errors on one MDS since 18:00
>yesterday like:
>Jul 20 18:24:04 * kernel: LustreError: 10999:0:(quota_master.c:507:mds_quota_adjust()) mds adjust qunit failed! (opc:4 rc:-122)

if you look through the Linux errno header files, you'll find -122 is

  EDQUOT  /* Quota exceeded */

so someone or some group is over quota - either inodes or diskspace.

it would be really good if this message said which uid/gid was over quota, and from which client, and on which filesystem. as you have found, the current message is not very informative and overly verbose.

I was looking at the quota code around this message a few days ago, and it looks like it'd be really easy to add some extra info to the message, but I have yet to test a toy patch I wrote...

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

>Jul 20 18:29:27 * kernel: LustreError: 11007:0:(quota_master.c:507:mds_quota_adjust()) mds adjust qunit failed! (opc:4 rc:-122)
>Jul 21 13:44:27 * kernel: LustreError: 10999:0:(quota_master.c:507:mds_quota_adjust()) mds adjust qunit failed! (opc:4 rc:-122)
># grep master /var/log/messages |wc
>  19628  255058 2665136
>Does anyone know what this means? The MDS is running on
>2.6.9-67.0.22.EL_lustre.1.6.6smp.
>Best Regards
>Lu Wang
>--
>Computing Center, IHEP
>Office: Computing Center, 123 19B Yuquan Road   Tel: (+86) 10 88236012-607
>P.O. Box 918-7                                  Fax: (+86) 10 8823 6839
>Beijing 100049, China                           Email: lu.w...@ihep.ac.cn
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
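[as an aside, a quick way to decode these negative rc values without digging through the headers by hand - the grep path is the usual kernel header location, and the python one-liner is just a convenience fallback:]

```shell
# decode rc:-122 from the LustreError message into an errno name
rc=-122
err=${rc#-}
grep -hw "$err" /usr/include/asm-generic/errno*.h 2>/dev/null \
    || python3 -c "import errno, os; print(errno.errorcode[$err], '-', os.strerror($err))"
```

either way the output names EDQUOT.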
[Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
I get a ferocious set of error messages when I mount a 1.6.7.2 filesystem on a b_release_1_8_1 client. is this expected?

Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: mgc10.8.30@o2ib: Reactivating import
Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: Client system-client has started
Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
... last message repeated 17 times ...
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5

looks like it succeeds in the end, but only after a struggle. I don't have any problems with 1.8.1 clients against 1.8.1 servers, or 1.6.7.2 against 1.6.7.2.

servers are rhel5 x86_64 2.6.18-92.1.26.el5 1.6.7.2 + bz18793 (group quota fix). client is rhel5 x86_64 patched 2.6.18-128.1.16.el5-b_release_1_8_1 from cvs 20090712131220 + bz18793 again.

BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1? I'm confused about which is closest to the final 1.8.1 :-/

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] recreate metada - possible?
On Wed, Jul 15, 2009 at 08:32:27AM -0400, Brian J. Murrell wrote:
>On Wed, 2009-07-15 at 10:53 +0200, Tom Woezel wrote:
>>Now the partition table on the raiddevice got deleted and cannot be
>>recovered.
>
>Ouch. How did it get deleted? How come it cannot be recovered? A
>partition table is nothing more than a small area at the start of a
>disk that contains pointers (i.e. offsets on the disk) to where
>partitions start and end.

if a kernel is still up and looking at the device then /proc/partitions and /sys/block/<disk>/* might well still contain enough valid data from which the previous partition table can be reconstructed. been there, dd'd over that. (almost) all good in the end :-) thankfully not to a Lustre fs, just my home server :-/

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

>Even if it was completely wiped, the process of scanning the entire
>disk looking for signatures that can help identify a likely partition
>beginning and then recreate the partition table is usually quite
>successful. You might want to look into the gpart tool for this.
>
>>The OSTs are ok and the data on those should be fine.
>
>Yes. But all you have is file contents, nothing else.
>
>>Now here is my question: is it possible to create a new MGS and new
>>MDTs and somehow connect the old OSTs to them?
>
>No. There is nothing on the OSTs that indicates what file an object
>belongs to. This is why we are adamant about MDT storage being reliable
>and backed up.
>
>>Is there a way to recreate the metadata with the data which is held on
>>the OSTs?
>
>No.
>
>>I'm deeply grateful for any help or hint on this issue.
>
>Without knowing the whole story of your MDT/RAID saga, I'd say gpart is
>your best bet.
>
>b.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
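[the /proc and /sys trick mentioned above looks roughly like this - purely a diagnostic sketch, and the sfdisk line at the end is illustrative only, with a hypothetical device name:]

```shell
# if the kernel that last read the old partition table is still running,
# the partition offsets survive in procfs/sysfs:
cat /proc/partitions
for part in /sys/block/*/*[0-9]; do
    [ -r "$part/start" ] || continue
    printf '%s: start=%s sectors=%s\n' \
        "${part##*/}" "$(cat "$part/start")" "$(cat "$part/size")"
done
# those start/size pairs (in 512-byte sectors) can then be fed back to
# sfdisk to recreate the table, e.g. (hypothetical device /dev/sdX):
#   echo '<start>,<size>' | sfdisk --force /dev/sdX -N1
```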
Re: [Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
On Wed, Jul 15, 2009 at 10:10:06AM -0400, Brian J. Murrell wrote: On Wed, 2009-07-15 at 08:46 -0400, Robin Humble wrote: Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: mgc10.8.30@o2ib: Reactivating import Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: Client system-client has started Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 ... last message repeated 17 times ... Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version 
negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 These are all LND errors. What versions of OFED are you using on each end? all kernels all compiled with the rhel5 kernel tree's standard OFED. I think 1.3.2 is what's in rhel5.3/centos5.3? looks like it succeeds in the end, but only after a struggle. Is it completely stable and performant after the struggle? Do the error messages stop? the fs's appear to be fine. the error messages are just on the initial mount of the first lustre fs. subsequent mounts of other lustre fs's don't get any messages, so it seems like it's just an extremely noisy protocol/version negotiation the first time the 1.8.1 lnet fires up and tries to talk to 1.6.7.2 servers?? another data point is that the above errors don't happen with 2.6.18-128.1.14.el5 patched with 1.8.0.1 and using the same in-kernel OFED, so it's probably something that's happened between 1.8.0.1 and 1.8.1-pre. or I guess it could be a rhel change between 2.6.18-128.1.14.el5 and 2.6.18-128.1.16.el5, but that seems less likely. I can spin up a 2.6.18-128.1.14.el5 with b_release_1_8_1 if you like... BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1? I'm confused about which is closest to the final 1.8.1 :-/ b_release_1_8_1 is the branch and v1_8_1_RC1 is the tag (i.e. snapshot in time from the branch) which is getting tested from that branch which has the potential to become 1.8.1 if the testing pans out. 
It is entirely possible that even when v1_8_1_RCn becomes the final release, there will be patches dangling on the tip of b_release_1_8_1 that are not release blockers but there in case we need a 1.8.1.1. So the choice is yours. If you want to be using exactly what could potentially be the GA release, you should stick to using the most recent tags. If you want to test ahead of what could be the GA, use the branch tip. cool. thanks for the explanation. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
On Wed, Jul 15, 2009 at 11:59:54AM -0400, Brian J. Murrell wrote:
>On Wed, 2009-07-15 at 11:22 -0400, Robin Humble wrote:
>>another data point is that the above errors don't happen with
>>2.6.18-128.1.14.el5 patched with 1.8.0.1 and using the same in-kernel
>>OFED, so it's probably something that's happened between 1.8.0.1 and
>>1.8.1-pre. or I guess it could be a rhel change between
>>2.6.18-128.1.14.el5 and 2.6.18-128.1.16.el5, but that seems less
>>likely. I can spin up a 2.6.18-128.1.14.el5 with b_release_1_8_1 if
>>you like...
>
>Yeah, it would be a great troubleshooting addition to see if the same
>kernel on the clients and servers with the different lustre versions
>has the same problem. This would isolate the problem either to or away
>from a problem with the difference in OFED stacks.

ok - I made a 2.6.18-128.1.14.el5 with b_release_1_8_1 and it behaves the same as 2.6.18-128.1.16.el5 with b_release_1_8_1. ie. spits out a bunch of errors on the first lustre mount.

the only changes between those rhel .14 and .16 versions look pretty unrelated to IB/lnet, so I guess that was to be expected:

* Sat Jun 27 2009 Jiri Pirko jpi...@redhat.com [2.6.18-128.1.16.el5]
- [mm] prevent panic in copy_hugetlb_page_range (Larry Woodman) [508030 507860]
* Tue Jun 23 2009 Jiri Pirko jpi...@redhat.com [2.6.18-128.1.15.el5]
- [mm] fix swap race condition in fork-gup-race patch (Andrea Arcangeli) [507297 506684]

so I guess the change is between Lustre 1.8.0.1 and b_release_1_8_1-20090712131220 somewhere. if only we had git bisect, and if only I knew how to use it, and if only I had the time to try it... :-)

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
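[for what it's worth, git bisect is less scary than it looks. here's a tiny self-contained toy - a synthetic throwaway repo, nothing to do with the Lustre tree - that finds the first commit introducing a "bug":]

```shell
set -e
# build a throwaway repo: commits 1-3 are good, commits 4-6 contain "bug"
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email you@example.com && git config user.name you
for i in 1 2 3 4 5 6; do
    if [ "$i" -lt 4 ]; then echo ok > f; else echo bug > f; fi
    echo "$i" >> log
    git add f log && git commit -qm "commit $i"
done
# bisect between bad HEAD and good HEAD~5; the test command exits 0 on good
git bisect start HEAD HEAD~5
git bisect run sh -c '! grep -q bug f'   # identifies "commit 4" as first bad
git bisect reset
```

with a real regression the "test command" would be a build-and-mount script, and the good/bad endpoints would be release tags.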
Re: [Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
On Wed, Jul 15, 2009 at 08:46:12AM -0400, Robin Humble wrote: I get a ferocious set of error messages when I mount a 1.6.7.2 filesystem on a b_release_1_8_1 client. is this expected? just to annotate the below a bit in case it's not clear... sorry - should have done that in the first email :-/ 10.8.30.244 is MGS and one MDS, 10.8.30.245 is the other MDS in the failover pair. 10.8.30.201 - 208 are OSS's (one OST per OSS), and the fs is mounted in the usual failover way eg. mount -t lustre 10.8.30@o2ib:10.8.30@o2ib:/system /system from the below (and other similar logs) it kinda looks like the client fails and then renegotiates with all the servers. cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: mgc10.8.30@o2ib: Reactivating import Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: Client system-client has started Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 ... last message repeated 17 times ... 
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096 Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5 looks like it succeeds in the end, but only after a struggle. I don't have any problems with 1.8.1 - 1.8.1 or 1.6.7.2 - 1.6.7.2. servers are rhel5 x86_64 2.6.18-92.1.26.el5 1.6.7.2 + bz18793 (group quota fix). client is rhel5 x86_64 patched 2.6.18-128.1.16.el5-b_release_1_8_1 from cvs 20090712131220 + bz18793 again. BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1? 
I'm confused about which is closest to the final 1.8.1 :-/ cheers, robin -- Dr Robin Humble, HPC Systems Analyst, NCI National Facility ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Announce: Lustre 1.8.0.1 is available!
On Mon, Jun 22, 2009 at 08:30:56PM -0700, Terry Rutledge wrote: Hi all, Lustre 1.8.0.1 is available on the Sun Download Center Site. http://www.sun.com/software/products/lustre/get.jsp the 1.8.0.1 download link on that page looks to be wrong... it should point to 1801, but it points to 180. so currently the 1.8.0.1 page is identical to the 1.8.0 page. cheers, robin The change log and release notes can be read here: http://wiki.lustre.org/index.php/Use:Change_Log_1.8 Thank you for your assistance; as always, you can report issues via Bugzilla (https://bugzilla.lustre.org/) Happy downloading! -- The Lustre Team -- ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] external journal raid1 vs. single disk ext journal + hot spare on raid6
Robin Humble, HPC Systems Analyst, NCI National Facility ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] no handle for file close
On Thu, May 07, 2009 at 10:45:31AM -0500, Nirmal Seenu wrote:
>I am getting quite a few errors similar to the following error on the
>MDS server which is running the latest 1.6.7.1 patched kernel. The
>clients are running 1.6.7 patchless client on 2.6.18-128.1.6.el5 kernel
>and this cluster has 130 nodes/Lustre clients and uses GigE network.
>
>May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(mds_open.c:1567:mds_close()) @@@ no handle for file close ino 772769: cookie 0xcfe66441310829d4 r...@8101ca8a3800 x2681218/t0 o35-fedc91f9-4de7-c789-6bdd-1de1f5e3d...@net_0x2c0a8f109_uuid:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc 0/0
>May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-116) r...@8101ca8a3800 x2681218/t0 o35-fedc91f9-4de7-c789-6bdd-1de1f5e3d...@net_0x2c0a8f109_uuid:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc -116/0
>
>I don't see the same errors on another cluster/Lustre installation with
>2000 Lustre clients which uses Infiniband network.

we see this sometimes when a job that is using a shared library that lives on Lustre is killed - presumably the un-memorymapping of the .so from a bunch of nodes at once confuses Lustre a bit.

what is your inode 772769? eg.

  find /some/lustre/fs/ -inum 772769

if the file is a .so then that would be similar to what we are seeing.

we have this listed in the "probably harmless" section of the errors that we get from Lustre, so if it's not harmless then we'd very much like to know about it :)

this cluster is IB, rhel5, x86_64, 1.6.6 on servers and patchless 1.6.4.3 on clients w/ 2.6.23.17 kernels.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

>I looked at the following bugs 19328, 18946, 18192 and 19085 but I am
>not sure if any of those bugs apply to this error. I would appreciate it
>if someone could help me understand these errors and possibly suggest
>the solution.
>
>TIA
>Nirmal
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] root on lustre and timeouts
On Thu, Apr 30, 2009 at 12:51:00PM -0400, Brian J. Murrell wrote: On Thu, 2009-04-30 at 11:48 -0400, Robin Humble wrote: BTW, as was pointed out in one talk of this years LUG, Lustre 1.8's OSS read cache should help things like root-on-Lustre because small commonly used files will likely be cached in the OSS's and won't result in disk accesses. Yes, imagine what the ROSS cache can do for 150 clients all booting (and executing the same scripts/binaries) at the same time. Imagine what the OSS disk did/does before the cache. :-) hopefully most of the frequently used parts of the OS are in page cache on clients after the first read or two, but if there are new parts accessed (or if everything boots at once) then yes, the OSS read cache should definitely help lots. currently the only load we notice from root-on-Lustre is on the MDS, but I can't say we've been actively monitoring and categorising all the traffic - we really haven't felt the need because there haven't been slowdowns to speak of - that's a good thing :) actually, just thinking about it, it'd be good if you could tell Lustre (llite) to be lazy about re-stat'ing files in what is mostly an un-changing read-only image. is it possible to do this? Certainly, I am not without bias, but the feature set of 1.8 looks compelling enough to make me want to upgrade my own little dogfood cluster here to 1.8. :-) yes, the features are shiny :) cheers, robin ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] root on lustre and timeouts
we are (happily) using read-only root-on-Lustre in production with oneSIS, but have noticed something odd...

if a root-on-Lustre client node has been up for more than 10 or 12 hours then it survives an MDS failure/failover/reboot event(*), but if the client is newly rebooted and has been up for less than this time, then it doesn't successfully reconnect after an MDS event and the node is ~dead.

by trial and error I've also found that if I rsync /lib64, /bin, and /sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches', and symlink the rest of the dirs to Lustre, then the node sails through MDS events. leaving out any one of the dirs/steps leads to a dead node. so it looks like the Lustre kernel's recovery process is somehow tied to userspace via apps in /bin and /sbin?

I can reproduce the weird 10-12hr behaviour at will by changing the clock on nodes in a toy Lustre test setup. ie.
- servers and client all have the correct time
- reboot client node
- stop ntpd everywhere
- use 'date --set ...' to set all clocks to be X hours in the future
- cause an MDS event(*)
- wait for recovery to complete
- if X = ~10 to 12 then the client will be dead

it's no big deal to put those 3 dirs into ramdisk as they're really small (and the part-on-ramdisk model is nice and flexible too), so we'll probably move to running in this way anyway, but I'm still curious as to why a kernel-only system like Lustre
a) cares about userspace at all during recovery
b) has a 10-12hr timescale :-)

changing the contents of /proc/sys/lnet/upcall to some path stat'able without Lustre being up doesn't change anything.

BTW, OSS reboot/failover is handled fine with root on Lustre, as are regular (non-root-on-Lustre) clients - this behaviour seems to be limited to the MDS/MGS failure case when all/almost-all of the OS is on Lustre.

our setup is patchless 1.6.4.3 clients, 1.6.6 servers, rhel5.2/5.3 x86_64, but the behaviour seems the same with much newer Lustre too, eg. patched b_release_1_8_0.
cheers, robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

(*) umount mdt and mgs, lustre_rmmod, wait 10 mins, mount them again

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
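The rsync-to-ramdisk workaround described above amounts to something like the sketch below. It is illustrative only, not taken from oneSIS: `copy_to_ramdisk` and both path arguments are hypothetical names, `cp -a` stands in for the rsync used in the mail, and whether the drop_caches step is really required is exactly the open question of the post.

```shell
# sketch of the part-on-ramdisk workaround described above (hypothetical
# helper; in practice ram_root would be a tmpfs mount, and /bin, /sbin and
# /lib64 would then be symlinked/bind-mounted at the copies).
copy_to_ramdisk() {
    src_root=$1    # normally "/", the Lustre-backed root
    ram_root=$2    # e.g. a tmpfs mounted at /ram
    for d in bin sbin lib64; do
        mkdir -p "$ram_root/$d"
        cp -a "$src_root/$d/." "$ram_root/$d/"
    done
    # drop cached pages so future reads come from the copies, not Lustre
    # (silently skipped when not root or not on Linux)
    { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true
}
```

after the copy, the remaining directories stay symlinked to Lustre, as in the mail.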
Re: [Lustre-discuss] tuning max_sectors
On Fri, Apr 17, 2009 at 07:25:30AM -0400, Brian J. Murrell wrote:
>On Fri, 2009-04-17 at 13:08 +0200, Götz Waschk wrote:
>>Lustre: zn_atlas-OST: underlying device cciss/c1d0p1 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=2048

we have a similar problem.
  Lustre: short-OST0001: underlying device md0 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=1280

>>What can I do?
>IIRC, that's in reference to /sys/block/$device/queue/max_sectors_kb. If you inspect that it should report 1024. You can simply echo a new value into that the way you can with /proc variables.

sadly, that sys entry doesn't exist:
  cat: /sys/block/md0/queue/max_sectors_kb: No such file or directory
do you have any other suggestions?

perhaps the devices below md need looking at? they all report /sys/block/sd*/queue/max_sectors_kb == 512. we have an md raid6 8+2.

uname -a
  Linux sox2 2.6.18-92.1.10.el5_lustre.1.6.6.fixR5 #2 SMP Wed Feb 4 16:58:30 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
(which is 1.6.6 + the patch from bz 15428 which is (I think) now in 1.6.7.1)

cat /proc/mdstat
...
md0 : active raid6 sdc[0] sdl[9] sdk[8] sdj[7] sdi[6] sdh[5] sdg[4] sdf[3] sde[2] sdd[1]
      5860595712 blocks level 6, 64k chunk, algorithm 2 [10/10] [UU]
      in: 64205147 reads, 97489370 writes; out: 3730773413 reads, 3281459807 writes
      983790 in raid5d, 498868 out of stripes, 4280451425 handle called
      reads: 0 for rmw, 709671189 for rcw. zcopy writes: 1573400576, copied writes: 20983045
      0 delayed, 0 bit delayed, 0 active, queues: 0 in, 0 out
      0 expanding overlap

cheers, robin
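If the fix is indeed to tune the sd devices underneath md, the loop is simple. A sketch - the helper name is made up, the 1024 value is just the figure the Lustre message suggests, and the sysfs root is a parameter purely so the function can be exercised against a fake directory tree rather than a live system:

```shell
# raise max_sectors_kb on every sd* disk under the given sysfs root
# (default /sys). run as root; silently skips entries it cannot write.
raise_max_sectors() {
    sysroot=${1:-/sys}
    val=${2:-1024}
    for f in "$sysroot"/block/sd*/queue/max_sectors_kb; do
        [ -w "$f" ] || continue
        echo "$val" > "$f"
    done
}
# e.g. as root: raise_max_sectors /sys 1024
```

whether md (and Lustre above it) then picks the new value up is a separate question, as the thread notes.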
Re: [Lustre-discuss] e2fsck
On Sat, Feb 21, 2009 at 04:13:49PM -0700, Andreas Dilger wrote:
>On Feb 21, 2009 01:09 -0500, Robin Humble wrote:
>>On Fri, Feb 20, 2009 at 02:10:50PM -0700, Andreas Dilger wrote:
>>>On Feb 19, 2009 20:42 -0500, Robin Humble wrote:
>>>>in 5 out of 6 e2fsck's I do after an OSS crash, I get one free blocks count wrong and often a bitmap in a group that wants to be corrected. is this normal? or is it an ldiskfs or an e2fsck bug?
>>>Do you have the MMP feature enabled?
>>no, MMP is off. there is a small chance that this is the first time the partitions have been fsck'd since MMP was turned off though - I can't be sure about that.
>That would probably be the cause - the MMP function uses a single block, and it needs to be freed by e2fsck when the feature is disabled. We should probably fix tune2fs to do this at the time MMP is turned off.

awesome diagnosis!

# e2fsck -f /dev/md0
e2fsck 1.40.11.sun1 (17-June-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
short-OST: 13/366190592 files (7.7% non-contiguous), 22998875/1464758400 blocks
# tune2fs -O ^mmp /dev/md0
tune2fs 1.40.11.sun1 (17-June-2008)
# e2fsck -f /dev/md0
e2fsck 1.40.11.sun1 (17-June-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #0 (31222, counted=31223). Fix<y>? yes
Free blocks count wrong (1441759525, counted=1441759526). Fix<y>? yes

short-OST: ***** FILE SYSTEM WAS MODIFIED *****
short-OST: 13/366190592 files (7.7% non-contiguous), 22998874/1464758400 blocks

cheers, robin
Re: [Lustre-discuss] e2fsck
On Fri, Feb 20, 2009 at 02:10:50PM -0700, Andreas Dilger wrote:
>On Feb 19, 2009 20:42 -0500, Robin Humble wrote:
>>in 5 out of 6 e2fsck's I do after an OSS crash, I get one free blocks count wrong and often a bitmap in a group that wants to be corrected. is this normal? or is it an ldiskfs or an e2fsck bug?
>Do you have the MMP feature enabled?

no, MMP is off. there is a small chance that this is the first time the partitions have been fsck'd since MMP was turned off though - I can't be sure about that.

we have MMP off because when e2fsck or tune2fs crashes (eg. out of memory, or when tune2fs goes recursively looking for journal devices that don't exist) then it makes the MMP'd partition unusable.

cheers, robin

>>rhel5 x86_64
>>e2fsprogs-1.40.11.sun1-0redhat
>>kernel-lustre-smp-2.6.18-92.1.10.el5_lustre.1.6.6
>>
>>cheers, robin
>>
>>[r...@sox2 ~]# e2fsck -f /dev/md5
>>e2fsck 1.40.11.sun1 (17-June-2008)
>>Pass 1: Checking inodes, blocks, and sizes
>>Pass 2: Checking directory structure
>>Pass 3: Checking directory connectivity
>>Pass 4: Checking reference counts
>>Pass 5: Checking group summary information
>>Block bitmap differences: -107639 Fix<y>? yes
>>Free blocks count wrong for group #3 (19179, counted=19180). Fix<y>? yes
>>Free blocks count wrong (8199819, counted=8199820). Fix<y>? yes
>>
>>system-OST0001: ***** FILE SYSTEM WAS MODIFIED *****
>>system-OST0001: 133986/3055616 files (1.2% non-contiguous), 4007188/12207008 blocks
>>
>>[r...@sox2 ~]# e2fsck -f /dev/md6
>>e2fsck 1.40.11.sun1 (17-June-2008)
>>home-OST0001: recovering journal
>>Pass 1: Checking inodes, blocks, and sizes
>>Pass 2: Checking directory structure
>>Pass 3: Checking directory connectivity
>>Pass 4: Checking reference counts
>>Pass 5: Checking group summary information
>>Free blocks count wrong for group #3 (23432, counted=23433). Fix<y>? yes
>>Free blocks count wrong (131098913, counted=131098914). Fix<y>? yes
>>
>>home-OST0001: ***** FILE SYSTEM WAS MODIFIED *****
>>home-OST0001: 26848/33513472 files (2.4% non-contiguous), 2934270/134033184 blocks
>>
>>[r...@sox2 ~]# e2fsck -f /dev/md7
>>e2fsck 1.40.11.sun1 (17-June-2008)
>>apps-OST0001: recovering journal
>>Pass 1: Checking inodes, blocks, and sizes
>>Pass 2: Checking directory structure
>>Pass 3: Checking directory connectivity
>>Pass 4: Checking reference counts
>>Pass 5: Checking group summary information
>>Free blocks count wrong for group #3 (23432, counted=23433). Fix<y>? yes
>>Free blocks count wrong (34865220, counted=34865221). Fix<y>? yes
>>
>>apps-OST0001: ***** FILE SYSTEM WAS MODIFIED *****
>>apps-OST0001: 45904/9166848 files (3.9% non-contiguous), 1794027/36659248 blocks
>>
>>[r...@sox2 ~]#
>>[r...@sox2 ~]# e2fsck -f /dev/md15
>>e2fsck 1.40.11.sun1 (17-June-2008)
>>system-OST: recovering journal
>>Pass 1: Checking inodes, blocks, and sizes
>>Pass 2: Checking directory structure
>>Pass 3: Checking directory connectivity
>>Pass 4: Checking reference counts
>>Pass 5: Checking group summary information
>>Free blocks count wrong for group #3 (20647, counted=20648). Fix<y>? yes
>>Free blocks count wrong (8115827, counted=8115828). Fix<y>? yes
>>
>>system-OST: ***** FILE SYSTEM WAS MODIFIED *****
>>system-OST: 134002/3055616 files (1.2% non-contiguous), 4091180/12207008 blocks
>>
>>[r...@sox2 ~]# e2fsck -f /dev/md16
>>e2fsck 1.40.11.sun1 (17-June-2008)
>>home-OST: recovering journal
>>Pass 1: Checking inodes, blocks, and sizes
>>Pass 2: Checking directory structure
>>Pass 3: Checking directory connectivity
>>Pass 4: Checking reference counts
>>Pass 5: Checking group summary information
>>
>>home-OST: ***** FILE SYSTEM WAS MODIFIED *****
>>home-OST: 26831/33513472 files (2.1% non-contiguous), 2951394/134033184 blocks
>>
>>[r...@sox2 ~]# e2fsck -f /dev/md17
>>e2fsck 1.40.11.sun1 (17-June-2008)
>>apps-OST: recovering journal
>>Pass 1: Checking inodes, blocks, and sizes
>>Pass 2: Checking directory structure
>>Pass 3: Checking directory connectivity
>>Pass 4: Checking reference counts
>>Pass 5: Checking group summary information
>>Free blocks count wrong for group #3 (3046, counted=3047). Fix<y>? yes
>>Free blocks count wrong (34976431, counted=34976432). Fix<y>? yes
>>
>>apps-OST: ***** FILE SYSTEM WAS MODIFIED *****
>>apps-OST: 45798/9166848 files (3.7% non-contiguous), 1682816/36659248 blocks

>Cheers, Andreas
>--
>Andreas Dilger
>Sr. Staff Engineer, Lustre Group
>Sun Microsystems of Canada, Inc.
[Lustre-discuss] e2fsck
in 5 out of 6 e2fsck's I do after an OSS crash, I get one free blocks count wrong and often a bitmap in a group that wants to be corrected. is this normal? or is it an ldiskfs or an e2fsck bug?

rhel5 x86_64
e2fsprogs-1.40.11.sun1-0redhat
kernel-lustre-smp-2.6.18-92.1.10.el5_lustre.1.6.6

cheers, robin

[r...@sox2 ~]# e2fsck -f /dev/md5
e2fsck 1.40.11.sun1 (17-June-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -107639 Fix<y>? yes
Free blocks count wrong for group #3 (19179, counted=19180). Fix<y>? yes
Free blocks count wrong (8199819, counted=8199820). Fix<y>? yes

system-OST0001: ***** FILE SYSTEM WAS MODIFIED *****
system-OST0001: 133986/3055616 files (1.2% non-contiguous), 4007188/12207008 blocks

[r...@sox2 ~]# e2fsck -f /dev/md6
e2fsck 1.40.11.sun1 (17-June-2008)
home-OST0001: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #3 (23432, counted=23433). Fix<y>? yes
Free blocks count wrong (131098913, counted=131098914). Fix<y>? yes

home-OST0001: ***** FILE SYSTEM WAS MODIFIED *****
home-OST0001: 26848/33513472 files (2.4% non-contiguous), 2934270/134033184 blocks

[r...@sox2 ~]# e2fsck -f /dev/md7
e2fsck 1.40.11.sun1 (17-June-2008)
apps-OST0001: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #3 (23432, counted=23433). Fix<y>? yes
Free blocks count wrong (34865220, counted=34865221). Fix<y>? yes

apps-OST0001: ***** FILE SYSTEM WAS MODIFIED *****
apps-OST0001: 45904/9166848 files (3.9% non-contiguous), 1794027/36659248 blocks

[r...@sox2 ~]#
[r...@sox2 ~]# e2fsck -f /dev/md15
e2fsck 1.40.11.sun1 (17-June-2008)
system-OST: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #3 (20647, counted=20648). Fix<y>? yes
Free blocks count wrong (8115827, counted=8115828). Fix<y>? yes

system-OST: ***** FILE SYSTEM WAS MODIFIED *****
system-OST: 134002/3055616 files (1.2% non-contiguous), 4091180/12207008 blocks

[r...@sox2 ~]# e2fsck -f /dev/md16
e2fsck 1.40.11.sun1 (17-June-2008)
home-OST: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

home-OST: ***** FILE SYSTEM WAS MODIFIED *****
home-OST: 26831/33513472 files (2.1% non-contiguous), 2951394/134033184 blocks

[r...@sox2 ~]# e2fsck -f /dev/md17
e2fsck 1.40.11.sun1 (17-June-2008)
apps-OST: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #3 (3046, counted=3047). Fix<y>? yes
Free blocks count wrong (34976431, counted=34976432). Fix<y>? yes

apps-OST: ***** FILE SYSTEM WAS MODIFIED *****
apps-OST: 45798/9166848 files (3.7% non-contiguous), 1682816/36659248 blocks
[Lustre-discuss] speedy server shutdown
Hi,

when shutting down our OSS's and then MDS's we often wait 330s for each set of umount's to finish eg.
  Feb 2 03:20:06 xemds2 kernel: Lustre: Mount still busy with 68 refs, waiting for 330 secs...
  Feb 2 03:20:11 xemds2 kernel: Lustre: Mount still busy with 68 refs, waiting for 325 secs...
  ...
is there a way to speed this up?

we're interested in the (perhaps unusual) case where all clients are gone because the power has failed, and the Lustre servers are running on UPS and need to be shut down ASAP. the tangible reward for a quick shutdown is that we can buy a lower capacity (cheaper) UPS if we can reliably and cleanly shutdown all the Lustre servers in 10 mins, and preferably 3 minutes. if we're tweaking timeouts to do this then hopefully we can tweak them just before the shutdown and avoid running short timeouts in normal operation.

I'm probably missing something obvious, but I have looked through a bunch of /proc/{fs/lustre,sys/lnet,sys/lustre} entries and the Operations Manual and I can't actually see where the default 330s comes from... ??? it seems to be quite repeatable for both OSS's and MDS's.

we're using Lustre 1.6.6 or 1.6.5.1 on servers and patchless 1.6.4.3 on clients with x86_64 RHEL 5.2 everywhere.

thanks for any help!

cheers, robin
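One small win, independent of where the 330s wait itself comes from: umount the targets on each server concurrently rather than serially, so the per-target waits overlap instead of adding up. A sketch - the helper name and mount points are hypothetical, and this does nothing about the length of any single wait:

```shell
# umount all given Lustre targets concurrently and wait for all of them;
# the worst case is then one 330s wait rather than one per target.
parallel_umount() {
    for m in "$@"; do
        umount "$m" &
    done
    wait    # returns once every background umount has finished
}
# e.g. on an OSS: parallel_umount /mnt/ost0 /mnt/ost1 /mnt/ost2
```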
Re: [Lustre-discuss] open() ENOENT bug
On Thu, Oct 30, 2008 at 02:05:57PM +0100, Peter Kjellstrom wrote:
>On Thursday 30 October 2008, Brian J. Murrell wrote:
>>On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
>>>we have a user with simultaneously starting fortran runs that fail about 10% of the time because Lustre sometimes returns ENOENT instead of EACCES to an open() request on a read-only file.
>>I can reproduce this on 1.6.6 as well using your reproducer.
>We have also seen this bug on our systems (reported by a user running a Fortran code). We have servers with both 1.4 (2.6.9-55.0.9.EL_lustre.1.4.11.1smp) and 1.6 (2.6.18-8.1.14.el5_lustre.1.6.4.2smp) lustre. The error is seen towards both server versions from a cluster with patchless 1.6.5.1 clients running centos-5.2.x86_64 (2.6.18-92.1.13.el5). However the error is not seen from another cluster running _patched_ 1.6.5.1 on centos-4.x86_64 (2.6.9-67.0.7.EL_lustre.1.6.5.1smp).

I dug up an old 2.6.9-67.0.7.EL_lustre.1.6.5.1 + IB kernel (who'd have thought it'd boot with a RHEL5 userland!? :-) and you are right - my openFileMinimal test case runs without problem. ie. 2.6.9 seems a lot more robust than 2.6.18 and onwards.

however, when running ~10 copies of the below fortran code with the above RHEL4 + 1.6.5.1 kernel, several of the copies of the code always die with:
  Fortran runtime error: Stale NFS file handle

      program blah
      implicit none
      integer i
      do i=1,1000
         open(3,file='file',status='old')
         close(3)
      enddo
      stop
      end

so although my cut-down C code reproducer doesn't trigger anything, it seems Lustre still has issues with the real fortran code. the user's jobs would probably run ok in this RHEL4 environment though as they don't run 10 copies at once. it's a slightly different variant of the bug as well (different error code), or maybe it's just a totally different bug.

cheers, robin

>/Peter
>>Can you file a bug in our bugzilla about it? Please include your reproducer program.
>>
>>b.
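For anyone wanting to poke at this without a fortran compiler, a rough shell analogue of the loop above: open a read-only file from several concurrent processes and count failures. The helper name and defaults are made up. On a local filesystem every open should succeed; reproducing the actual failures needs Lustre, many concurrent copies, and some luck.

```shell
# open a read-only file repeatedly from several concurrent processes and
# report how many opens failed - a crude analogue of the fortran loop above.
open_stress() {
    file=$1; procs=${2:-10}; iters=${3:-100}
    fails=$(mktemp)
    p=0
    while [ "$p" -lt "$procs" ]; do
        (
            i=0
            while [ "$i" -lt "$iters" ]; do
                # each cat is an open()+read() of the shared read-only file
                cat "$file" > /dev/null 2>/dev/null || echo fail >> "$fails"
                i=$((i+1))
            done
        ) &
        p=$((p+1))
    done
    wait
    wc -l < "$fails"
    rm -f "$fails"
}
```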
Re: [Lustre-discuss] open() ENOENT bug
On Thu, Oct 30, 2008 at 08:28:05AM -0400, Brian J. Murrell wrote:
>On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
>>we have a user with simultaneously starting fortran runs that fail about 10% of the time because Lustre sometimes returns ENOENT instead of EACCES to an open() request on a read-only file.
>I can reproduce this on 1.6.6 as well using your reproducer.

thanks for looking into it so quickly.

>Can you file a bug in our bugzilla about it? Please include your reproducer program.

https://bugzilla.lustre.org/show_bug.cgi?id=17545

cheers, robin
Re: [Lustre-discuss] raid5 patches for rhel5
On Fri, Aug 01, 2008 at 01:51:36PM -0600, Andreas Dilger wrote:
>On Aug 01, 2008 09:38 -0400, Robin Humble wrote:
>>done, and yes, performance is largely the same as RHEL4. cool!
>>
>>Version 1.03      --Sequential Output-- --Sequential Input- --Random-
>>                  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>Machine  Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>>rhel4 oss 16G:256k 84624  99 842138  92 310044  91 77675  99 491239  96 285.8  10
>>rhel5 oss 16G:256k 86085  99 827731  95 327007  97 79639 100 495487  98 456.2  18
>>
>>streaming writes are down marginally on rhel5, but seeks/s are up 50%.
>Good to know, thanks.
>>BTW - the above is with 1.6.4.3 clients.
>Is this with 1.6.5 servers or 1.6.4.3 servers?

that's with 1.6.5.1 RHEL5 servers.

>>1.6.5.1 clients still perform badly for us. eg.
>Have you tried disabling the checksums?
>  lctl set_param osc.*.checksums=0

yes, checksums were disabled.

>Note that 1.6.5 clients <-> 1.6.5 servers with checksums enabled will perform better than mixed client/server because 1.6.5 has a more efficient checksum algorithm.

>>Version 1.03      --Sequential Output-- --Sequential Input- --Random-
>>                  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>Machine  Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>>          16G:256k 77216  99 462659 100 296050  96 68100  81 648350  93 422.2  13
>>
>>which shows better streaming writes, but ~1/2 the streaming read speed :-(
>You are getting that backward... 55% of the previous write speed, 90% of the previous overwrite speed, and 130% of the previous read speed.

doh! yes, backwards... that was patchless 2.6.23 clients BTW.

>>>Note that there are also similar performance improvements for RAID-6.
>>I can't see the RAID6 patches in the tree for RHEL5... am I missing something?
>Sigh, RAID6 patches were ported to RHEL4, but not RHEL5... I've filed bug 16587 about that, but have no idea when it will be completed.

cool - thanks!

cheers, robin
Re: [Lustre-discuss] raid5 patches for rhel5
On Sun, Jul 20, 2008 at 11:08:41PM -0600, Andreas Dilger wrote:
>On Jul 18, 2008 08:39 -0400, Robin Humble wrote:
>>I notice that Lustre 1.6.5 brings with it the md layer RAID5 patches for RHEL5 kernels. thanks! :-)
>>are all the RHEL4 optimisations there, so we should get the same performance if we now move our OSS's to RHEL5?
>That is my understanding, yes.

done, and yes, performance is largely the same as RHEL4. cool!

Version 1.03      --Sequential Output-- --Sequential Input- --Random-
                  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
rhel4 oss 16G:256k 84624  99 842138  92 310044  91 77675  99 491239  96 285.8  10
rhel5 oss 16G:256k 86085  99 827731  95 327007  97 79639 100 495487  98 456.2  18

streaming writes are down marginally on rhel5, but seeks/s are up 50%.

BTW - the above is with 1.6.4.3 clients. 1.6.5.1 clients still perform badly for us. eg.

Version 1.03      --Sequential Output-- --Sequential Input- --Random-
                  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
          16G:256k 77216  99 462659 100 296050  96 68100  81 648350  93 422.2  13

which shows better streaming writes, but ~1/2 the streaming read speed :-(

>Note that there are also similar performance improvements for RAID-6.

I can't see the RAID6 patches in the tree for RHEL5... am I missing something?

cheers, robin
Re: [Lustre-discuss] 1.6.5.1 OSS crashes
On Sun, Jul 20, 2008 at 08:40:19AM -0400, Mag Gam wrote:
>I am trying to understand. What was the problem? How does SD_IOSTATS affect the crash? How did you disable this?

the comments describe the bug:
  https://bugzilla.lustre.org/show_bug.cgi?id=16404#c22
which from a quick look seems like an SMP locking issue around the statistics collection that presumably, under some circumstances, can cause an overflow and a crash.

the way to disable it is to rebuild the patched-by-Lustre RHEL kernel with the CONFIG_SD_IOSTATS option turned off.

>Sorry for a newbie question

no probs. let me know if you need a recipe for patching and rebuilding this kernel. I should really write it all down before I forget anyway... there are most likely descriptions for patching and building kernels on the Lustre wiki too.

cheers, robin

>On Sun, Jul 20, 2008 at 4:54 AM, Robin Humble [EMAIL PROTECTED] wrote:
>>On Fri, Jul 18, 2008 at 09:02:36AM -0400, Brian J. Murrell wrote:
>>>On Fri, 2008-07-18 at 05:52 -0400, Robin Humble wrote:
>>>>Hi, I'm seeing coordinated OSS crashes with Lustre 1.6.5.1. our RHEL4 OSS have been stable for ~months with these kernels:
>>>>  kernel-lustre-smp-2.6.9-67.0.4.EL_lustre.1.6.4.3
>>>>  kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.4.2
>>>>but have crashed hard, twice, about 10hrs apart as soon as we started using this kernel:
>>>>  kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1
>>>Can you try rebuilding the kernel, disabling SD_IOSTATS?
>>done. I rebuilt using the stock kernel's InfiniBand stack and
>>  # CONFIG_SD_IOSTATS is not set
>>
>>% cexec -p oss: uptime
>>oss x17: 18:45:07 up 1 day, 30 min, 1 user, load average: 4.97, 7.00, 6.27
>>oss x18: 18:45:07 up 1 day, 23 min, 1 user, load average: 4.18, 5.78, 5.71
>>oss x19: 18:45:07 up 1 day, 23 min, 1 user, load average: 5.18, 5.66, 4.60
>>
>>which is > the 10hrs it was crashing at before. good guess about the cause of the problem! :-)
>>
>>maybe that rhel4 1.6.5.1 kernel rpm needs a respin then? seems like a fairly critical issue... :-/
>>
>>cheers, robin
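The config change itself is one line in the kernel .config before rebuilding; the rpm/srpm rebuild steps around it are distro-specific and not shown. A sketch - the helper name is made up, and the edit mimics how `make oldconfig` records a disabled option:

```shell
# turn a kernel .config option off, e.g. CONFIG_SD_IOSTATS, writing the
# "# CONFIG_FOO is not set" form that kbuild expects.
disable_kconfig() {
    opt=$1; cfg=$2
    sed -i "s/^${opt}=.*/# ${opt} is not set/" "$cfg"
}
# e.g.: disable_kconfig CONFIG_SD_IOSTATS .config   # then: make oldconfig && make
```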
Re: [Lustre-discuss] 1.6.5.1 OSS crashes
On Fri, Jul 18, 2008 at 09:02:36AM -0400, Brian J. Murrell wrote:
>On Fri, 2008-07-18 at 05:52 -0400, Robin Humble wrote:
>>Hi, I'm seeing coordinated OSS crashes with Lustre 1.6.5.1. our RHEL4 OSS have been stable for ~months with these kernels:
>>  kernel-lustre-smp-2.6.9-67.0.4.EL_lustre.1.6.4.3
>>  kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.4.2
>>but have crashed hard, twice, about 10hrs apart as soon as we started using this kernel:
>>  kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1
>Can you try rebuilding the kernel, disabling SD_IOSTATS?

done. I rebuilt using the stock kernel's InfiniBand stack and
  # CONFIG_SD_IOSTATS is not set

% cexec -p oss: uptime
oss x17: 18:45:07 up 1 day, 30 min, 1 user, load average: 4.97, 7.00, 6.27
oss x18: 18:45:07 up 1 day, 23 min, 1 user, load average: 4.18, 5.78, 5.71
oss x19: 18:45:07 up 1 day, 23 min, 1 user, load average: 5.18, 5.66, 4.60

which is > the 10hrs it was crashing at before. good guess about the cause of the problem! :-)

maybe that rhel4 1.6.5.1 kernel rpm needs a respin then? seems like a fairly critical issue... :-/

cheers, robin
Re: [Lustre-discuss] Looping in __d_lookup
On Tue, May 20, 2008 at 10:24:25PM +0200, Jakob Goldbach wrote:
>>Hm, so you actually have a circular loop?
>Yes - I've asked for help on the OpenVZ list as well - Pavel Emelyanov provided me with a debug patch. This patch has now confirmed the loop in __d_lookup.

we're also seeing __d_lookup soft lockups. 4 are attached. one was a cached_lookup but the first line of that fn is a __d_lookup, which I suspect is where the real soft lockup occurred. after a while the node is toast and has to be rebooted.

kernel is 2.6.23.17 with patchless lustre 1.6.4.3, modified with patches from bz 14250 (attachment 14109) and 13378 (attachment 12276). using o2ib, x86_64, rhel5.1

cheers, robin

ps. I'm glad to see some Lustre support for modern linux kernels and would like to see a lot more! the VMs of distro kernels sometimes behave erratically enough that we can't use them on Lustre client nodes in production.

May 6 14:36:53 x6 kernel: BUG: soft lockup - CPU#2 stuck for 11s! [rsync:14200]
May 6 14:36:53 x6 kernel: CPU 2:
May 6 14:36:53 x6 kernel: Modules linked in: ext3 jbd loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery ac sg sd_mod i2c_i801 i2c_core ahci ata_piix libata ib_mthca scsi_mod i5000_edac ib_mad edac_core ib_core ehci_hcd shpchp uhci_hcd rng_core button nfs nfs_acl lockd sunrpc e1000
May 6 14:36:53 x6 kernel: Pid: 14200, comm: rsync Not tainted 2.6.23.17 #1
May 6 14:36:53 x6 kernel: RIP: 0010:[80282cef] [80282cef] __d_lookup+0xed/0x110
May 6 14:36:53 x6 kernel: RSP: 0018:81011b7ebbe8 EFLAGS: 0286
May 6 14:36:53 x6 kernel: RAX: 81011d71aea0 RBX: 81011d71aea0 RCX: 0014
May 6 14:36:53 x6 kernel: RDX: 000cb941 RSI: 81011b7ebcb8 RDI: 810120522cb8
May 6 14:36:53 x6 kernel: RBP: 8101d2707408 R08: 0007 R09: 0007
May 6 14:36:53 x6 kernel: R10: 7fffb750 R11: 802c7d3e R12: 8853dc60
May 6 14:36:53 x6 kernel: R13: 8100992a5470 R14: 81024adb3af8 R15: 810219defe80
May 6 14:36:53 x6 kernel: FS: 2b6296e0() GS:81025fc6d840() knlGS:
May 6 14:36:53 x6 kernel: CS: 0010 DS: ES: CR0: 8005003b
May 6 14:36:53 x6 kernel: CR2: 0079f018 CR3: 00018c1d CR4: 06e0
May 6 14:36:53 x6 kernel: DR0: DR1: DR2:
May 6 14:36:54 x6 kernel: DR3: DR6: 0ff0 DR7: 0400
May 6 14:36:54 x6 kernel:
May 6 14:36:54 x6 kernel: Call Trace:
May 6 14:36:54 x6 kernel: [80282cd5] __d_lookup+0xd3/0x110
May 6 14:36:54 x6 kernel: [80279b1e] do_lookup+0x2a/0x1ae
May 6 14:36:54 x6 kernel: [8027bc4d] __link_path_walk+0x924/0xde9
May 6 14:36:54 x6 kernel: [8027c16a] link_path_walk+0x58/0xe0
May 6 14:36:54 x6 kernel: [8027c536] do_path_lookup+0x1ab/0x1cf
May 6 14:36:54 x6 kernel: [8027afe4] getname+0x14c/0x191
May 6 14:36:54 x6 kernel: [8027cd67] __user_walk_fd+0x37/0x53
May 6 14:36:54 x6 kernel: [80275a7b] vfs_lstat_fd+0x18/0x47
May 6 14:36:54 x6 kernel: [80275c6d] sys_newlstat+0x19/0x31
May 6 14:36:54 x6 kernel: [8020b3ae] system_call+0x7e/0x83

May 8 09:32:36 x10 kernel: BUG: soft lockup - CPU#1 stuck for 11s! [bonnie++.mpi:31720]
May 8 09:32:36 x10 kernel: CPU 1:
May 8 09:32:36 x10 kernel: Modules linked in: loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery ac sg sd_mod i5000_edac edac_core ehci_hcd ahci rng_core ata_piix libata i2c_i801 scsi_mod i2c_core ib_mthca ib_mad shpchp uhci_hcd ib_core button nfs nfs_acl lockd sunrpc e1000
May 8 09:32:36 x10 kernel: Pid: 31720, comm: bonnie++.mpi Not tainted 2.6.23.17 #1
May 8 09:32:36 x10 kernel: RIP: 0010:[80282cef] [80282cef] __d_lookup+0xed/0x110
May 8 09:32:36 x10 kernel: RSP: 0018:81014248fd78 EFLAGS: 0286
May 8 09:32:36 x10 kernel: RAX: 8101a79cc500 RBX: 8101a79cc500 RCX: 0014
May 8 09:32:36 x10 kernel: RDX: 000ac16f RSI: 81014248feb8 RDI: 8101ff617898
May 8 09:32:36 x10 kernel: RBP: 8101ff617898 R08: R09: 81025fd3d5c0
May 8 09:32:37 x10 kernel: R10: 0001 R11: 802c7d3e R12: 8027c1e0
May 8 09:32:37 x10 kernel: R13: 81014248fea8 R14: R15: 81025fd3d5c0
May 8 09:32:37 x10 kernel: FS: 2c95a990() GS:81025fc6de40() knlGS:
May 8 09:32:37 x10 kernel: CS: 0010 DS: ES: CR0: 8005003b
May 8 09:32:37 x10 kernel: CR2:
Re: [Lustre-discuss] Looping in __d_lookup
On Wed, May 21, 2008 at 08:04:27PM -0600, Andreas Dilger wrote:
>On May 21, 2008 21:05 +0200, Jakob Goldbach wrote:
>>I'm running 1.6.4.3 patchless as well against a 2.6.18 vanilla kernel. Or at least that is what I thought. The OpenVZ patch effectively makes the kernel a 2.6.18++ kernel because they add features from newer kernels in their maintained 2.6.18 based kernel. So the lockup in __d_lookup may just relate to newer patchless clients. I got a debug patch from the OpenVZ community which indicates dcache chain corruption in a lustre code path.
>Do you have the fixes for the statahead patches, disable statahead via
>  echo 0 > /proc/fs/lustre/llite/*/statahead_max
>or can you try out the v1_6_5_RC4 tag from CVS (which also contains those patches)?

all of our __d_lookup soft lockups have been when running with 0 in /proc/fs/lustre/llite/*/statahead_max

I'll try v1_6_5_RC4 now - should be fun :-)

cheers, robin
Re: [Lustre-discuss] client randomly evicted
On Fri, May 02, 2008 at 03:16:31PM -0700, Andreas Dilger wrote:
>On Apr 30, 2008 11:40 -0400, Aaron Knister wrote:
>>Some more information that might be helpful. There is a particular code that one of our users runs. Personally after the trouble this code has caused us we'd like to hand him a calculator and disable his accounts but sadly that's not an option. Since the time of the hang, there is what seems to be one process associated with lustre that is running as the userid of the problem user - ll_sa_15530. A trace of this process in its current state shows this -
>>Is this a problem with the lustre readahead code? If so would this fix it?
>>  echo 0 > /proc/fs/lustre/llite/*/statahead_count
>Yes, this appears to be a statahead problem. There were fixes added to 1.6.5 that should resolve the problems seen with statahead. In the meantime I'd recommend disabling it as you suggest above.

we're seeing the same problem. I think the workaround should be:
  echo 0 > /proc/fs/lustre/llite/*/statahead_max
?? /proc/fs/lustre/llite/*/statahead_count is -r--r--r--

cheers, robin

ps. sorry I've been too busy this week to look at the llite_lloop stuff.
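Applying the statahead_max workaround on a client boils down to the loop below (and across a cluster it could be pushed out with pdsh or similar). The helper name is made up, and the proc root is a parameter only so the snippet can be tried against a fake directory tree instead of a live client:

```shell
# write 0 into statahead_max for every mounted Lustre filesystem on this
# client, disabling statahead. run as root; skips unwritable entries.
disable_statahead() {
    procroot=${1:-/proc/fs/lustre/llite}
    for f in "$procroot"/*/statahead_max; do
        [ -w "$f" ] || continue
        echo 0 > "$f"
    done
}
# as root on a client: disable_statahead
```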
Re: [Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)
On Fri, May 02, 2008 at 11:25:03AM -0700, Andreas Dilger wrote:
>On Apr 29, 2008 09:05 -0700, Kilian CAVALOTTI wrote:
>>/scratch # swapon -a ./swapfile
>>swapon: ./swapfile: Invalid argument
>Note that in 1.6.4+ there is an interface to export a block device more directly from Lustre instead of using the loopback driver on top of the client filesystem. This is the llite_loop module and is configured like:
>  lctl blockdev_attach {loopback_filename} {blockdev_filename}
>where {loopback_filename} is the file that should be turned into a block device (it can be sparse if desired) and {blockdev_filename} is the full filename that the new block device should be created at.

cool! I was wondering what that module was for.

I'm trying to use it like:
  lctl blockdev_attach /dev/lloop0 /some/file/on/lustre
but then dd to /dev/lloop0 seems to go at ~kB/s whereas dd to the file goes at >= 100MB/s.
am I doing something wrong?

cheers, robin

>To clean up the device use:
>  lctl blockdev_detach {blockdev_filename}
>Note that using this block device for swap hasn't been very successful in our testing so far, but we also haven't done a great deal of real world testing - only allocate a ton of RAM and dirty it all as fast as possible, which isn't a very realistic usage. Feedback would be welcome.
>
>Cheers, Andreas
>--
>Andreas Dilger
>Sr. Staff Engineer, Lustre Group
>Sun Microsystems of Canada, Inc.
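Putting Andreas's description together, the ordering for swap on a Lustre-backed device would be attach, then mkswap, then swapon, with detach after swapoff. The helper below only prints the command sequence rather than running it - nothing here has been tested against a real llite_loop setup, and the helper name and example paths are hypothetical:

```shell
# print (not run) the command sequence for swap on a Lustre-backed block
# device, per the lctl blockdev_attach/blockdev_detach syntax quoted above.
lloop_swap_cmds() {
    lustre_file=$1   # file on Lustre to back the device (may be sparse)
    blockdev=$2      # block device node to create, e.g. /dev/lloop0
    echo "lctl blockdev_attach $lustre_file $blockdev"
    echo "mkswap $blockdev"
    echo "swapon $blockdev"
    echo "# later: swapoff $blockdev && lctl blockdev_detach $blockdev"
}
```

note the argument order: the Lustre file comes first, the device node second, per the `{loopback_filename} {blockdev_filename}` syntax above.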