Thank you, Chad, for answering. We are using the patched kernel on the MDT/OSS. The problem is in the group space quota. In any case, I enabled project quota only for future purposes; there are no defined projects. Do you think it can still pose a problem?
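For completeness, this is how I verify which quota types the servers are actually enforcing (standard lctl/lfs commands from the Lustre 2.10 manual; /storage is our client mount point):

# on the MDS/OSS: shows the quota types the quota slave enforces (u/g/p)
lctl get_param osd-ldiskfs.*.quota_slave.info

# on a client: shows the project ID on the root directory (0 = none assigned)
lfs project -d /storage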
Best,
David

On Wed, Aug 26, 2020 at 3:18 PM Chad DeWitt <ccdew...@uncc.edu> wrote:

> Hi David,
>
> Hope you're doing well.
>
> This is a total shot in the dark, but depending on the kernel version you
> are running, you may need a patched kernel to use project quotas. I'm not
> sure what the symptoms would be, but it may be worth turning off project
> quotas and seeing if doing so resolves your issue:
>
> lctl conf_param technion.quota.mdt=none
> lctl conf_param technion.quota.mdt=ug
> lctl conf_param technion.quota.ost=none
> lctl conf_param technion.quota.ost=ug
>
> (Looks like you have been running project quota on your MDT for a while
> without issue, so this may be a dead end.)
>
> Here's more info concerning when a patched kernel is necessary for
> project quotas (25.2. Enabling Disk Quotas):
>
> http://doc.lustre.org/lustre_manual.xhtml
>
> Cheers,
> Chad
>
> ------------------------------------------------------------
> Chad DeWitt, CISSP | University Research Computing
> UNC Charlotte | Office of OneIT
> ccdew...@uncc.edu
> ------------------------------------------------------------
>
> On Tue, Aug 25, 2020 at 3:04 AM David Cohen <cda...@physics.technion.ac.il> wrote:
>
>> Hi,
>> Still hoping for a reply...
>>
>> It seems to me that old groups are more affected by the issue than new
>> ones that were created after a major disk migration.
>> It seems that the quota enforcement is somehow based on a counter other
>> than the accounting, since the accounting produces the same numbers as du.
>> So if quota is calculated separately from accounting, it is possible that
>> quota is broken and keeps values from removed disks, while accounting is
>> correct.
>> Following that suspicion, I tried to force the FS to recalculate the quota.
>> I tried:
>> lctl conf_param technion.quota.ost=none
>> and back to:
>> lctl conf_param technion.quota.ost=ugp
>>
>> I tried running on the MDS and all the OSTs:
>> tune2fs -O ^quota
>> and on again:
>> tune2fs -O quota
>> and after each attempt, also:
>> lctl lfsck_start -A -t all -o -e continue
>>
>> But the problem still persists, and groups under their quota usage get
>> blocked with "quota exceeded".
>>
>> Best,
>> David
>>
>> On Sun, Aug 16, 2020 at 8:41 AM David Cohen <cda...@physics.technion.ac.il> wrote:
>>
>>> Hi,
>>> Adding some more information.
>>> A few months ago the data on the Lustre fs was migrated to new physical
>>> storage.
>>> After the successful migration, the old OSTs were marked as inactive
>>> (lctl conf_param technion-OST0001.osc.active=0).
>>>
>>> Since then, all the clients have been unmounted and remounted, and
>>> tunefs.lustre --writeconf was executed on the MGS/MDT and all the OSTs.
>>> lctl dl no longer shows the old OSTs, but they still appear when
>>> querying the quota.
>>> As I see that new users are less affected by the "quota exceeded"
>>> problem (blocked from writing while the quota is not filled),
>>> I suspect that the quota calculation is still summing values from the
>>> old OSTs:
>>>
>>> lfs quota -g -v md_kaplan /storage/
>>> Disk quotas for grp md_kaplan (gid 10028):
>>>      Filesystem      kbytes   quota       limit   grace   files   quota   limit   grace
>>>       /storage/  4823987000       0  5368709120       -  143596       0       0       -
>>> technion-MDT0000_UUID
>>>                      37028       -           0       -  143596       -       0       -
>>> quotactl ost0 failed.
>>> quotactl ost1 failed.
>>> quotactl ost2 failed.
>>> quotactl ost3 failed.
>>> quotactl ost4 failed.
>>> quotactl ost5 failed.
>>> quotactl ost6 failed.
>>> quotactl ost7 failed.
>>> quotactl ost8 failed.
>>> quotactl ost9 failed.
>>> quotactl ost10 failed.
>>> quotactl ost11 failed.
>>> quotactl ost12 failed.
>>> quotactl ost13 failed.
>>> quotactl ost14 failed.
>>> quotactl ost15 failed.
>>> quotactl ost16 failed.
>>> quotactl ost17 failed.
>>> quotactl ost18 failed.
>>> quotactl ost19 failed.
>>> quotactl ost20 failed.
>>> technion-OST0015_UUID
>>>                 114429464*      -  114429464       -       -       -       -       -
>>> technion-OST0016_UUID
>>>                  92938588       -   92938592       -       -       -       -       -
>>> technion-OST0017_UUID
>>>                 128496468*      -  128496468       -       -       -       -       -
>>> technion-OST0018_UUID
>>>                 191478704*      -  191478704       -       -       -       -       -
>>> technion-OST0019_UUID
>>>                 107720552       -  107720560       -       -       -       -       -
>>> technion-OST001a_UUID
>>>                 165631952*      -  165631952       -       -       -       -       -
>>> technion-OST001b_UUID
>>>                 460714156*      -  460714156       -       -       -       -       -
>>> technion-OST001c_UUID
>>>                 157182900*      -  157182900       -       -       -       -       -
>>> technion-OST001d_UUID
>>>                 102945952*      -  102945952       -       -       -       -       -
>>> technion-OST001e_UUID
>>>                 175840980*      -  175840980       -       -       -       -       -
>>> technion-OST001f_UUID
>>>                 142666872*      -  142666872       -       -       -       -       -
>>> technion-OST0020_UUID
>>>                 188147548*      -  188147548       -       -       -       -       -
>>> technion-OST0021_UUID
>>>                 125914240*      -  125914240       -       -       -       -       -
>>> technion-OST0022_UUID
>>>                 186390800*      -  186390800       -       -       -       -       -
>>> technion-OST0023_UUID
>>>                 115386876       -  115386884       -       -       -       -       -
>>> technion-OST0024_UUID
>>>                 127139556*      -  127139556       -       -       -       -       -
>>> technion-OST0025_UUID
>>>                 179666580*      -  179666580       -       -       -       -       -
>>> technion-OST0026_UUID
>>>                 147837348       -  147837356       -       -       -       -       -
>>> technion-OST0027_UUID
>>>                 129823528       -  129823536       -       -       -       -       -
>>> technion-OST0028_UUID
>>>                 158270776       -  158270784       -       -       -       -       -
>>> technion-OST0029_UUID
>>>                 168762120       -  168763104       -       -       -       -       -
>>> technion-OST002a_UUID
>>>                 164235684       -  164235688       -       -       -       -       -
>>> technion-OST002b_UUID
>>>                 147512200       -  147512204       -       -       -       -       -
>>> technion-OST002c_UUID
>>>                 158046652       -  158046668       -       -       -       -       -
>>> technion-OST002d_UUID
>>>                 199314048*      -  199314048       -       -       -       -       -
>>> technion-OST002e_UUID
>>>                 209187196*      -  209187196       -       -       -       -       -
>>> technion-OST002f_UUID
>>>                 162586732       -  162586764       -       -       -       -       -
>>> technion-OST0030_UUID
>>>                 131248812*      -  131248812       -       -       -       -       -
>>> technion-OST0031_UUID
>>>                 134665176*      -  134665176       -       -       -       -       -
>>> technion-OST0032_UUID
>>>                 149767512*      -  149767512       -       -       -       -       -
>>> Total allocated inode limit: 0, total allocated block limit: 4823951056
>>> Some errors happened when getting quota info. Some devices may be not
>>> working or deactivated. The data in "[]" is inaccurate.
>>>
>>> lfs quota -g -h md_kaplan /storage/
>>> Disk quotas for grp md_kaplan (gid 10028):
>>>      Filesystem    used   quota   limit   grace   files   quota   limit   grace
>>>       /storage/  4.493T      0k      5T       -  143596       0       0       -
>>>
>>> On Tue, Aug 11, 2020 at 7:35 AM David Cohen <cda...@physics.technion.ac.il> wrote:
>>>
>>>> Hi,
>>>> I'm running Lustre 2.10.5 on the OSS and MDS, and 2.10.7 on the clients.
>>>> While inode quota on the MDT has worked fine for a while now:
>>>> lctl conf_param technion.quota.mdt=ugp
>>>> when, a few days ago, I turned on quota on the OSTs:
>>>> lctl conf_param technion.quota.ost=ugp
>>>> users started getting "Disk quota exceeded" error messages while their
>>>> quota is not filled.
>>>>
>>>> Actions taken:
>>>> A full e2fsck -f -y on the whole file system, MDT and OSTs.
>>>> lctl lfsck_start -A -t all -o -e continue
>>>> Turning quota to none and back.
>>>>
>>>> None of the above solved the problem.
>>>>
>>>> lctl lfsck_query
>>>>
>>>> layout_mdts_init: 0
>>>> layout_mdts_scanning-phase1: 0
>>>> layout_mdts_scanning-phase2: 0
>>>> layout_mdts_completed: 0
>>>> layout_mdts_failed: 0
>>>> layout_mdts_stopped: 0
>>>> layout_mdts_paused: 0
>>>> layout_mdts_crashed: 0
>>>> layout_mdts_partial: 1   # is that normal output?
>>>> layout_mdts_co-failed: 0
>>>> layout_mdts_co-stopped: 0
>>>> layout_mdts_co-paused: 0
>>>> layout_mdts_unknown: 0
>>>> layout_osts_init: 0
>>>> layout_osts_scanning-phase1: 0
>>>> layout_osts_scanning-phase2: 0
>>>> layout_osts_completed: 30
>>>> layout_osts_failed: 0
>>>> layout_osts_stopped: 0
>>>> layout_osts_paused: 0
>>>> layout_osts_crashed: 0
>>>> layout_osts_partial: 0
>>>> layout_osts_co-failed: 0
>>>> layout_osts_co-stopped: 0
>>>> layout_osts_co-paused: 0
>>>> layout_osts_unknown: 0
>>>> layout_repaired: 15
>>>> namespace_mdts_init: 0
>>>> namespace_mdts_scanning-phase1: 0
>>>> namespace_mdts_scanning-phase2: 0
>>>> namespace_mdts_completed: 1
>>>> namespace_mdts_failed: 0
>>>> namespace_mdts_stopped: 0
>>>> namespace_mdts_paused: 0
>>>> namespace_mdts_crashed: 0
>>>> namespace_mdts_partial: 0
>>>> namespace_mdts_co-failed: 0
>>>> namespace_mdts_co-stopped: 0
>>>> namespace_mdts_co-paused: 0
>>>> namespace_mdts_unknown: 0
>>>> namespace_osts_init: 0
>>>> namespace_osts_scanning-phase1: 0
>>>> namespace_osts_scanning-phase2: 0
>>>> namespace_osts_completed: 0
>>>> namespace_osts_failed: 0
>>>> namespace_osts_stopped: 0
>>>> namespace_osts_paused: 0
>>>> namespace_osts_crashed: 0
>>>> namespace_osts_partial: 0
>>>> namespace_osts_co-failed: 0
>>>> namespace_osts_co-stopped: 0
>>>> namespace_osts_co-paused: 0
>>>> namespace_osts_unknown: 0
>>>> namespace_repaired: 99
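P.S. For anyone hitting the same symptoms: the quickest way I know to show the mismatch is to put the two numbers side by side. A minimal sketch, assuming the client mount is /storage; the group directory below is only an example, point du at wherever the group's files actually live:

lfs quota -g md_kaplan /storage    # usage/limits as the quota code sees them
du -sk /storage/md_kaplan          # what the files actually occupy, in KB

If du and the lfs accounting agree but writes are still refused, the stale per-OST limits left over from the removed OSTs remain the prime suspect.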
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org