Hi,
I'm not sure if this could be related to the following issue:
http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2025-January/019372.html
It appears to involve a similar Lustre version, quota-related issues,
and MDT instability.
In the referenced post, they reported that disabling quotas stabilized
the MDS for about a month.
On 12/03/2025 at 22:22, Fredrik Nyström via lustre-discuss wrote:
Hi,
We had some similar problems in Sep-Oct 2024 running Lustre 2.15.5.
Limits on individual OSTs stop increasing, leading to writes becoming
slower and slower.
Check for "DQACQ failed" in /var/log/messages on Lustre servers.
Example, lots of lines like these for all OSTs:
2024-09-04T12:57:15.725917+02:00 oss170 kernel: LustreError:
1059853:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -3,
flags:0x1 qsd:rossby27-OST0003 qtype:grp id:8517 enforced:1 granted: 285682
pending:149320 waiting:13252405 req:1 usage: 85560 qunit:262144 qtune:65536
edquot:0 default:no
2024-09-04T12:57:15.726112+02:00 oss170 kernel: LustreError:
1059853:0:(qsd_handler.c:787:qsd_op_begin0()) $$$ ID isn't enforced on master, it
probably due to a legeal race, if this message is showing up constantly, there
could be some inconsistence between master & slave, and quota reintegration
needs be re-triggered. qsd:rossby27-OST0003 qtype:grp id:8517 enforced:1 granted:
285682 pending:149320 waiting:12591294 req:0 usage: 85560 qunit:262144 qtune:65536
edquot:0 default:no
2024-09-04T12:57:15.726138+02:00 oss170 kernel: LustreError:
1059853:0:(qsd_handler.c:787:qsd_op_begin0()) Skipped 20 previous similar
messages
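To see whether these failures hit all OSTs or are confined to one target, it may help to tally the errors per quota slave. A minimal sketch (the function name is mine; the log format is assumed to match the /var/log/messages excerpt above):

```shell
# Tally "DQACQ failed" errors per quota slave target (qsd:...),
# most-affected target first.
dqacq_failures_by_target() {
  grep 'DQACQ failed' | grep -o 'qsd:[^ ]*' | sort | uniq -c | sort -rn
}
```

Run it on each Lustre server, e.g. `dqacq_failures_by_target < /var/log/messages`.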
If I remember correctly, the group quota problems only affected a single
group. Things were OK after a restart of the Lustre servers, but
unmounting the MDT triggered a kernel panic.
Kind Regards / Fredrik Nyström, NSC
On 2025-03-11 16:12, Robert Pennington wrote:
Hello,
We're using Lustre 2.15.3 and have a strange problem with our attempt to impose
quotas. Any assistance would be helpful.
We've attempted to impose a group quota on one user; however, many of
our OSTs (ignore the ones where quotactl failed) simply don't receive the
quota information.
The OSTs showing limit=0k below have the following configuration, while
nodes like OST000f have correctly updating information for limit_group:
$ sudo lctl get_param osd-*.*.quota_slave_dt.info
osd-ldiskfs.lustre-OST000e.quota_slave_dt.info=
target name: lustre-OST000e
pool ID: 0
type: dt
quota enabled: ugp
conn to master: setup
space acct: ugp
user uptodate: glb[0],slv[0],reint[0]
group uptodate: glb[0],slv[0],reint[0]
project uptodate: glb[0],slv[0],reint[0]
$ sudo lctl get_param osd-*.*.quota_slave_dt.limit_group
osd-ldiskfs.lustre-OST000e.quota_slave_dt.limit_group=
global_index_copy:
- id: 0
limits: { hard: 0, soft: 0, granted: 0, time: 0 }
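The `glb[0],slv[0],reint[0]` flags above suggest the global and slave quota indexes on that target are not up to date. A small sketch to pull out the stale targets from the same `lctl get_param` output (function name is mine; field layout assumed as in the excerpt above):

```shell
# Print targets whose group quota index is not up to date, from
# "lctl get_param osd-*.*.quota_slave_dt.info" output.
stale_group_targets() {
  awk '/target name:/ { tgt = $NF }
       /group uptodate:/ && /\[0\]/ { print tgt }'
}
```

Usage: `lctl get_param osd-*.*.quota_slave_dt.info | stale_group_targets`. For targets it reports, the "quota reintegration needs be re-triggered" error quoted earlier suggests forcing a reintegration; some Lustre versions expose a `force_reint` quota_slave parameter for this (verify with `lctl list_param` before relying on it).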
…
# lfs quota -vhg 4055 /mnt/lustre/
Disk quotas for grp 4055 (gid 4055):
Filesystem used quota limit grace files quota limit grace
/mnt/lustre/ 2.694T* 0k 716.8G - 145961 0 0 -
lustre-MDT0000_UUID
76.81M - 1.075G - 145951 - 0 -
lustre-MDT0001_UUID
40k* - 40k - 10 - 0 -
quotactl ost0 failed.
lustre-OST0001_UUID
16.09G - 0k - - - - -
lustre-OST0002_UUID
251.7G - 0k - - - - -
quotactl ost3 failed.
quotactl ost4 failed.
quotactl ost5 failed.
quotactl ost6 failed.
lustre-OST0007_UUID
0k - 0k - - - - -
lustre-OST0008_UUID
525.1M* - 525.1M - - - - -
lustre-OST0009_UUID
540.7M* - 540.7M - - - - -
lustre-OST000a_UUID
385.8M* - 385.8M - - - - -
quotactl ost11 failed.
quotactl ost12 failed.
lustre-OST000d_UUID
191.9G - 0k - - - - -
lustre-OST000e_UUID
258.9G - 0k - - - - -
lustre-OST000f_UUID
86.99G - 87.99G - - - - -
lustre-OST0010_UUID
255.3G - 256.3G - - - - -
lustre-OST0011_UUID
254.1G - 0k - - - - -
lustre-OST0012_UUID
241.6G - 0k - - - - -
lustre-OST0013_UUID
241.6G - 0k - - - - -
lustre-OST0014_UUID
241.9G - 0k - - - - -
lustre-OST0015_UUID
237.4G - 0k - - - - -
lustre-OST0016_UUID
241.8G - 0k - - - - -
lustre-OST0017_UUID
237.8G - 0k - - - - -
lustre-OST0018_UUID
344.2M - 0k - - - - -
Total allocated inode limit: 0, total allocated block limit: 345.7G
Some errors happened when getting quota info. Some devices may be not working or
deactivated. The data in "[]" is inaccurate.
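To list exactly which OSTs never received the limit, the `lfs quota -vhg` output above can be filtered for targets still reporting a 0k block limit. A rough sketch (function name is mine; the per-OST two-line layout is assumed to match the excerpt, and may vary slightly between Lustre versions):

```shell
# From "lfs quota -vhg <gid> <mnt>" output, print OST UUIDs whose
# granted block limit (third column of the values line) is still 0k.
zero_limit_osts() {
  awk '/OST.*_UUID/ { uuid = $1; getline; if ($3 == "0k") print uuid }'
}
```

Usage: `lfs quota -vhg 4055 /mnt/lustre/ | zero_limit_osts`.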
…
Thank you for your time.
Sincerely,
Robert Pennington, PhD
Tuebingen AI Center, Universitaet Tuebingen
Maria von Linden Str. 6
72076 Tuebingen
Germany
Office number: 10-30/A15
--
Jose Manuel Martínez García
Systems Coordinator
Supercomputación de Castilla y León
Tel: 987 293 174
Edificio CRAI-TIC, Campus de Vegazana, s/n Universidad de León - 24071
León, Spain
<https://www.scayle.es/>
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org