Hi,

We are using openafs-server 1.6.15 on Ubuntu 16.04.
Our cell consists of 9 large storage servers, each running MD-RAID6, for a total of 550TB of storage. We occasionally see strange behaviour when a partition approaches about 95% full. For example, on one server we have a 60TB partition where corruption starts occurring once usage reaches about 57TB, with 3TB still available. Some volumes on this partition become corrupted in that their RO volume ID changes to an invalid value.

For example vos examine p.xxx.001 gave us the following prior to the corruption:
p.xxx.001 536870981 RW 3459 K On-line
storage1.aaa.com /vicepa
RWrite 536870981 ROnly 536870982 Backup

Note how the ROnly volume id equals the RWrite volume id plus 1.

But as the partition filled up with other volumes, the VLDB entry for this volume showed the following for 'vos examine p.xxx.001':
p.xxx.001 536870981 RW 3459 K On-line
storage1.aaa.com /vicepa
RWrite 536870981 ROnly 536154372 Backup

Note that the ROnly volume ID has changed to 536154372, which is actually the RW site of another volume on that partition. Salvaging the volume does not fix this. The only way we have found to correct the issue is to copy the RW volume's data out to another partition and zap the volume.
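Since the healthy pattern here is ROnly = RWrite + 1, one cheap way to sweep for affected volumes is to parse 'vos examine' output and flag any RO ID that breaks that relationship. Below is a minimal Python sketch; note that RW+1 is only the conventional allocation when the IDs are assigned together, not a hard guarantee, so treat any flagged volume as a candidate to inspect rather than proof of corruption:

```python
import re

def ro_id_matches_rw(vos_output):
    """Return True if the ROnly ID in 'vos examine'-style output equals
    the RWrite ID plus 1, False if not, or None if no IDs were found.
    (RW+1 is the conventional allocation, not a guarantee.)"""
    m = re.search(r"RWrite\s+(\d+)\s+ROnly\s+(\d+)", vos_output)
    if not m:
        return None
    rw_id, ro_id = int(m.group(1)), int(m.group(2))
    return ro_id == rw_id + 1

# Sample output resembling the healthy and corrupted cases above.
healthy = ("p.xxx.001 536870981 RW 3459 K On-line\n"
           "storage1.aaa.com /vicepa\n"
           "RWrite 536870981 ROnly 536870982 Backup")
corrupt = ("p.xxx.001 536870981 RW 3459 K On-line\n"
           "storage1.aaa.com /vicepa\n"
           "RWrite 536870981 ROnly 536154372 Backup")

print(ro_id_matches_rw(healthy))  # True
print(ro_id_matches_rw(corrupt))  # False
```

In practice you would feed this the captured output of 'vos examine' for each volume on the suspect partition and inspect any volume that returns False.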

But the real question is: why is this happening? And why does it only happen once partition usage exceeds about 95%, even though the partition never has less than 3TB available?

Has anyone else encountered something like this? Does anyone have suggestions on where to look, whether a configuration issue or something we might be doing wrong that could be causing this?

Sincerely,

Pommm

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
