Hi,

We are using openafs-server 1.6.15 on Ubuntu 16.04.
Our cell consists of 9 large storage servers, each running MD-RAID6, for a total of 550TB of storage. We occasionally see strange behaviour when a partition approaches about 95% full. For example, on one server we have a 60TB partition where corruption starts occurring once usage reaches about 57TB, with 3TB still available. Some volumes on this partition become corrupted in that their RO volume ID changes to an invalid value.

For example vos examine p.xxx.001 gave us the following prior to the corruption:
p.xxx.001 536870981 RW 3459 K On-line
storage1.aaa.com /vicepa
RWrite 536870981 ROnly 536870982 Backup

Note how the ROnly volume id equals the RWrite volume id plus 1.

But as the partition filled up with other volumes, the VLDB entry for this volume showed the following for 'vos examine p.xxx.001':
p.xxx.001 536870981 RW 3459 K On-line
storage1.aaa.com /vicepa
RWrite 536870981 ROnly 536154372 Backup

Note that the ROnly volume ID has changed to 536154372, which is actually the RW site of another volume on that partition. Salvaging the volume does not fix this. The only way we have found to correct the issue is to copy the RW volume's data out to another partition and zap the volume.
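Since the healthy pattern here is ROnly = RWrite + 1, one cheap way to sweep for affected volumes is to parse 'vos examine' output and flag any RO ID that breaks that relationship. Below is a minimal Python sketch; note that RW+1 is only the conventional allocation when the IDs are assigned together, not a hard guarantee, so treat any flagged volume as a candidate to inspect rather than proof of corruption:

```python
import re

def ro_id_matches_rw(vos_output):
    """Return True if the ROnly ID in 'vos examine'-style output equals
    the RWrite ID plus 1, False if not, or None if no IDs were found.
    (RW+1 is the conventional allocation, not a guarantee.)"""
    m = re.search(r"RWrite\s+(\d+)\s+ROnly\s+(\d+)", vos_output)
    if not m:
        return None
    rw_id, ro_id = int(m.group(1)), int(m.group(2))
    return ro_id == rw_id + 1

# Sample output resembling the healthy and corrupted cases above.
healthy = ("p.xxx.001 536870981 RW 3459 K On-line\n"
           "storage1.aaa.com /vicepa\n"
           "RWrite 536870981 ROnly 536870982 Backup")
corrupt = ("p.xxx.001 536870981 RW 3459 K On-line\n"
           "storage1.aaa.com /vicepa\n"
           "RWrite 536870981 ROnly 536154372 Backup")

print(ro_id_matches_rw(healthy))  # True
print(ro_id_matches_rw(corrupt))  # False
```

In practice you would feed this the captured output of 'vos examine' for each volume on the suspect partition and inspect any volume that returns False.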

But the real question is: why is this happening? And why does it only happen once partition usage exceeds about 95%, even though the partition never has less than 3TB available?

Has anyone else encountered something like this? Does anyone have suggestions on where to look, whether a configuration issue or something we might be doing wrong that could be causing this?

Sincerely,

Pommm

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
