Re: [Gluster-devel] fallocate behavior in glusterfs

Ravishankar N Tue, 02 Jul 2019 21:44:05 -0700


On 02/07/19 8:52 PM, FNU Raghavendra Manjunath wrote:

Hi All,
In glusterfs, there is an issue regarding the fallocate behavior. In short, if someone does fallocate from the mount point with a size that is greater than the available size in the backend filesystem where the file is present, then fallocate can allocate a subset of the required number of blocks and then fail in the backend filesystem with an ENOSPC error.
The behavior of fallocate in itself is similar to how it would have been on a disk filesystem (at least xfs, where it was checked), i.e. it allocates a subset of the required number of blocks and then fails with ENOSPC. The file itself would show the number of blocks in stat to be whatever was allocated as part of fallocate. Please refer to [1], where the issue is explained.
Now, there is one small difference in behavior between glusterfs and xfs. In xfs, after fallocate fails, doing 'stat' on the file shows the number of blocks that have been allocated. Whereas in glusterfs, the number of blocks is shown as zero, which makes tools like "du" show zero consumption. This difference in glusterfs is because of how libglusterfs handles sparse files etc. when calculating the number of blocks (mentioned in [1]).
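To make the comparison concrete, here is a minimal Python sketch of the two numbers that 'stat' and "du" work from: the apparent size (st_size) and the allocated size derived from st_blocks. The helper name apparent_and_allocated is made up for illustration, and the 512-byte unit for st_blocks is the Linux convention, assumed here.

```python
import os


def apparent_and_allocated(path):
    """Return (apparent_bytes, allocated_bytes) for a file.

    apparent_bytes is st_size, the length that 'ls -l' reports.
    allocated_bytes is st_blocks * 512, roughly what "du" reports.
    Assumption: st_blocks counts 512-byte units, as on Linux.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```

Running this against the same file on the brick and on the mount point after a failed fallocate should show the difference described above: a non-zero allocated size on the brick and zero through the mount.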
At this point I can think of 3 ways to handle this.
1) Except for how many blocks are shown in the stat output for the file from the mount point (on which fallocate was done), the remaining behavior of attempting to allocate the requested size and failing when the filesystem becomes full is similar to that of XFS.
Hence, what is required is to come up with a solution for how libglusterfs calculates blocks for sparse files etc. (without breaking any of the existing components and features). This makes the behavior similar to that of the backend filesystem. It might take some time to fix the libglusterfs logic without impacting anything else.

I think we should just revert the commit b1a5fa55695f497952264e35a9c8eb2bbf1ec4c3 (BZ 817343) and see if it really breaks anything (or check whether whatever it breaks is something that we can live with). XFS speculative preallocation is not permanent and the extra space is freed up eventually. It can be sped up via a procfs tunable:
http://xfs.org/index.php/XFS_FAQ#Q:_How_can_I_speed_up_or_avoid_delayed_removal_of_speculative_preallocation.3F
We could also tune the allocsize option to a low value like 4k so that glusterfs quota is not affected.

FWIW, ENOSPC is not the only fallocate problem in gluster caused by the 'iatt->ia_block' tweaking. It also breaks the --keep-size option (i.e. the FALLOC_FL_KEEP_SIZE flag in fallocate(2)) and reports incorrect du size.


Regards,
Ravi

OR
2) Once the fallocate fails in the backend filesystem, make the posix xlator in the brick truncate the file to the size it had before fallocate was attempted. A patch [2] has been sent for this. But there is an issue with this when there are parallel writes and fallocate operations happening on the same file. It can lead to data loss.
 a) statpre is obtained ===> before fallocate is attempted, get the stat and hence the size of the file
 b) A parallel write fop on the same file that extends the file is successful
 c) Fallocate fails
 d) ftruncate truncates the file to the size given by statpre (i.e. the size obtained in step a), which discards the data written in step b
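The interleaving in steps a) to d) can be replayed deterministically on a scratch file. This is only a sketch of the race, not the code from patch [2]: the function name is hypothetical, the "parallel" write is issued inline at the point where it would land, and the fallocate failure is simulated rather than provoked.

```python
import os
import tempfile


def replay_rollback_race():
    """Replay steps a) to d) in a fixed order on a temporary file.

    Returns (size_at_statpre, size_after_write, size_after_rollback).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"A" * 4096)

        # a) statpre: remember the size before attempting fallocate
        statpre_size = os.fstat(fd).st_size

        # b) a "parallel" write extends the file and succeeds
        os.pwrite(fd, b"B" * 4096, statpre_size)
        size_after_write = os.fstat(fd).st_size

        # c) fallocate fails with ENOSPC (simulated, no real call made)
        fallocate_failed = True

        # d) roll back to the size recorded in step a)
        if fallocate_failed:
            os.ftruncate(fd, statpre_size)

        return statpre_size, size_after_write, os.fstat(fd).st_size
    finally:
        os.close(fd)
        os.unlink(path)
```

The file grows from 4096 to 8192 bytes in step b) and is cut back to 4096 in step d), so the 4096 bytes from the successful write are gone even though the writer was told the write succeeded.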
OR
3) Make posix check for available disk space before doing fallocate, i.e. in fallocate, once posix gets the number of bytes to be allocated for the file from a particular offset, it checks whether that many bytes are available on the disk. If not, fail the fallocate fop with ENOSPC (without attempting it on the backend filesystem).
There is still a possibility of a parallel write happening while this fallocate is in progress, so that by the time the fallocate system call is attempted on the disk, the available space is less than what was calculated before fallocate.
i.e. the following things can happen:

 a) statfs ===> get the available space of the backend filesystem
 b) a parallel write succeeds and extends the file
 c) fallocate is attempted assuming there is sufficient space in the backend
While the above situation can arise, I think we are still fine, because fallocate is attempted from the offset received in the fop. So, irrespective of whether the write extended the file or not, the fallocate itself will be attempted for that many bytes from the offset, which we found to be available by getting statfs information.
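A rough sketch of the pre-check in option 3, in Python rather than the posix xlator's C. The names bytes_available and checked_fallocate are hypothetical, and using f_bavail (space available to unprivileged users) instead of f_bfree is an assumption; a brick process running as root might reasonably use the latter.

```python
import errno
import os


def bytes_available(fd):
    """Free space in bytes on the filesystem holding fd.

    Assumption: f_bavail * f_frsize is the figure to compare against.
    """
    vfs = os.fstatvfs(fd)
    return vfs.f_bavail * vfs.f_frsize


def checked_fallocate(fd, offset, length):
    """Refuse with ENOSPC up front instead of letting the backend
    allocate part of the range and then fail."""
    # a) statfs: sample the free space of the backend filesystem
    if length > bytes_available(fd):
        raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
    # c) a parallel write may have consumed space since the check above,
    #    so this call can still fail with ENOSPC
    os.posix_fallocate(fd, offset, length)
```

The check and the allocation are not atomic, so this narrows the window for a partial allocation but does not close it. It is also conservative: if part of the requested range is already allocated, fewer new blocks are needed than the check assumes.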
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1724754#c3
[2] https://review.gluster.org/#/c/glusterfs/+/22969/

Please provide feedback.

Regards,
Raghavendra

_______________________________________________

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/836554017

NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/486278655

Gluster-devel mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-devel

