On Wed, Jul 3, 2019 at 10:59 PM FNU Raghavendra Manjunath <[email protected]> wrote:
> On Wed, Jul 3, 2019 at 3:28 AM Pranith Kumar Karampuri <[email protected]> wrote:
>>
>> On Wed, Jul 3, 2019 at 10:14 AM Ravishankar N <[email protected]> wrote:
>>>
>>> On 02/07/19 8:52 PM, FNU Raghavendra Manjunath wrote:
>>>
>>> Hi All,
>>>
>>> In glusterfs, there is an issue with the fallocate behavior. In short,
>>> if someone does fallocate from the mount point with a size greater than
>>> the space available in the backend filesystem where the file is present,
>>> then fallocate can fail with ENOSPC from the backend filesystem after
>>> only a subset of the required number of blocks has been allocated.
>>>
>>> The behavior of fallocate in itself is similar to how it would be on a
>>> disk filesystem (at least on XFS, where it was checked): it allocates a
>>> subset of the required number of blocks and then fails with ENOSPC, and
>>> a stat on the file shows the number of blocks that were actually
>>> allocated by fallocate. Please refer to [1], where the issue is
>>> explained.
>>>
>>> Now, there is one small difference in behavior between glusterfs and
>>> XFS. On XFS, after fallocate fails, doing 'stat' on the file shows the
>>> number of blocks that have been allocated. In glusterfs, the number of
>>> blocks is shown as zero, which makes tools like "du" show zero
>>> consumption. This difference exists because of how libglusterfs
>>> calculates the number of blocks for sparse files etc. (mentioned in
>>> [1]).
>>>
>>> At this point I can think of 3 ways to handle this.
>>>
>>> 1) Except for the number of blocks shown in the stat output from the
>>> mount point for the file on which fallocate was done, the behavior of
>>> attempting to allocate the requested size and failing when the
>>> filesystem becomes full is already similar to that of XFS.
>>>
>>> Hence, what is required is a solution for how libglusterfs calculates
>>> blocks for sparse files etc. (without breaking any of the existing
>>> components and features). This makes the behavior similar to that of
>>> the backend filesystem, but it might require its own time to fix the
>>> libglusterfs logic without impacting anything else.
>>>
>>> I think we should just revert commit
>>> b1a5fa55695f497952264e35a9c8eb2bbf1ec4c3 (BZ 817343) and see if it
>>> really breaks anything (or check whether whatever it breaks is
>>> something that we can live with). XFS speculative preallocation is not
>>> permanent and the extra space is freed up eventually. It can be sped up
>>> via a procfs tunable:
>>> http://xfs.org/index.php/XFS_FAQ#Q:_How_can_I_speed_up_or_avoid_delayed_removal_of_speculative_preallocation.3F.
>>> We could also tune the allocsize mount option to a low value like 4k so
>>> that glusterfs quota is not affected.
>>>
>>> FWIW, ENOSPC is not the only fallocate problem in gluster caused by the
>>> 'iatt->ia_blocks' tweaking. It also breaks the --keep-size option (i.e.
>>> the FALLOC_FL_KEEP_SIZE flag in fallocate(2)) and reports an incorrect
>>> du size.
>>>
>>> Regards,
>>> Ravi
>>>
>>> OR
>>>
>>> 2) Once the fallocate fails in the backend filesystem, make the posix
>>> xlator in the brick truncate the file back to its size from before the
>>> fallocate attempt. A patch [2] has been sent for this. But there is an
>>> issue with this when parallel writes and fallocate operations happen on
>>> the same file: it can lead to data loss.
>>>
>>> a) statpre is obtained ===> before fallocate is attempted, get the stat
>>> and hence the size of the file
>>> b) a parallel write fop on the same file that extends the file is
>>> successful
>>> c) fallocate fails
>>> d) ftruncate truncates the file to the size given by statpre (i.e. the
>>> size obtained in step a), discarding the data written in step b
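
A minimal sketch of this rollback approach, written against the raw
syscalls rather than the actual posix xlator code (the function name is
illustrative): the race from steps a)-d) above sits between the fstat()
and the ftruncate().

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Option 2 sketch: remember the file size, attempt the allocation,
     * and roll back on ENOSPC. A parallel write that extends the file
     * between fstat() and ftruncate() is silently discarded. */
    static int fallocate_with_rollback(int fd, off_t offset, off_t len)
    {
        struct stat statpre;
        int err;

        if (fstat(fd, &statpre) < 0) /* step a: remember the old size */
            return -errno;

        /* step b can happen here: a parallel write extends the file */

        if (fallocate(fd, 0, offset, len) == 0)
            return 0;

        err = errno; /* step c: fallocate failed, possibly part-way through */
        if (err == ENOSPC)
            /* step d: truncate back to the step a) size -- this is where
             * the data written by the parallel write in step b) is lost */
            ftruncate(fd, statpre.st_size);

        return -err;
    }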
>>>
>>> OR
>>>
>>> 3) Make posix check the available disk space before doing fallocate,
>>> i.e. in fallocate, once posix gets the number of bytes to be allocated
>>> for the file from a particular offset, it checks whether that many
>>> bytes are available in the disk. If not, the fallocate fop is failed
>>> with ENOSPC (without attempting it on the backend filesystem).
>>>
>>> There is still a probability of a parallel write happening while this
>>> fallocate is in progress, so that by the time the fallocate system
>>> call is attempted on the disk, the available space has become less
>>> than what was calculated before fallocate. i.e. the following can
>>> happen:
>>>
>>> a) statfs ===> get the available space of the backend filesystem
>>> b) a parallel write succeeds and extends the file
>>> c) fallocate is attempted assuming there is sufficient space in the
>>> backend
>>>
>>> While the above situation can arise, I think we are still fine,
>>> because fallocate is attempted from the offset received in the fop.
>>> So, irrespective of whether the write extended the file or not, the
>>> fallocate itself will be attempted for that many bytes from the
>>> offset, which we found to be available by getting the statfs
>>> information.
>>>
>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1724754#c3
>>> [2] https://review.gluster.org/#/c/glusterfs/+/22969/
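
A minimal sketch of the pre-check in option 3, again against the raw
syscalls rather than the actual posix xlator code (the function name is
illustrative, and the check conservatively assumes that none of the
requested range is already allocated). The statfs-to-fallocate window is
marked where it occurs.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/statvfs.h>

    /* Option 3 sketch: fail with ENOSPC up front if the backend does not
     * report enough free space, instead of letting fallocate allocate a
     * subset of the blocks and then fail. */
    static int fallocate_with_space_check(int fd, off_t offset, off_t len)
    {
        struct statvfs vfs;

        if (fstatvfs(fd, &vfs) < 0) /* step a: statfs on the backend */
            return -errno;

        /* available bytes = blocks free to unprivileged users * fragment size */
        if ((uint64_t)vfs.f_bavail * vfs.f_frsize < (uint64_t)len)
            return -ENOSPC; /* refuse without touching the backend */

        /* step b can happen here: a parallel write consumes space, so the
         * call below can still partially allocate and fail with ENOSPC */

        if (fallocate(fd, 0, offset, len) < 0) /* step c */
            return -errno;

        return 0;
    }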
>>
>> option 2) will affect performance if we have to serialize all the data
>> operations on the file.
>>
>> option 3) can still lead to the same problem we are trying to solve,
>> just in a different way:
>> - thread-1: fallocate comes in with 1MB size; statfs says there is 1MB
>>   of space.
>> - thread-2: a write on a different file is attempted with 128KB and
>>   succeeds.
>> - thread-1: fallocate fails on the file after partially allocating the
>>   size, because 1MB is no longer available.
>
> Here I have a doubt. Even if a 128K write on the file succeeds, IIUC
> fallocate will try to reserve 1MB of space relative to the offset that
> was received as part of the fallocate call, which was found to be
> available. So, despite the write succeeding, the region fallocate aimed
> at was 1MB of space from a particular offset. As long as that is
> available, can posix still go ahead and perform the fallocate operation?

It can go ahead and perform the operation. It is just that, in the case I
mentioned, it will lead to a partial success because the size fallocate
wants to reserve is not available.

> Regards,
> Raghavendra

>> So option-1 is what we need to explore and fix, so that the behavior is
>> closer to that of other posix filesystems. Maybe start with what Ravi
>> suggested?
>>
>>> Please provide feedback.
>>>
>>> Regards,
>>> Raghavendra

--
Pranith
