I have sent an RFC patch [1] for review.

[1] https://review.gluster.org/#/c/glusterfs/+/23011/
On Thu, Jul 4, 2019 at 1:13 AM Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>
> On Wed, Jul 3, 2019 at 10:59 PM FNU Raghavendra Manjunath <rab...@redhat.com> wrote:
>>
>> On Wed, Jul 3, 2019 at 3:28 AM Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>>>
>>> On Wed, Jul 3, 2019 at 10:14 AM Ravishankar N <ravishan...@redhat.com> wrote:
>>>>
>>>> On 02/07/19 8:52 PM, FNU Raghavendra Manjunath wrote:
>>>>
>>>> Hi All,
>>>>
>>>> In glusterfs, there is an issue with the fallocate behavior. In short, if someone does fallocate from the mount point with a size greater than the space available in the backend filesystem where the file resides, fallocate can fail in the backend filesystem with ENOSPC after allocating only a subset of the requested blocks.
>>>>
>>>> This behavior of fallocate is in itself similar to how it would be on a disk filesystem (at least on XFS, where it was checked): it allocates a subset of the requested blocks and then fails with ENOSPC, and stat on the file reports the number of blocks that were actually allocated by fallocate. Please refer to [1], where the issue is explained.
>>>>
>>>> There is, however, one small difference in behavior between glusterfs and XFS. On XFS, after fallocate fails, doing 'stat' on the file shows the number of blocks that have been allocated, whereas in glusterfs the number of blocks is shown as zero, which makes tools like "du" report zero consumption. This difference comes from how libglusterfs calculates the number of blocks for sparse files etc. (mentioned in [1]).
>>>>
>>>> At this point I can think of 3 ways to handle this.
>>>>
>>>> 1) Except for the number of blocks shown in the stat output for the file from the mount point (on which fallocate was done), the remaining behavior of attempting to allocate the requested size and failing when the filesystem becomes full is already similar to that of XFS.
>>>>
>>>> Hence, what is required is a solution for how libglusterfs calculates blocks for sparse files etc. (without breaking any of the existing components and features). This would make the behavior similar to that of the backend filesystem. It might take its own time to fix the libglusterfs logic without impacting anything else.
>>>>
>>>> I think we should just revert commit b1a5fa55695f497952264e35a9c8eb2bbf1ec4c3 (BZ 817343) and see if it really breaks anything (or check whether whatever it breaks is something we can live with). XFS speculative preallocation is not permanent and the extra space is freed up eventually. It can be sped up via a procfs tunable: http://xfs.org/index.php/XFS_FAQ#Q:_How_can_I_speed_up_or_avoid_delayed_removal_of_speculative_preallocation.3F. We could also tune the allocsize option to a low value like 4k so that glusterfs quota is not affected.
>>>>
>>>> FWIW, ENOSPC is not the only fallocate problem in gluster caused by the 'iatt->ia_block' tweaking. It also breaks the --keep-size option (i.e. the FALLOC_FL_KEEP_SIZE flag in fallocate(2)) and reports incorrect du size.
>>>>
>>>> Regards,
>>>> Ravi
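
To make the behavior described above easy to reproduce, here is a minimal standalone sketch (plain C against the syscall interface, not gluster code; the mount path and the 10 GiB size are placeholder assumptions). Run against a nearly full XFS brick it shows the partially allocated blocks in st_blocks, while the same run against a glusterfs mount currently reports st_blocks as zero:

/* Minimal standalone sketch (not gluster code): fallocate more space than
 * the backend is likely to have, then look at what fstat(2) reports.
 * The path and the 10 GiB size below are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/glusterfs/testfile", O_CREAT | O_RDWR, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Ask for more space than is available; on XFS this can fail with
     * ENOSPC after a subset of the blocks has already been allocated. */
    off_t len = 10LL * 1024 * 1024 * 1024; /* 10 GiB, placeholder */
    if (fallocate(fd, 0, 0, len) != 0)
        perror("fallocate"); /* expected: ENOSPC */

    /* On XFS the partially allocated blocks still show up in st_blocks;
     * on a glusterfs mount st_blocks is reported as zero today. */
    struct stat st;
    if (fstat(fd, &st) == 0)
        printf("st_size=%lld st_blocks=%lld (512-byte units)\n",
               (long long)st.st_size, (long long)st.st_blocks);

    close(fd);
    return 0;
}
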
>>>>
>>>> OR
>>>>
>>>> 2) Once the fallocate fails in the backend filesystem, make the posix xlator in the brick truncate the file to its previous size before the fallocate was attempted. A patch [2] has been sent for this. But there is an issue with this when parallel writes and fallocate operations happen on the same file; it can lead to data loss:
>>>>
>>>> a) statpre is obtained ===> before fallocate is attempted, get the stat and hence the size of the file
>>>> b) a parallel write fop on the same file that extends the file succeeds
>>>> c) fallocate fails
>>>> d) ftruncate truncates the file to the size given by statpre (i.e. the previous stat and the size obtained in step a)
>>>>
>>>> OR
>>>>
>>>> 3) Make posix check the available disk space before doing fallocate. i.e. in fallocate, once posix gets the number of bytes to be allocated for the file from a particular offset, it checks whether that many bytes are available on the disk. If not, fail the fallocate fop with ENOSPC (without attempting it on the backend filesystem).
>>>>
>>>> There is still a possibility of a parallel write happening while this fallocate is in progress, so that by the time the fallocate system call is attempted on the disk, the available space is less than what was calculated before fallocate. i.e. the following can happen:
>>>>
>>>> a) statfs ===> get the available space of the backend filesystem
>>>> b) a parallel write succeeds and extends the file
>>>> c) fallocate is attempted assuming there is sufficient space in the backend
>>>>
>>>> While the above situation can arise, I think we are still fine, because fallocate is attempted from the offset received in the fop. So, irrespective of whether the write extended the file or not, fallocate itself will be attempted for the number of bytes, from that offset, which the statfs information showed to be available.
>>>>
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1724754#c3
>>>> [2] https://review.gluster.org/#/c/glusterfs/+/22969/
>>>>
>>> option 2) will affect performance if we have to serialize all the data operations on the file.
>>> option 3) can still lead to the same problem we are trying to solve, in a different way:
>>>   - thread-1: fallocate comes in with 1MB size; statfs says there is 1MB of space.
>>>   - thread-2: a write on a different file is attempted with 128KB and succeeds.
>>>   - thread-1: fallocate fails on the file after partially allocating the size, because 1MB no longer exists.
>>>
>> Here I have a doubt. Even if a 128K write on the file succeeds, IIUC fallocate will try to reserve 1MB of space relative to the offset that was received as part of the fallocate call, which was found to be available. So, despite the write succeeding, the region fallocate aimed at was 1MB of space from a particular offset. As long as that is available, can posix still go ahead and perform the fallocate operation?
>>
> It can go ahead and perform the operation. Just that, in the case I mentioned, it will lead to partial success because the size fallocate wants to reserve is not available.
>
>> Regards,
>> Raghavendra
>>
>>> So option-1 is what we need to explore and fix, so that the behavior is closer to other posix filesystems. Maybe start with what Ravi suggested?
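
For concreteness, this is roughly what the statfs-based guard of option 3) above could look like (illustration only; it is not taken from any of the patches referenced in this thread, and the helper name and the 0/-errno convention are made up for the sketch):

/* Sketch of an option 3) style check, illustration only: compare the
 * requested length with the free space reported by fstatvfs(3) and fail
 * early with ENOSPC instead of letting the backend partially allocate.
 * It also ignores blocks already allocated in the requested range, which
 * a real implementation would have to account for. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/statvfs.h>

int fallocate_with_space_check(int fd, int mode, off_t offset, off_t len)
{
    struct statvfs vfs;

    /* a) statfs: how much space does the backend filesystem have left? */
    if (fstatvfs(fd, &vfs) != 0)
        return -errno;

    unsigned long long avail =
        (unsigned long long)vfs.f_bavail * vfs.f_frsize;

    /* Refuse up front instead of letting the backend partially allocate. */
    if ((unsigned long long)len > avail)
        return -ENOSPC;

    /* b)/c) A parallel write can still consume space between the check
     * above and the call below -- this is the race discussed in the thread. */
    if (fallocate(fd, mode, offset, len) != 0)
        return -errno;

    return 0;
}

The fstatvfs()/fallocate() window is of course not atomic, which is exactly the race discussed above; the check only avoids attempting an allocation that is already known not to fit.
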
>>>> Please provide feedback.
>>>>
>>>> Regards,
>>>> Raghavendra
>>>>
>>>
>>> --
>>> Pranith
>>
>
> --
> Pranith
_______________________________________________
Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/836554017

NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/486278655

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel