Re: seekdir/readdir patch .. Call for Review.
On 5/6/15 7:33 AM, Rick Macklem wrote: Julian Elischer wrote: On 5/3/15 10:33 PM, Jilles Tjoelker wrote: On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov wrote: On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: if you are interested in readdir(3), seekdir(3) and telldir(3) then you should look at https://reviews.freebsd.org/D2410 this patches around a problem in seekdir() that breaks Samba. Seekdir(3) will not work as expected when files prior to the point of interest in directory have been deleted since the directory was opened. Windows clients using Samba cause both these things to happen, causing the next readdir(3) after the bad seekdir(3) to skip some entries and return the wrong file. Samba only needs to step back a single directory entry in the case where it reads an entry and then discovers it can't fit it into the buffer it is sending to the windows client. It turns out we can reliably cater to Samba's requirement because the "last returned element" is always still in memory, so with a little care, we can set our filepointer back to it safely. (once) seekdir and readdir (and telldir()) need a complete rewrite along with getdirentries() but that is more than a small edit like this. Can you explain your expectations from the whole readdir() vs. parallel directory modifications interaction ? From what I understood so far, there is unlocked modification of the container and parallel iterator over the same container. IMO, in such situation, whatever tweaks you apply to the iterator, it is still cannot be made reliable. Before making single-purpose changes to the libc readdir and seekdir code, or to the kernel code, it would be useful to state exact behaviour of the dirent machinery we want to see. No, 'make samba works in my situation' does not sound good enough. Consider the subsequence of entries that existed at opendir() time and were not removed until now. This subsequence is clearly defined and does not have concurrency problems. The order of this subsequence must remain unchanged and seekdir() must be correct with respect to this subsequence. Additionally, two other kinds of entries may be returned. New entries may be inserted anywhere in between the entries of the subsequence, and removed entries may be returned as if they were still part of the subsequence (so that not every readdir() needs a system call). A simple implementation for UFS-style directories is to store the offset in the directory (all bits of it, not masking off the lower 9 bits). This needs d_off or similar in struct dirent. The kernel getdirentries() then needs a similar loop as the old libc seekdir() to go from the start of the 512-byte directory block to the desired entry (since an entry may not exist at the stored offset within the directory block). At least it needs some more information in struct dirent than there is now.. A cookie is the current fashion.. that assumes however that the filesystems are capable of converting backwards from cookie to 'location'. ZFS claims to be able to do so.. My current plan for a patch is... - d_cookie would be the "physical" file system position of the next directory entry - a ngetdirentries() would take a "physical" cookie as a value/result argument. It would indicate where to start and would return the cookie for the next entry after what is returned. (It could probably be stuffed in uio_offset, but I think it might be clearer to make it a separate arg.) --> It would pass this physical cookie down to the file system's VOP_READDIR(). (For UFS it would be the byte offset in the on-disk directory. For ZFS, I believe it is an index for the entry. For NFS, it is the cookie that is sent to the server. For others, I don't yet know.) - dd_seek, dd_loc, loc_seek and loc_loc would be replaced by dd_cookie and loc_cookie. (For arches where sizeof(long) == 8, I think telldir() could just return the cookie and forget about the loc_XX structures?) This would get rid of the loc_seek, loc_loc bogosity that no longer makes much sense, since the byte offset in what is returned by getdirentries() has little to do with the "physical" directory position. I have already done the kernel stuff for some file systems and the libc changes actually simplify things compared to what is there now. zfs already has a cookie production facility as part of the VFS readdir (or is it dir-read?) method. rick Thee other thing to do would be to store some sort of strong hash of the name and inode# in each telldir entry.. we would seek to the saved seek location and seek forward computing or looking for a matching hash. I woudl also add that the man pages talk about the readdir blocksize a bit and mention the file blocksize (and stat) which is often way dfferent from 512 bytes.. usually 16k or so these days. I found setting the read size to the same as the fs blocksize seemd to work well. This means that a UFS-style di
Re: seekdir/readdir patch .. Call for Review.
Julian Elischer wrote: > On 5/3/15 10:33 PM, Jilles Tjoelker wrote: > > On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov > > wrote: > >> On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: > >>> if you are interested in readdir(3), seekdir(3) and telldir(3) > >>> then > >>> you should look at > >>> https://reviews.freebsd.org/D2410 > >>> this patches around a problem in seekdir() that breaks Samba. > >>> Seekdir(3) will not work as expected when files prior to the > >>> point of > >>> interest in directory have been deleted since the directory was > >>> opened. > >>> Windows clients using Samba cause both these things to happen, > >>> causing > >>> the next readdir(3) after the bad seekdir(3) to skip some entries > >>> and > >>> return the wrong file. > >>> Samba only needs to step back a single directory entry in the > >>> case > >>> where it reads an entry and then discovers it can't fit it into > >>> the > >>> buffer it is sending to the windows client. It turns out we can > >>> reliably cater to Samba's requirement because the "last returned > >>> element" is always still in memory, so with a little care, we can > >>> set our filepointer back to it safely. (once) > >>> seekdir and readdir (and telldir()) need a complete rewrite along > >>> with > >>> getdirentries() but that is more than a small edit like this. > >> Can you explain your expectations from the whole readdir() vs. > >> parallel > >> directory modifications interaction ? From what I understood so > >> far, > >> there is unlocked modification of the container and parallel > >> iterator > >> over the same container. IMO, in such situation, whatever tweaks > >> you > >> apply to the iterator, it is still cannot be made reliable. > >> Before making single-purpose changes to the libc readdir and > >> seekdir > >> code, or to the kernel code, it would be useful to state exact > >> behaviour > >> of the dirent machinery we want to see. No, 'make samba works in > >> my > >> situation' does not sound good enough. > > Consider the subsequence of entries that existed at opendir() time > > and > > were not removed until now. This subsequence is clearly defined and > > does > > not have concurrency problems. The order of this subsequence must > > remain > > unchanged and seekdir() must be correct with respect to this > > subsequence. > > > > Additionally, two other kinds of entries may be returned. New > > entries > > may be inserted anywhere in between the entries of the subsequence, > > and > > removed entries may be returned as if they were still part of the > > subsequence (so that not every readdir() needs a system call). > > > > A simple implementation for UFS-style directories is to store the > > offset > > in the directory (all bits of it, not masking off the lower 9 > > bits). > > This needs d_off or similar in struct dirent. The kernel > > getdirentries() > > then needs a similar loop as the old libc seekdir() to go from the > > start > > of the 512-byte directory block to the desired entry (since an > > entry may > > not exist at the stored offset within the directory block). > At least it needs some more information in struct dirent than there > is > now.. > A cookie is the current fashion.. that assumes however that the > filesystems > are capable of converting backwards from cookie to 'location'. ZFS > claims to > be able to do so.. My current plan for a patch is... - d_cookie would be the "physical" file system position of the next directory entry - a ngetdirentries() would take a "physical" cookie as a value/result argument. It would indicate where to start and would return the cookie for the next entry after what is returned. (It could probably be stuffed in uio_offset, but I think it might be clearer to make it a separate arg.) --> It would pass this physical cookie down to the file system's VOP_READDIR(). (For UFS it would be the byte offset in the on-disk directory. For ZFS, I believe it is an index for the entry. For NFS, it is the cookie that is sent to the server. For others, I don't yet know.) - dd_seek, dd_loc, loc_seek and loc_loc would be replaced by dd_cookie and loc_cookie. (For arches where sizeof(long) == 8, I think telldir() could just return the cookie and forget about the loc_XX structures?) This would get rid of the loc_seek, loc_loc bogosity that no longer makes much sense, since the byte offset in what is returned by getdirentries() has little to do with the "physical" directory position. I have already done the kernel stuff for some file systems and the libc changes actually simplify things compared to what is there now. rick > Thee other thing to do would be to store some sort > of strong > hash of the name and inode# in each telldir entry.. > we would seek to the saved seek location and seek forward computing > or > looking > for a matching hash. I woudl also add that the man pages talk about > the > readdir blocksize a bit and mention the
Re: seekdir/readdir patch .. Call for Review.
Julian Elischer wrote: > On 5/3/15 10:33 PM, Jilles Tjoelker wrote: > > On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov > > wrote: > >> On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: > >>> if you are interested in readdir(3), seekdir(3) and telldir(3) > >>> then > >>> you should look at > >>> https://reviews.freebsd.org/D2410 > >>> this patches around a problem in seekdir() that breaks Samba. > >>> Seekdir(3) will not work as expected when files prior to the > >>> point of > >>> interest in directory have been deleted since the directory was > >>> opened. > >>> Windows clients using Samba cause both these things to happen, > >>> causing > >>> the next readdir(3) after the bad seekdir(3) to skip some entries > >>> and > >>> return the wrong file. > >>> Samba only needs to step back a single directory entry in the > >>> case > >>> where it reads an entry and then discovers it can't fit it into > >>> the > >>> buffer it is sending to the windows client. It turns out we can > >>> reliably cater to Samba's requirement because the "last returned > >>> element" is always still in memory, so with a little care, we can > >>> set our filepointer back to it safely. (once) > >>> seekdir and readdir (and telldir()) need a complete rewrite along > >>> with > >>> getdirentries() but that is more than a small edit like this. > >> Can you explain your expectations from the whole readdir() vs. > >> parallel > >> directory modifications interaction ? From what I understood so > >> far, > >> there is unlocked modification of the container and parallel > >> iterator > >> over the same container. IMO, in such situation, whatever tweaks > >> you > >> apply to the iterator, it is still cannot be made reliable. > >> Before making single-purpose changes to the libc readdir and > >> seekdir > >> code, or to the kernel code, it would be useful to state exact > >> behaviour > >> of the dirent machinery we want to see. No, 'make samba works in > >> my > >> situation' does not sound good enough. > > Consider the subsequence of entries that existed at opendir() time > > and > > were not removed until now. This subsequence is clearly defined and > > does > > not have concurrency problems. The order of this subsequence must > > remain > > unchanged and seekdir() must be correct with respect to this > > subsequence. > > > > Additionally, two other kinds of entries may be returned. New > > entries > > may be inserted anywhere in between the entries of the subsequence, > > and > > removed entries may be returned as if they were still part of the > > subsequence (so that not every readdir() needs a system call). > > > > A simple implementation for UFS-style directories is to store the > > offset > > in the directory (all bits of it, not masking off the lower 9 > > bits). > > This needs d_off or similar in struct dirent. The kernel > > getdirentries() > > then needs a similar loop as the old libc seekdir() to go from the > > start > > of the 512-byte directory block to the desired entry (since an > > entry may > > not exist at the stored offset within the directory block). > > > > This means that a UFS-style directory cannot be compacted (existing > > entries moved from higher to lower offsets to fill holes) while it > > is > > open for reading. An NFS exported directory is always open for > > reading. > > > > This also means that duplicate entries can only be returned if that > > particular filename was deleted and created again. > > > > Without kernel support, it is hard to get telldir/seekdir > > completely > > reliable. The current libc implementation is wrong since the > > "holes" > > within the block just disappear and change the offsets of the > > following > > entries; the kernel cannot fix this using entries with d_fileno = 0 > > since it cannot know, in the general case, how long the deleted > > entry > > was in the filesystem-independent dirent format. My previous idea > > of > > storing one d_fileno during telldir() is wrong since it will fail > > if > > that entry is deleted. > > > > If you do not care about memory usage (which probably is already > > excessive with the current libc implementation), you could store at > > telldir() time the offset of the current block returned by > > getdirentries() and the d_fileno of all entries already returned in > > the > > current block. > > > > The D2410 patch can conceptually work for what Samba needs, > > stepping > > back one directory entry. I will comment on it. > thanks > > your comment is correct, but I don't think it really matters because > I'm only claiming > to fix a really small set of possible usages.. I might add a > sentence > in the seekdir > man page specifying what does and doesn't work. > Ok, I think I finally understand the bug that this patch is fixing... If telldir() is called when at the end of a block read by getdirentries(), it sets loc_seek to the seek position for the block and loc_loc to the offset of the end of the block.
Re: seekdir/readdir patch .. Call for Review.
Julian Elischer wrote: > On 5/5/15 8:42 AM, Rick Macklem wrote: > > Julian Elischer wrote: > >> On 5/3/15 10:33 PM, Jilles Tjoelker wrote: > >>> On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov > >>> wrote: > On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: > > if you are interested in readdir(3), seekdir(3) and telldir(3) > > then > > you should look at > > https://reviews.freebsd.org/D2410 > > this patches around a problem in seekdir() that breaks Samba. > > Seekdir(3) will not work as expected when files prior to the > > point of > > interest in directory have been deleted since the directory was > > opened. > > Windows clients using Samba cause both these things to happen, > > causing > > the next readdir(3) after the bad seekdir(3) to skip some > > entries > > and > > return the wrong file. > > Samba only needs to step back a single directory entry in the > > case > > where it reads an entry and then discovers it can't fit it into > > the > > buffer it is sending to the windows client. It turns out we can > > reliably cater to Samba's requirement because the "last > > returned > > element" is always still in memory, so with a little care, we > > can > > set our filepointer back to it safely. (once) > > seekdir and readdir (and telldir()) need a complete rewrite > > along > > with > > getdirentries() but that is more than a small edit like this. > Can you explain your expectations from the whole readdir() vs. > parallel > directory modifications interaction ? From what I understood so > far, > there is unlocked modification of the container and parallel > iterator > over the same container. IMO, in such situation, whatever > tweaks > you > apply to the iterator, it is still cannot be made reliable. > Before making single-purpose changes to the libc readdir and > seekdir > code, or to the kernel code, it would be useful to state exact > behaviour > of the dirent machinery we want to see. No, 'make samba works in > my > situation' does not sound good enough. > >>> Consider the subsequence of entries that existed at opendir() > >>> time > >>> and > >>> were not removed until now. This subsequence is clearly defined > >>> and > >>> does > >>> not have concurrency problems. The order of this subsequence must > >>> remain > >>> unchanged and seekdir() must be correct with respect to this > >>> subsequence. > >>> > >>> Additionally, two other kinds of entries may be returned. New > >>> entries > >>> may be inserted anywhere in between the entries of the > >>> subsequence, > >>> and > >>> removed entries may be returned as if they were still part of the > >>> subsequence (so that not every readdir() needs a system call). > >>> > >>> A simple implementation for UFS-style directories is to store the > >>> offset > >>> in the directory (all bits of it, not masking off the lower 9 > >>> bits). > >>> This needs d_off or similar in struct dirent. The kernel > >>> getdirentries() > >>> then needs a similar loop as the old libc seekdir() to go from > >>> the > >>> start > >>> of the 512-byte directory block to the desired entry (since an > >>> entry may > >>> not exist at the stored offset within the directory block). > >>> > >>> This means that a UFS-style directory cannot be compacted > >>> (existing > >>> entries moved from higher to lower offsets to fill holes) while > >>> it > >>> is > >>> open for reading. An NFS exported directory is always open for > >>> reading. > >>> > >>> This also means that duplicate entries can only be returned if > >>> that > >>> particular filename was deleted and created again. > >>> > >>> Without kernel support, it is hard to get telldir/seekdir > >>> completely > >>> reliable. The current libc implementation is wrong since the > >>> "holes" > >>> within the block just disappear and change the offsets of the > >>> following > >>> entries; the kernel cannot fix this using entries with d_fileno = > >>> 0 > >>> since it cannot know, in the general case, how long the deleted > >>> entry > >>> was in the filesystem-independent dirent format. My previous idea > >>> of > >>> storing one d_fileno during telldir() is wrong since it will fail > >>> if > >>> that entry is deleted. > >>> > >>> If you do not care about memory usage (which probably is already > >>> excessive with the current libc implementation), you could store > >>> at > >>> telldir() time the offset of the current block returned by > >>> getdirentries() and the d_fileno of all entries already returned > >>> in > >>> the > >>> current block. > >>> > >>> The D2410 patch can conceptually work for what Samba needs, > >>> stepping > >>> back one directory entry. I will comment on it. > >>> > >> how long do I have to wait until I can commit this and was kib's > >> comment a > >> "do not commit"?
Re: seekdir/readdir patch .. Call for Review.
On 5/5/15 8:42 AM, Rick Macklem wrote: Julian Elischer wrote: On 5/3/15 10:33 PM, Jilles Tjoelker wrote: On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov wrote: On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: if you are interested in readdir(3), seekdir(3) and telldir(3) then you should look at https://reviews.freebsd.org/D2410 this patches around a problem in seekdir() that breaks Samba. Seekdir(3) will not work as expected when files prior to the point of interest in directory have been deleted since the directory was opened. Windows clients using Samba cause both these things to happen, causing the next readdir(3) after the bad seekdir(3) to skip some entries and return the wrong file. Samba only needs to step back a single directory entry in the case where it reads an entry and then discovers it can't fit it into the buffer it is sending to the windows client. It turns out we can reliably cater to Samba's requirement because the "last returned element" is always still in memory, so with a little care, we can set our filepointer back to it safely. (once) seekdir and readdir (and telldir()) need a complete rewrite along with getdirentries() but that is more than a small edit like this. Can you explain your expectations from the whole readdir() vs. parallel directory modifications interaction ? From what I understood so far, there is unlocked modification of the container and parallel iterator over the same container. IMO, in such situation, whatever tweaks you apply to the iterator, it is still cannot be made reliable. Before making single-purpose changes to the libc readdir and seekdir code, or to the kernel code, it would be useful to state exact behaviour of the dirent machinery we want to see. No, 'make samba works in my situation' does not sound good enough. Consider the subsequence of entries that existed at opendir() time and were not removed until now. This subsequence is clearly defined and does not have concurrency problems. The order of this subsequence must remain unchanged and seekdir() must be correct with respect to this subsequence. Additionally, two other kinds of entries may be returned. New entries may be inserted anywhere in between the entries of the subsequence, and removed entries may be returned as if they were still part of the subsequence (so that not every readdir() needs a system call). A simple implementation for UFS-style directories is to store the offset in the directory (all bits of it, not masking off the lower 9 bits). This needs d_off or similar in struct dirent. The kernel getdirentries() then needs a similar loop as the old libc seekdir() to go from the start of the 512-byte directory block to the desired entry (since an entry may not exist at the stored offset within the directory block). This means that a UFS-style directory cannot be compacted (existing entries moved from higher to lower offsets to fill holes) while it is open for reading. An NFS exported directory is always open for reading. This also means that duplicate entries can only be returned if that particular filename was deleted and created again. Without kernel support, it is hard to get telldir/seekdir completely reliable. The current libc implementation is wrong since the "holes" within the block just disappear and change the offsets of the following entries; the kernel cannot fix this using entries with d_fileno = 0 since it cannot know, in the general case, how long the deleted entry was in the filesystem-independent dirent format. My previous idea of storing one d_fileno during telldir() is wrong since it will fail if that entry is deleted. If you do not care about memory usage (which probably is already excessive with the current libc implementation), you could store at telldir() time the offset of the current block returned by getdirentries() and the d_fileno of all entries already returned in the current block. The D2410 patch can conceptually work for what Samba needs, stepping back one directory entry. I will comment on it. how long do I have to wait until I can commit this and was kib's comment a "do not commit"? What about the bug Jilles reports in D2410? - He said you might fix the problem by having telldir move the entry to the head of the list when it has a hit. However, this means that an "old" entry gets modified. Is it possible for this "modified" entry to be a match and get modified again, and so on...? I will comment on the patch once you decide how to deal with Jilles bug. I don't think is is a "bug" as such.. it wasn't a case I was looking to fix and it is just as it was before. I'd rephrase it as: "Jilles asks that we also fix the case where the previous telldir response is a recycling of an old telldir response." In actual fact this scenario will almost never happen. because the previous time the telldir call returned that location, the 'fixtelldir' function would have later been called on the result, whi
Re: seekdir/readdir patch .. Call for Review.
On 5/5/15 12:04 AM, Konstantin Belousov wrote: On Mon, May 04, 2015 at 10:52:42PM +0800, Julian Elischer wrote: On 5/3/15 10:33 PM, Jilles Tjoelker wrote: On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov wrote: On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: if you are interested in readdir(3), seekdir(3) and telldir(3) then you should look at https://reviews.freebsd.org/D2410 this patches around a problem in seekdir() that breaks Samba. Seekdir(3) will not work as expected when files prior to the point of interest in directory have been deleted since the directory was opened. Windows clients using Samba cause both these things to happen, causing the next readdir(3) after the bad seekdir(3) to skip some entries and return the wrong file. Samba only needs to step back a single directory entry in the case where it reads an entry and then discovers it can't fit it into the buffer it is sending to the windows client. It turns out we can reliably cater to Samba's requirement because the "last returned element" is always still in memory, so with a little care, we can set our filepointer back to it safely. (once) seekdir and readdir (and telldir()) need a complete rewrite along with getdirentries() but that is more than a small edit like this. Can you explain your expectations from the whole readdir() vs. parallel directory modifications interaction ? From what I understood so far, there is unlocked modification of the container and parallel iterator over the same container. IMO, in such situation, whatever tweaks you apply to the iterator, it is still cannot be made reliable. Before making single-purpose changes to the libc readdir and seekdir code, or to the kernel code, it would be useful to state exact behaviour of the dirent machinery we want to see. No, 'make samba works in my situation' does not sound good enough. Consider the subsequence of entries that existed at opendir() time and were not removed until now. This subsequence is clearly defined and does not have concurrency problems. The order of this subsequence must remain unchanged and seekdir() must be correct with respect to this subsequence. Additionally, two other kinds of entries may be returned. New entries may be inserted anywhere in between the entries of the subsequence, and removed entries may be returned as if they were still part of the subsequence (so that not every readdir() needs a system call). A simple implementation for UFS-style directories is to store the offset in the directory (all bits of it, not masking off the lower 9 bits). This needs d_off or similar in struct dirent. The kernel getdirentries() then needs a similar loop as the old libc seekdir() to go from the start of the 512-byte directory block to the desired entry (since an entry may not exist at the stored offset within the directory block). This means that a UFS-style directory cannot be compacted (existing entries moved from higher to lower offsets to fill holes) while it is open for reading. An NFS exported directory is always open for reading. This also means that duplicate entries can only be returned if that particular filename was deleted and created again. Without kernel support, it is hard to get telldir/seekdir completely reliable. The current libc implementation is wrong since the "holes" within the block just disappear and change the offsets of the following entries; the kernel cannot fix this using entries with d_fileno = 0 since it cannot know, in the general case, how long the deleted entry was in the filesystem-independent dirent format. My previous idea of storing one d_fileno during telldir() is wrong since it will fail if that entry is deleted. If you do not care about memory usage (which probably is already excessive with the current libc implementation), you could store at telldir() time the offset of the current block returned by getdirentries() and the d_fileno of all entries already returned in the current block. The D2410 patch can conceptually work for what Samba needs, stepping back one directory entry. I will comment on it. how long do I have to wait until I can commit this and was kib's comment a "do not commit"? No, I only mean what I asked about. I do not have strong objections about the patch, but whatever is done in this regard, should clearly explain the case it handles and related limitations (IMO). ok I'll add some comments and add more in the commit message and man page. JUlian ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: seekdir/readdir patch .. Call for Review.
On Mon, May 04, 2015 at 10:52:42PM +0800, Julian Elischer wrote: > On 5/3/15 10:33 PM, Jilles Tjoelker wrote: > > On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov wrote: > >> On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: > >>> if you are interested in readdir(3), seekdir(3) and telldir(3) then > >>> you should look at > >>> https://reviews.freebsd.org/D2410 > >>> this patches around a problem in seekdir() that breaks Samba. > >>> Seekdir(3) will not work as expected when files prior to the point of > >>> interest in directory have been deleted since the directory was opened. > >>> Windows clients using Samba cause both these things to happen, causing > >>> the next readdir(3) after the bad seekdir(3) to skip some entries and > >>> return the wrong file. > >>> Samba only needs to step back a single directory entry in the case > >>> where it reads an entry and then discovers it can't fit it into the > >>> buffer it is sending to the windows client. It turns out we can > >>> reliably cater to Samba's requirement because the "last returned > >>> element" is always still in memory, so with a little care, we can > >>> set our filepointer back to it safely. (once) > >>> seekdir and readdir (and telldir()) need a complete rewrite along with > >>> getdirentries() but that is more than a small edit like this. > >> Can you explain your expectations from the whole readdir() vs. parallel > >> directory modifications interaction ? From what I understood so far, > >> there is unlocked modification of the container and parallel iterator > >> over the same container. IMO, in such situation, whatever tweaks you > >> apply to the iterator, it is still cannot be made reliable. > >> Before making single-purpose changes to the libc readdir and seekdir > >> code, or to the kernel code, it would be useful to state exact behaviour > >> of the dirent machinery we want to see. No, 'make samba works in my > >> situation' does not sound good enough. > > Consider the subsequence of entries that existed at opendir() time and > > were not removed until now. This subsequence is clearly defined and does > > not have concurrency problems. The order of this subsequence must remain > > unchanged and seekdir() must be correct with respect to this > > subsequence. > > > > Additionally, two other kinds of entries may be returned. New entries > > may be inserted anywhere in between the entries of the subsequence, and > > removed entries may be returned as if they were still part of the > > subsequence (so that not every readdir() needs a system call). > > > > A simple implementation for UFS-style directories is to store the offset > > in the directory (all bits of it, not masking off the lower 9 bits). > > This needs d_off or similar in struct dirent. The kernel getdirentries() > > then needs a similar loop as the old libc seekdir() to go from the start > > of the 512-byte directory block to the desired entry (since an entry may > > not exist at the stored offset within the directory block). > > > > This means that a UFS-style directory cannot be compacted (existing > > entries moved from higher to lower offsets to fill holes) while it is > > open for reading. An NFS exported directory is always open for reading. > > > > This also means that duplicate entries can only be returned if that > > particular filename was deleted and created again. > > > > Without kernel support, it is hard to get telldir/seekdir completely > > reliable. The current libc implementation is wrong since the "holes" > > within the block just disappear and change the offsets of the following > > entries; the kernel cannot fix this using entries with d_fileno = 0 > > since it cannot know, in the general case, how long the deleted entry > > was in the filesystem-independent dirent format. My previous idea of > > storing one d_fileno during telldir() is wrong since it will fail if > > that entry is deleted. > > > > If you do not care about memory usage (which probably is already > > excessive with the current libc implementation), you could store at > > telldir() time the offset of the current block returned by > > getdirentries() and the d_fileno of all entries already returned in the > > current block. > > > > The D2410 patch can conceptually work for what Samba needs, stepping > > back one directory entry. I will comment on it. > > > how long do I have to wait until I can commit this and was kib's > comment a > "do not commit"? No, I only mean what I asked about. I do not have strong objections about the patch, but whatever is done in this regard, should clearly explain the case it handles and related limitations (IMO). ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: seekdir/readdir patch .. Call for Review.
On 5/3/15 10:33 PM, Jilles Tjoelker wrote: On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov wrote: On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: if you are interested in readdir(3), seekdir(3) and telldir(3) then you should look at https://reviews.freebsd.org/D2410 this patches around a problem in seekdir() that breaks Samba. Seekdir(3) will not work as expected when files prior to the point of interest in directory have been deleted since the directory was opened. Windows clients using Samba cause both these things to happen, causing the next readdir(3) after the bad seekdir(3) to skip some entries and return the wrong file. Samba only needs to step back a single directory entry in the case where it reads an entry and then discovers it can't fit it into the buffer it is sending to the windows client. It turns out we can reliably cater to Samba's requirement because the "last returned element" is always still in memory, so with a little care, we can set our filepointer back to it safely. (once) seekdir and readdir (and telldir()) need a complete rewrite along with getdirentries() but that is more than a small edit like this. Can you explain your expectations from the whole readdir() vs. parallel directory modifications interaction ? From what I understood so far, there is unlocked modification of the container and parallel iterator over the same container. IMO, in such situation, whatever tweaks you apply to the iterator, it is still cannot be made reliable. Before making single-purpose changes to the libc readdir and seekdir code, or to the kernel code, it would be useful to state exact behaviour of the dirent machinery we want to see. No, 'make samba works in my situation' does not sound good enough. Consider the subsequence of entries that existed at opendir() time and were not removed until now. This subsequence is clearly defined and does not have concurrency problems. The order of this subsequence must remain unchanged and seekdir() must be correct with respect to this subsequence. Additionally, two other kinds of entries may be returned. New entries may be inserted anywhere in between the entries of the subsequence, and removed entries may be returned as if they were still part of the subsequence (so that not every readdir() needs a system call). A simple implementation for UFS-style directories is to store the offset in the directory (all bits of it, not masking off the lower 9 bits). This needs d_off or similar in struct dirent. The kernel getdirentries() then needs a similar loop as the old libc seekdir() to go from the start of the 512-byte directory block to the desired entry (since an entry may not exist at the stored offset within the directory block). This means that a UFS-style directory cannot be compacted (existing entries moved from higher to lower offsets to fill holes) while it is open for reading. An NFS exported directory is always open for reading. This also means that duplicate entries can only be returned if that particular filename was deleted and created again. Without kernel support, it is hard to get telldir/seekdir completely reliable. The current libc implementation is wrong since the "holes" within the block just disappear and change the offsets of the following entries; the kernel cannot fix this using entries with d_fileno = 0 since it cannot know, in the general case, how long the deleted entry was in the filesystem-independent dirent format. My previous idea of storing one d_fileno during telldir() is wrong since it will fail if that entry is deleted. If you do not care about memory usage (which probably is already excessive with the current libc implementation), you could store at telldir() time the offset of the current block returned by getdirentries() and the d_fileno of all entries already returned in the current block. The D2410 patch can conceptually work for what Samba needs, stepping back one directory entry. I will comment on it. how long do I have to wait until I can commit this and was kib's comment a "do not commit"? ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: seekdir/readdir patch .. Call for Review.
On 5/3/15 10:33 PM, Jilles Tjoelker wrote: On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov wrote: On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: if you are interested in readdir(3), seekdir(3) and telldir(3) then you should look at https://reviews.freebsd.org/D2410 this patches around a problem in seekdir() that breaks Samba. Seekdir(3) will not work as expected when files prior to the point of interest in directory have been deleted since the directory was opened. Windows clients using Samba cause both these things to happen, causing the next readdir(3) after the bad seekdir(3) to skip some entries and return the wrong file. Samba only needs to step back a single directory entry in the case where it reads an entry and then discovers it can't fit it into the buffer it is sending to the windows client. It turns out we can reliably cater to Samba's requirement because the "last returned element" is always still in memory, so with a little care, we can set our filepointer back to it safely. (once) seekdir and readdir (and telldir()) need a complete rewrite along with getdirentries() but that is more than a small edit like this. Can you explain your expectations from the whole readdir() vs. parallel directory modifications interaction ? From what I understood so far, there is unlocked modification of the container and parallel iterator over the same container. IMO, in such situation, whatever tweaks you apply to the iterator, it is still cannot be made reliable. Before making single-purpose changes to the libc readdir and seekdir code, or to the kernel code, it would be useful to state exact behaviour of the dirent machinery we want to see. No, 'make samba works in my situation' does not sound good enough. Consider the subsequence of entries that existed at opendir() time and were not removed until now. This subsequence is clearly defined and does not have concurrency problems. The order of this subsequence must remain unchanged and seekdir() must be correct with respect to this subsequence. Additionally, two other kinds of entries may be returned. New entries may be inserted anywhere in between the entries of the subsequence, and removed entries may be returned as if they were still part of the subsequence (so that not every readdir() needs a system call). A simple implementation for UFS-style directories is to store the offset in the directory (all bits of it, not masking off the lower 9 bits). This needs d_off or similar in struct dirent. The kernel getdirentries() then needs a similar loop as the old libc seekdir() to go from the start of the 512-byte directory block to the desired entry (since an entry may not exist at the stored offset within the directory block). At least it needs some more information in struct dirent than there is now.. A cookie is the current fashion.. that assumes however that the filesystems are capable of converting backwards from cookie to 'location'. ZFS claims to be able to do so.. Thee other thing to do would be to store some sort of strong hash of the name and inode# in each telldir entry.. we would seek to the saved seek location and seek forward computing or looking for a matching hash. I woudl also add that the man pages talk about the readdir blocksize a bit and mention the file blocksize (and stat) which is often way dfferent from 512 bytes.. usually 16k or so these days. I found setting the read size to the same as the fs blocksize seemd to work well. This means that a UFS-style directory cannot be compacted (existing entries moved from higher to lower offsets to fill holes) while it is open for reading. An NFS exported directory is always open for reading. yes so a UFS filesystem that is exported could never do garbage collection. This also means that duplicate entries can only be returned if that particular filename was deleted and created again. Without kernel support, it is hard to get telldir/seekdir completely reliable. The current libc implementation is wrong since the "holes" within the block just disappear and change the offsets of the following entries; the kernel cannot fix this using entries with d_fileno = 0 since it cannot know, in the general case, how long the deleted entry was in the filesystem-independent dirent format. yes it's the filesystem that knows.. we USED to return empty entries in the dirent list but that was removed recently think. My previous idea of storing one d_fileno during telldir() is wrong since it will fail if that entry is deleted. If you do not care about memory usage (which probably is already excessive with the current libc implementation), you could store at telldir() time the offset of the current block returned by getdirentries() and the d_fileno of all entries already returned in the current block. The D2410 patch can conceptually work for what Samba needs, stepping back one directory entry. I will comment on it. ___
Re: seekdir/readdir patch .. Call for Review.
On 5/3/15 10:33 PM, Jilles Tjoelker wrote: On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov wrote: On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: if you are interested in readdir(3), seekdir(3) and telldir(3) then you should look at https://reviews.freebsd.org/D2410 this patches around a problem in seekdir() that breaks Samba. Seekdir(3) will not work as expected when files prior to the point of interest in directory have been deleted since the directory was opened. Windows clients using Samba cause both these things to happen, causing the next readdir(3) after the bad seekdir(3) to skip some entries and return the wrong file. Samba only needs to step back a single directory entry in the case where it reads an entry and then discovers it can't fit it into the buffer it is sending to the windows client. It turns out we can reliably cater to Samba's requirement because the "last returned element" is always still in memory, so with a little care, we can set our filepointer back to it safely. (once) seekdir and readdir (and telldir()) need a complete rewrite along with getdirentries() but that is more than a small edit like this. Can you explain your expectations from the whole readdir() vs. parallel directory modifications interaction ? From what I understood so far, there is unlocked modification of the container and parallel iterator over the same container. IMO, in such situation, whatever tweaks you apply to the iterator, it is still cannot be made reliable. Before making single-purpose changes to the libc readdir and seekdir code, or to the kernel code, it would be useful to state exact behaviour of the dirent machinery we want to see. No, 'make samba works in my situation' does not sound good enough. Consider the subsequence of entries that existed at opendir() time and were not removed until now. This subsequence is clearly defined and does not have concurrency problems. The order of this subsequence must remain unchanged and seekdir() must be correct with respect to this subsequence. Additionally, two other kinds of entries may be returned. New entries may be inserted anywhere in between the entries of the subsequence, and removed entries may be returned as if they were still part of the subsequence (so that not every readdir() needs a system call). A simple implementation for UFS-style directories is to store the offset in the directory (all bits of it, not masking off the lower 9 bits). This needs d_off or similar in struct dirent. The kernel getdirentries() then needs a similar loop as the old libc seekdir() to go from the start of the 512-byte directory block to the desired entry (since an entry may not exist at the stored offset within the directory block). This means that a UFS-style directory cannot be compacted (existing entries moved from higher to lower offsets to fill holes) while it is open for reading. An NFS exported directory is always open for reading. This also means that duplicate entries can only be returned if that particular filename was deleted and created again. Without kernel support, it is hard to get telldir/seekdir completely reliable. The current libc implementation is wrong since the "holes" within the block just disappear and change the offsets of the following entries; the kernel cannot fix this using entries with d_fileno = 0 since it cannot know, in the general case, how long the deleted entry was in the filesystem-independent dirent format. My previous idea of storing one d_fileno during telldir() is wrong since it will fail if that entry is deleted. If you do not care about memory usage (which probably is already excessive with the current libc implementation), you could store at telldir() time the offset of the current block returned by getdirentries() and the d_fileno of all entries already returned in the current block. The D2410 patch can conceptually work for what Samba needs, stepping back one directory entry. I will comment on it. thanks your comment is correct, but I don't think it really matters because I'm only claiming to fix a really small set of possible usages.. I might add a sentence in the seekdir man page specifying what does and doesn't work. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: seekdir/readdir patch .. Call for Review.
On 03.05.2015 16:01, Julian Elischer wrote: >> Before making single-purpose changes to the libc readdir and seekdir >> code, or to the kernel code, it would be useful to state exact behaviour >> of the dirent machinery we want to see. No, 'make samba works in my >> situation' does not sound good enough. > Well samba is a MAJOR application for FreeBSD. there are many people > who combine the two, (e.g. FreeNAS, which suffers from this problem) > so taking it from "The Samba community does not recommend > running on FreeBSD due to problems with seekdir" to "Samba works > correctly on FreeBSD" is important.. Samba's behaviour here is governed > by Windows expectiations. Please look at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=198819 too. Perhaps it is related or may be not (sorry currently I am not able to inspect the code). -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: seekdir/readdir patch .. Call for Review.
On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov wrote: > On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: > > if you are interested in readdir(3), seekdir(3) and telldir(3) then > > you should look at > >https://reviews.freebsd.org/D2410 > > this patches around a problem in seekdir() that breaks Samba. > > Seekdir(3) will not work as expected when files prior to the point of > > interest in directory have been deleted since the directory was opened. > > Windows clients using Samba cause both these things to happen, causing > > the next readdir(3) after the bad seekdir(3) to skip some entries and > > return the wrong file. > > Samba only needs to step back a single directory entry in the case > > where it reads an entry and then discovers it can't fit it into the > > buffer it is sending to the windows client. It turns out we can > > reliably cater to Samba's requirement because the "last returned > > element" is always still in memory, so with a little care, we can > > set our filepointer back to it safely. (once) > > seekdir and readdir (and telldir()) need a complete rewrite along with > > getdirentries() but that is more than a small edit like this. > Can you explain your expectations from the whole readdir() vs. parallel > directory modifications interaction ? From what I understood so far, > there is unlocked modification of the container and parallel iterator > over the same container. IMO, in such situation, whatever tweaks you > apply to the iterator, it is still cannot be made reliable. > Before making single-purpose changes to the libc readdir and seekdir > code, or to the kernel code, it would be useful to state exact behaviour > of the dirent machinery we want to see. No, 'make samba works in my > situation' does not sound good enough. Consider the subsequence of entries that existed at opendir() time and were not removed until now. This subsequence is clearly defined and does not have concurrency problems. The order of this subsequence must remain unchanged and seekdir() must be correct with respect to this subsequence. Additionally, two other kinds of entries may be returned. New entries may be inserted anywhere in between the entries of the subsequence, and removed entries may be returned as if they were still part of the subsequence (so that not every readdir() needs a system call). A simple implementation for UFS-style directories is to store the offset in the directory (all bits of it, not masking off the lower 9 bits). This needs d_off or similar in struct dirent. The kernel getdirentries() then needs a similar loop as the old libc seekdir() to go from the start of the 512-byte directory block to the desired entry (since an entry may not exist at the stored offset within the directory block). This means that a UFS-style directory cannot be compacted (existing entries moved from higher to lower offsets to fill holes) while it is open for reading. An NFS exported directory is always open for reading. This also means that duplicate entries can only be returned if that particular filename was deleted and created again. Without kernel support, it is hard to get telldir/seekdir completely reliable. The current libc implementation is wrong since the "holes" within the block just disappear and change the offsets of the following entries; the kernel cannot fix this using entries with d_fileno = 0 since it cannot know, in the general case, how long the deleted entry was in the filesystem-independent dirent format. My previous idea of storing one d_fileno during telldir() is wrong since it will fail if that entry is deleted. If you do not care about memory usage (which probably is already excessive with the current libc implementation), you could store at telldir() time the offset of the current block returned by getdirentries() and the d_fileno of all entries already returned in the current block. The D2410 patch can conceptually work for what Samba needs, stepping back one directory entry. I will comment on it. -- Jilles Tjoelker ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: seekdir/readdir patch .. Call for Review.
On 5/2/15 12:17 AM, Konstantin Belousov wrote: On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: if you are interested in readdir(3), seekdir(3) and telldir(3) then you should look at https://reviews.freebsd.org/D2410 this patches around a problem in seekdir() that breaks Samba. Seekdir(3) will not work as expected when files prior to the point of interest in directory have been deleted since the directory was opened. Windows clients using Samba cause both these things to happen, causing the next readdir(3) after the bad seekdir(3) to skip some entries and return the wrong file. Samba only needs to step back a single directory entry in the case where it reads an entry and then discovers it can't fit it into the buffer it is sending to the windows client. It turns out we can reliably cater to Samba's requirement because the "last returned element" is always still in memory, so with a little care, we can set our filepointer back to it safely. (once) seekdir and readdir (and telldir()) need a complete rewrite along with getdirentries() but that is more than a small edit like this. Can you explain your expectations from the whole readdir() vs. parallel directory modifications interaction ? From what I understood so far, there is unlocked modification of the container and parallel iterator over the same container. IMO, in such situation, whatever tweaks you apply to the iterator, it is still cannot be made reliable. Before making single-purpose changes to the libc readdir and seekdir code, or to the kernel code, it would be useful to state exact behaviour of the dirent machinery we want to see. No, 'make samba works in my situation' does not sound good enough. Well samba is a MAJOR application for FreeBSD. there are many people who combine the two, (e.g. FreeNAS, which suffers from this problem) so taking it from "The Samba community does not recommend running on FreeBSD due to problems with seekdir" to "Samba works correctly on FreeBSD" is important.. Samba's behaviour here is governed by Windows expectiations. I am specifically NOT changing kernel code.. though in discussion with jhb and others, it becomes clear that this will have to happen some time in the future. There is no way to make seekdir 100% reliable without going to exposing cookies through getdirentries(), and even then it relies on the file systems doing the right thing with the cookies. Don't let Perfect be the enemy of 'better'. I think hte two changes here make seekdir much more reliable. not only does it make the 'back up one entry' case 100% reliable, but it also makes any backwards seek within the current buffer also work correctly. The expectation is that if you read an entry and then "immediately" set the pointer back by one so that you can read it again, that hte next readdir() will once again return the same entry regardless of whether any deletes have happenned between the seekdir and the readdir() In other words if we 'pre-read' an item to find its size, and then read it again.. we get the same item. the change is small, makes it more reliable and helps a number of FreeBSD users (e.g. FreeNAS). I don't see why it would be objectionable to anyone. unless you are relying an samba failing to delete random files when you try delete a subdirectory. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: seekdir/readdir patch .. Call for Review.
On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: > if you are interested in readdir(3), seekdir(3) and telldir(3) then > you should look at >https://reviews.freebsd.org/D2410 > > this patches around a problem in seekdir() that breaks Samba. > Seekdir(3) will not work as expected when files prior to the point of > interest in directory have been deleted since the directory was opened. > > Windows clients using Samba cause both these things to happen, causing > the next readdir(3) after the bad seekdir(3) to skip some entries and > return the wrong file. > > Samba only needs to step back a single directory entry in the case > where it reads an entry and then discovers it can't fit it into the > buffer it is sending to the windows client. It turns out we can > reliably cater to Samba's requirement because the "last returned > element" is always still in memory, so with a little care, we can > set our filepointer back to it safely. (once) > > seekdir and readdir (and telldir()) need a complete rewrite along with > getdirentries() but that is more than a small edit like this. Can you explain your expectations from the whole readdir() vs. parallel directory modifications interaction ? From what I understood so far, there is unlocked modification of the container and parallel iterator over the same container. IMO, in such situation, whatever tweaks you apply to the iterator, it is still cannot be made reliable. Before making single-purpose changes to the libc readdir and seekdir code, or to the kernel code, it would be useful to state exact behaviour of the dirent machinery we want to see. No, 'make samba works in my situation' does not sound good enough. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
seekdir/readdir patch .. Call for Review.
if you are interested in readdir(3), seekdir(3) and telldir(3) then you should look at https://reviews.freebsd.org/D2410 this patches around a problem in seekdir() that breaks Samba. Seekdir(3) will not work as expected when files prior to the point of interest in directory have been deleted since the directory was opened. Windows clients using Samba cause both these things to happen, causing the next readdir(3) after the bad seekdir(3) to skip some entries and return the wrong file. Samba only needs to step back a single directory entry in the case where it reads an entry and then discovers it can't fit it into the buffer it is sending to the windows client. It turns out we can reliably cater to Samba's requirement because the "last returned element" is always still in memory, so with a little care, we can set our filepointer back to it safely. (once) seekdir and readdir (and telldir()) need a complete rewrite along with getdirentries() but that is more than a small edit like this. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"