On 5/6/15 7:33 AM, Rick Macklem wrote:
Julian Elischer wrote:
On 5/3/15 10:33 PM, Jilles Tjoelker wrote:
On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov
On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote:
if you are interested in readdir(3), seekdir(3) and telldir(3)
you should look at
this patches around a problem in seekdir() that breaks Samba.
Seekdir(3) will not work as expected when files prior to the
point of
interest in directory have been deleted since the directory was
Windows clients using Samba cause both these things to happen,
the next readdir(3) after the bad seekdir(3) to skip some entries
return the wrong file.
Samba only needs to step back a single directory entry in the
where it reads an entry and then discovers it can't fit it into
buffer it is sending to the windows client. It turns out we can
reliably cater to Samba's requirement because the "last returned
element" is always still in memory, so with a little care, we can
set our filepointer back to it safely. (once)
seekdir and readdir (and telldir()) need a complete rewrite along
getdirentries() but that is more than a small edit like this.
Can you explain your expectations from the whole readdir() vs.
directory modifications interaction ?  From what I understood so
there is unlocked modification of the container and parallel
over the same container.  IMO, in such situation, whatever tweaks
apply to the iterator, it is still cannot be made reliable.
Before making single-purpose changes to the libc readdir and
code, or to the kernel code, it would be useful to state exact
of the dirent machinery we want to see. No, 'make samba works in
situation' does not sound good enough.
Consider the subsequence of entries that existed at opendir() time
were not removed until now. This subsequence is clearly defined and
not have concurrency problems. The order of this subsequence must
unchanged and seekdir() must be correct with respect to this

Additionally, two other kinds of entries may be returned. New
may be inserted anywhere in between the entries of the subsequence,
removed entries may be returned as if they were still part of the
subsequence (so that not every readdir() needs a system call).

A simple implementation for UFS-style directories is to store the
in the directory (all bits of it, not masking off the lower 9
This needs d_off or similar in struct dirent. The kernel
then needs a similar loop as the old libc seekdir() to go from the
of the 512-byte directory block to the desired entry (since an
entry may
not exist at the stored offset within the directory block).
At least it needs some more information in struct dirent than there
A cookie is the current fashion..   that assumes however that the
are capable of converting backwards from cookie to 'location'. ZFS
claims to
be able to do so..
My current plan for a patch is...
- d_cookie would be the "physical" file system position of the next
   directory entry
- a ngetdirentries() would take a "physical" cookie as a value/result
   argument. It would indicate where to start and would return the cookie
   for the next entry after what is returned. (It could probably be stuffed
   in uio_offset, but I think it might be clearer to make it a separate arg.)
   --> It would pass this physical cookie down to the file system's
       VOP_READDIR(). (For UFS it would be the byte offset in the on-disk
       directory. For ZFS, I believe it is an index for the entry. For
       NFS, it is the cookie that is sent to the server. For others, I
       don't yet know.)
- dd_seek, dd_loc, loc_seek and loc_loc would be replaced by dd_cookie and
   loc_cookie. (For arches where sizeof(long) == 8, I think telldir() could
   just return the cookie and forget about the loc_XX structures?)
This would get rid of the loc_seek, loc_loc bogosity that no longer makes
much sense, since the byte offset in what is returned by getdirentries()
has little to do with the "physical" directory position.

I have already done the kernel stuff for some file systems and the libc
changes actually simplify things compared to what is there now.

zfs already has a cookie production facility as part of the VFS readdir (or is it dir-read?)


Thee other thing to do would be to store some sort
of strong
hash of the name and inode# in each telldir entry..
we would seek to the saved seek location and seek forward computing
for a matching hash. I woudl also add that the man pages talk about
readdir blocksize a bit and mention the file blocksize (and stat)
which is often
way dfferent from 512 bytes.. usually 16k or so these days.
I found setting the read size to the same as the fs blocksize seemd
work well.
This means that a UFS-style directory cannot be compacted (existing
entries moved from higher to lower offsets to fill holes) while it
open for reading. An NFS exported directory is always open for
yes so a UFS filesystem that is exported could never do garbage
This also means that duplicate entries can only be returned if that
particular filename was deleted and created again.

Without kernel support, it is hard to get telldir/seekdir
reliable. The current libc implementation is wrong since the
within the block just disappear and change the offsets of the
entries; the kernel cannot fix this using entries with d_fileno = 0
since it cannot know, in the general case, how long the deleted
was in the filesystem-independent dirent format.
yes it's the filesystem that knows.. we USED to return empty entries
in the dirent list but that was removed recently think.
   My previous idea of
storing one d_fileno during telldir() is wrong since it will fail
that entry is deleted.

If you do not care about memory usage (which probably is already
excessive with the current libc implementation), you could store at
telldir() time the offset of the current block returned by
getdirentries() and the d_fileno of all entries already returned in
current block.

The D2410 patch can conceptually work for what Samba needs,
back one directory entry. I will comment on it.

freebsd-current@freebsd.org mailing list
To unsubscribe, send any mail to

freebsd-current@freebsd.org mailing list
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

Reply via email to