[Kernel-packages] [Bug 506798] Re: du crashes when traversing nfs mounted .snapshot directories

Bug Watch Updater Fri, 27 Oct 2017 15:13:53 -0700

Launchpad has imported 66 comments from the remote bug at
https://bugzilla.redhat.com/show_bug.cgi?id=501848.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2009-05-21T02:29:39+00:00 Issue wrote:

Escalated to Bugzilla from IssueTracker

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/0

------------------------------------------------------------------------
On 2009-05-21T02:29:41+00:00 Issue wrote:

Description of problem:
If you run a du -h on a directory with .snapshot sub directories with
coreutils-6.10+ (Could be lower, but >5.97-20) you will get a fts_read error:
du: fts_read failed: No such file or directory

How reproducible:
Every time.

Steps to Reproduce:
1. Use F10 or anything with the higher versions of coreutils, on a machine
with the .snapshot directories created by NetApp.
2. du -h
3. wait.

Actual results:
du: fts_read failed: No such file or directory

Expected results:
The size listing of all files and/or directories

Additional info:
This event sent from IssueTracker by cwyse [Pixar Animation Studios - Fedora
Queue]
issue 298936

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/1

------------------------------------------------------------------------
On 2009-05-21T03:18:21+00:00 Issue wrote:

I guess I spoke too soon. With coreutils-6.9-2 the problem is less
noticeable. On smaller directories it doesn't show up at all. But with
larger directories (directories with many sub directories) it is still
there, so some users will notice it and some will not. Problem is now
between:
coreutils-5.97-19 > and < coreutils-6.9-2.

I tried compiling the 6.7.* coreutils package but it keeps failing on
build and isn't saying what or why. I will look more into this and
update with what I find.

This event sent from IssueTracker by cwyse
issue 298936

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/2

------------------------------------------------------------------------
On 2009-05-21T07:36:22+00:00 Kamil wrote:

It can be caused by on-the-fly changes within the directory. du just
wants to traverse a directory (or file?) which no longer exists. I am
pretty sure you won't see the errors if you mount the file system
read-only.

But there is no doubt the error message could be more verbose; it is
listed as a FIXME in du.c.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/3

------------------------------------------------------------------------
On 2009-05-21T09:44:01+00:00 Ondrej wrote:

Additionally - I guess Fedora version should be changed to something not
EOL (as F-8 is EOL and F-9 will be EOL in ~2 months). From the comments
I think version should be changed to F-10 - correct? Or some RHEL
version?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/4

------------------------------------------------------------------------
On 2009-05-21T09:47:01+00:00 Ondrej wrote:

Additionally strace from the failure could be useful to better analyze
what's the culprit...

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/5

------------------------------------------------------------------------
On 2009-05-21T17:14:53+00:00 Charlie wrote:

Created attachment 344994
strace of failure

I agree, changing version to F10, I originally just set it for the first
version I noticed this problem in. Also, here is an strace of the
failure on a F10 machine. I'm gonna try some of the 6.7 packages again
and see if I can narrow down the window in which this fails.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/6

------------------------------------------------------------------------
On 2009-05-21T17:49:54+00:00 Issue wrote:

Finally got 6.7-1 compiled. It shows the same fts_read issue.
5.97-22 >< 6.7-1
This is about as narrowed as I can get it, I'm gonna try diff'ing up a
patch between du.c and... see what happens.

This event sent from IssueTracker by cwyse
issue 298936

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/7

------------------------------------------------------------------------
On 2009-05-21T18:46:02+00:00 Issue wrote:

Patching fts.c failed at compile time. There appears to be a fts.c.du file in
6.7 and a fts.c.inaccessibledirs in 5.7. Since the files do not exist in
both trees, I'm not sure how to test patching that.

This event sent from IssueTracker by cwyse
issue 298936

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/8

------------------------------------------------------------------------
On 2009-05-21T19:55:42+00:00 Kamil wrote:

Could you please try the following on the same directory?

$ find -printf %b\\n

Does it give the same errors? Different errors? No errors?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/9

------------------------------------------------------------------------
On 2009-05-21T20:54:59+00:00 Issue wrote:

Ran "find -printf %b\n" and didn't get any errors. It took over an hour
and ran my cpu at 121.6%, but no errors. Running it again to verify.

This event sent from IssueTracker by cwyse
issue 298936

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/10

------------------------------------------------------------------------
On 2009-05-21T21:23:49+00:00 Kamil wrote:

Does 'du' print just one error message and then die? Is the output
obviously incomplete? Or is the problem only about the error message and
return code?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/11

------------------------------------------------------------------------
On 2009-06-10T12:03:46+00:00 Ondrej wrote:

Something interesting to read (about the same issue and how to reduce impact):
http://www.unixtutorial.org/2009/02/troubleshooting-du-fts_read-no-such-file-or-directory-error/

From what I have quickly checked, if find's fts_read() returns NULL, it
just closes the FTS structure and goes to the next argument. If du's
fts_read() returns NULL, it checks errno and prints the corresponding
diagnostic. The difference is in the checking function: find has a
somewhat more complex checking function, consider_visiting(), and maybe
some parts of it should be used/adapted in du's process_file() function.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/12

------------------------------------------------------------------------
On 2009-07-20T19:08:25+00:00 Ondrej wrote:

Played a bit with that bz again - the fts_read error is set at
lib/fts.c:2000, which hardcodes errno to ENOENT. The error occurs when
the ".." entry is not cached yet, so by repeating the run a few times
after mount it seems to be possible to get rid of those errors and to
get the correct result. It seems that the check at fts.c:1997/1998 has
to be extended to properly handle that situation with the NetApp
.snapshot dir. Using du -Lsh also helps in some cases.
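
For readers following along, the kind of check described above can be
pictured roughly as follows. This is a Python sketch and not the gnulib
code: the helper name enter_directory is made up for illustration, and it
assumes the traversal remembers the device and inode numbers from an
earlier lstat() and compares them with an fstat() on the directory once
it has been opened.

```python
import errno
import os


def enter_directory(path, recorded):
    """Open `path` and verify it is still the directory seen earlier.

    `recorded` is the os.stat_result saved when the entry was first
    examined (hypothetical bookkeeping for this sketch). A dev/ino
    mismatch is reported as ENOENT, which is how a changed device
    number can surface as "No such file or directory".
    """
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        now = os.fstat(fd)
        if (now.st_dev, now.st_ino) != (recorded.st_dev, recorded.st_ino):
            raise OSError(errno.ENOENT, os.strerror(errno.ENOENT), path)
    except BaseException:
        # Do not leak the descriptor when the identity check fails.
        os.close(fd)
        raise
    return fd
```

If the device number recorded before the open differs from the one seen
after it, as in the NetApp case, a check of this shape fails even though
the directory is still there.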

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/13

------------------------------------------------------------------------
On 2009-07-21T11:27:05+00:00 Ondrej wrote:

Created attachment 354463
Workaround for ".." directories and ?race conditions?

Played a bit more with that fts_read failure; the attached patch works
around the issue. It seems that, due to what may be a caching race
condition, after fstat on the ".." fts entry it sometimes has the device
number of the parent directory (first run after mount).

e.g. (variable: fts_value : fstat_value):
devicenum: 25 : 33
inode: 8217100 : 8217100
The next run at the same place correctly has the same values for
fts_value and fstat_value, and it looks like:
devicenum: 33 : 33
inode: 8217100 : 8217100

I'm quite sure that the patch is NOT the correct way to solve the issue;
the race condition should be eliminated - but I'm not really sure
where. Filesystem? Kamil - any idea?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/14

------------------------------------------------------------------------
On 2009-07-21T11:37:27+00:00 Ondrej wrote:

Created attachment 354464
Better one ;) workaround for ".." directories and ?race conditions?

Damned, previous one was obviously not correct ... that one should be
better...

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/15

------------------------------------------------------------------------
On 2009-07-21T19:31:50+00:00 Charlie wrote:

I added the patch to the latest coreutils package and I haven't seen the
error yet. I ran a du -h over my lunch break. I'm letting my customer
try it out and give it his stamp of approval. But so far it looks like
it resolves the issue. I'll let you know if anything changes.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/16

------------------------------------------------------------------------
On 2009-07-22T14:01:18+00:00 Kamil wrote:

I've narrowed down the strange behavior to sort of minimal example
(/mnt/archive is a NetApp mount point):

umount /mnt/archive && mount /mnt/archive \
&& stat --printf "%d\t%i\t%n\n" /mnt/archive/.snapshot \
&& stat --printf "%d\t%i\t%n\n" /mnt/archive/.snapshot/hourly.0 \
&& stat --printf "%d\t%i\t%n\n" /mnt/archive/.snapshot

The output is as follows:

20 67 /mnt/archive/.snapshot
26 222 /mnt/archive/.snapshot/hourly.0
26 67 /mnt/archive/.snapshot

The device number is being changed on the fly while the inode number
stays unchanged. It sounds like a file system bug to me. It's 100%
reproducible on my box.
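
The same sequence can be scripted if that is more convenient than
chaining stat commands. The sketch below is only an illustration in
Python; the function names are made up, and it assumes that stat without
-L corresponds to lstat(). It does not remount anything, so the
umount/mount step still has to be done beforehand.

```python
import os


def dev_ino(path):
    """Return (st_dev, st_ino) without following a final symlink."""
    st = os.lstat(path)
    return st.st_dev, st.st_ino


def stat_around_descent(parent, child):
    """lstat `parent`, lstat an entry below it, then lstat `parent` again."""
    before = dev_ino(parent)
    inside = dev_ino(os.path.join(parent, child))
    after = dev_ino(parent)
    return before, inside, after
```

Calling stat_around_descent("/mnt/archive/.snapshot", "hourly.0") right
after a fresh mount would be the equivalent of the three stat calls
above; a first and last pair that differ only in the device number would
match the output shown.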

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/17

------------------------------------------------------------------------
On 2009-07-22T18:36:17+00:00 Charlie wrote:

I ran the patched coreutils package on my .snapshot directory 3 times
and didn't see a single error. It takes about 30 minutes to go through
the .snapshot directory. Before Ovasik's patch it would run for about
30 seconds then fail. Kdudka, are you using the patch and still
noticing this?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/18

------------------------------------------------------------------------
On 2009-07-22T18:51:16+00:00 Issue wrote:

Event posted on 07-22-2009 02:51pm EDT by cwyse

Customer just got back to me with some comments. This new package creates
.snapshot directories on his desktop. This was a problem in F10 which
went away with F11. So it looks like a slight regression?

This event sent from IssueTracker by cwyse
issue 298936

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/19

------------------------------------------------------------------------
On 2009-07-22T18:53:17+00:00 Charlie wrote:

Here are the previous bugs that were related to the .snapshots showing
up on the desktop. Just posting them here in case they help.

As noted in https://bugzilla.redhat.com/show_bug.cgi?id=472778 and
https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb44598, NetApp
filers use different FSIDs for the hidden snapshot directories they
provide.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/20

------------------------------------------------------------------------
On 2009-07-23T10:52:33+00:00 Kamil wrote:

(In reply to comment #21)
> Kdudka, are you using the patch and still noticing this?

The patch is only workaround for 'du' utility. It works for me, too. But
it does not fix the file system bug. The minimal example uses 'stat', so
it has nothing to do with that patch.

Comment #22 is missing some context here. Which package creates
the .snapshot directories on the customer's desktop? I am quite sure that
coreutils does not.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/21

------------------------------------------------------------------------
On 2009-08-05T14:53:25+00:00 Kamil wrote:

The problem persists with latest rawhide kernel:
Linux 2.6.31-0.122.rc5.git2.fc12.x86_64 #1 SMP Mon Aug 3 12:58:47 EDT 2009
x86_64

/etc/fstab:
filer-eng.brq.redhat.com:/vol/engineering/share /mnt/archive nfs ro 0 0

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/22

------------------------------------------------------------------------
On 2009-08-05T17:34:32+00:00 Jeff wrote:

Looking at the capture, it doesn't appear that the server is returning
inconsistent inode info. However Kamil's reproducer seems to indicate
that the client is changing the device number after it traverses into
the directory.

I suspect that this means that the client isn't doing the shrinkable
mount before returning the info on the first stat call.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/23

------------------------------------------------------------------------
On 2009-09-02T11:54:16+00:00 Jeff wrote:

Confirmed...same behavior in rawhide too. I can also reproduce this with
a non-netapp server simply by exporting a filesystem and then exporting
another filesystem mounted onto a subdir of the first fs. Nothing
netapp-specific here.

There's also a somewhat related problem...if a submount is done and then
gets automatically unmounted, then the device numbers can change and
even be reused for a completely different submount.

This is a bit tricky. On the one hand, the device number seems to change
and that's probably bad for some apps. On the other hand, do we really
want to trigger a mount just because someone did a stat() on the
directory where we would eventually do a submount?

If I have a ton of exports that are subdirs of another exported
filesystem I don't think I really want to do submounts of all of those
filesystems just because someone did a "ls -l" in that directory.

Unfortunately, the device numbers for NFS are allocated on the fly
during mount. So we can't easily "fake up" the device numbers and expect
them to remain consistent without actually triggering a mount. The
device number may be different once the submount gets done.

I suspect that the best we can probably do is to just make sure the
device number is different from that of the parent filesystem, but we
probably won't be able to make it consistent. That is, it'll change as
soon as you walk into the dir...

I'll plan to do a writeup of this problem in the near future and post it
to the upstream mailing list.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/24

------------------------------------------------------------------------
On 2009-09-02T16:47:45+00:00 Jeff wrote:

This problem is really no different than how autofs works. When you run
stat on an autofs mountpoint, you'll just get the directory until you
walk into that directory.

That's actually correct behavior since you're adding a new mount when
that occurs. This is almost completely the same thing, it's just that
the kernel does a new mount w/o needing autofs.

I'm not sure this is actually a bug; rather, you're just seeing expected
results when the kernel adds a new mount on the fly.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/25

------------------------------------------------------------------------
On 2009-09-02T17:07:37+00:00 Kamil wrote:

Jeff, thanks for the analysis. I'll look at the fts code again and
possibly reassign back to coreutils. Good to know it's reproducible
independently of the NetApp mount point.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/26

------------------------------------------------------------------------
On 2009-09-03T17:55:10+00:00 Jeff wrote:

Sounds good. I'll reassign this back to you for now.

Let me know if you need further clarification.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/27

------------------------------------------------------------------------
On 2009-10-18T15:51:21+00:00 Kamil wrote:

making the bug public...

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/28

------------------------------------------------------------------------
On 2009-10-21T18:27:51+00:00 Kamil wrote:

Reported upstream:
http://lists.gnu.org/archive/html/bug-gnulib/2009-10/msg00207.html

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/29

------------------------------------------------------------------------
On 2009-10-31T12:11:36+00:00 Jim wrote:

Hello,

Is this happening because the device number is assigned first to one
value initially, and later to another value -- all during a single
hierarchy traversal?

If so, I'll have to push this back into the kernel/file-system court.
I think we'll have to make the file system present a consistent device and
inode number for any file it serves.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/30

------------------------------------------------------------------------
On 2009-10-31T12:34:27+00:00 Kamil wrote:

(In reply to comment #40)
> Is this happening because the device number is assigned first to one value
> initially, and later to another value -- all during a single hierarchy
> traversal?

It looks like a sort of expected behavior to me. If the file system is
not mounted, the device number describes the directory which belongs to
the surrounding file system. Once you trigger the mount, the same path
(directory) belongs to the newly mounted file system, thus gets a new
device number.

In fact I was more surprised that the inode number could stay
consistent across the mounts.

> If so, I'll have to push this back into the kernel/file-system court.
> I think we'll have to make the file system present a consistent device and
> inode number for any file it serves.

Well, I'll try to prepare a complete client/server reproducer first, since
the one from comment #20 uses our internal server, not available to
others for testing.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/31

------------------------------------------------------------------------
On 2009-10-31T12:58:53+00:00 Jim wrote:

What event triggers the mount?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/32

------------------------------------------------------------------------
On 2009-10-31T13:13:44+00:00 Kamil wrote:

(In reply to comment #42)
> What event triggers the mount?

From my observation with gdb:
1. calling fstatat() with AT_SYMLINK_NOFOLLOW does NOT trigger the mount.
2. calling fstatat() without AT_SYMLINK_NOFOLLOW triggers the mount, as does
opening a directory.

If you are asking which events are guaranteed to trigger the mount
and/or which events are guaranteed to NOT trigger the mount, kernel guys
might give you a reliable answer.
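
For anyone who wants to repeat the observation without gdb, a small
Python sketch along these lines may help. The function name probe is
made up; it assumes that os.stat(..., follow_symlinks=False) behaves
like lstat(), i.e. like fstatat() with AT_SYMLINK_NOFOLLOW, and that
follow_symlinks=True behaves like a plain stat(). The order of the two
calls matters, since the second one is the one expected to trigger the
mount.

```python
import os


def probe(path):
    """Stat `path` without and then with symlink following.

    Returns the (st_dev, st_ino) pair seen by each call and whether the
    device number differs between the two.
    """
    nofollow = os.stat(path, follow_symlinks=False)
    follow = os.stat(path, follow_symlinks=True)
    return {
        "nofollow": (nofollow.st_dev, nofollow.st_ino),
        "follow": (follow.st_dev, follow.st_ino),
        "dev_changed": nofollow.st_dev != follow.st_dev,
    }
```

On an ordinary directory both calls report the same pair; on a submount
point that has not been entered yet, dev_changed would be expected to be
true on the first run after mount.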

Jeff, any idea?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/33

------------------------------------------------------------------------
On 2009-11-02T19:01:06+00:00 Jeff wrote:

Submounts are triggered via the follow_link inode operation, so in some
ways these are treated like symlinks...

The short answer is that the mount will be triggered whenever you walk a
path in such a way that, if this component were a symlink it would be
resolved to its target.

Longer answer:

If the place where you transition into a new filesystem is in the middle
of a path, then generally the path will be resolved. If it's the last
component of the path, then it depends on whether the LOOKUP_FOLLOW link
flag is set in nameidata in the kernel. That varies with the type of
operation -- for instance, lstat() won't have that set, but a "normal"
stat() generally will.
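
The "last component versus middle of the path" distinction can be seen
with an ordinary symlink, without any NFS involved. The Python sketch
below is only an analogy under that assumption: a symlink stands in for
the submount point, and the helper names are made up. lstat() on the
link itself does not resolve it, while lstat() on "link/." has to
resolve the link in order to walk on to ".".

```python
import os
import tempfile


def last_vs_middle(path):
    """Compare lstat(path) with lstat(path + "/.")."""
    as_last = os.lstat(path)
    as_middle = os.lstat(os.path.join(path, "."))
    return ((as_last.st_dev, as_last.st_ino),
            (as_middle.st_dev, as_middle.st_ino))


def demo():
    """Use a symlink to a directory as a stand-in for a submount point."""
    with tempfile.TemporaryDirectory() as top:
        target = os.path.join(top, "target")
        link = os.path.join(top, "link")
        os.mkdir(target)
        os.symlink(target, link)
        as_last, as_middle = last_vs_middle(link)
        st = os.lstat(target)
        target_id = (st.st_dev, st.st_ino)
        # First value: the link itself was NOT resolved as last component.
        # Second value: the link WAS resolved when it sat mid-path.
        return as_last != target_id, as_middle == target_id
```

By the description above, running last_vs_middle() on an NFS submount
point would be expected to show the device number changing between the
two calls in a similar way.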

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/34

------------------------------------------------------------------------
On 2009-11-03T12:41:30+00:00 Kamil wrote:

Minimal example which works reliably on my Fedora 11 installation:

# mount | grep ^/
/dev/sda1 on / type ext3 (rw)
/dev/sda3 on /home type ext4 (rw)

# ls -d /home/test
/home/test

# printf "/ *(fsid=0,crossmnt)\n/home *(crossmnt)\n" \
> /etc/exports

# service nfs restart
# mkdir /tmp/mnt
# mount -t nfs4 localhost:/ /tmp/mnt \
&& stat --printf "%d\t%i\t%n\n" /tmp/mnt/home \
&& stat --printf "%d\t%i\t%n\n" /tmp/mnt/home/test \
&& stat --printf "%d\t%i\t%n\n" /tmp/mnt/home

29 2 /tmp/mnt/home
30 12 /tmp/mnt/home/test
30 2 /tmp/mnt/home

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/35

------------------------------------------------------------------------
On 2009-11-03T20:22:57+00:00 Kamil wrote:

A patch for gnulib proposed upstream:

http://lists.gnu.org/archive/html/bug-gnulib/2009-11/msg00027.html

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/36

------------------------------------------------------------------------
On 2009-11-04T16:37:03+00:00 Kamil wrote:

(In reply to comment #46)
> A patch for gnulib proposed upstream:
>
> http://lists.gnu.org/archive/html/bug-gnulib/2009-11/msg00027.html

The patch has been rejected by upstream because of performance impact in
some obscure situations (namely traversing a directory which consists of
200000 directories nested in each other):

http://lists.gnu.org/archive/html/bug-gnulib/2009-11/msg00032.html

As a solution, it was proposed to find (or perhaps implement?) a low-cost
way of recognizing a mount point during the traversal. "Low cost" means
cheaper than a stat call here.

Since there seems to be nothing I can do with this bug at the moment, I
am reassigning it back to kernel.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/37

------------------------------------------------------------------------
On 2009-11-04T18:32:30+00:00 Jim wrote:

Hi Kamil,

Using your reproducer (above, thanks!) let's print one more dev/ino pair
(this is on F12):

$ stat --printf "%d %i %n\n" /tmp/mnt/home /tmp/mnt
24 2 /tmp/mnt/home
24 2 /tmp/mnt

That shows a big problem: two distinct directories have the same dev/ino pair,
and fts rightly objects, returning FTS_DC to indicate the directory cycle:
when fts encounters the same dev/ino pair twice in a traversal while not
following symlinks, that represents a hard-linked directory cycle, which is
usually a big problem. [Note that currently du does not diagnose this
problem, but I'll fix that shortly.]
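
To make the dev/ino cycle check concrete, here is a minimal Python
sketch of the idea. It is not how fts implements it; the function name
and the recursive walk are assumptions made for illustration. A
directory is flagged when its (st_dev, st_ino) pair equals that of one
of its own ancestors in the traversal.

```python
import os


def find_directory_cycles(root, follow_symlinks=False):
    """Walk `root` depth-first and return directories that repeat an ancestor."""
    cycles = []

    def visit(path, ancestors):
        st = os.stat(path, follow_symlinks=follow_symlinks)
        key = (st.st_dev, st.st_ino)
        if key in ancestors:
            # Same dev/ino as a directory we are already inside of.
            cycles.append(path)
            return
        with os.scandir(path) as entries:
            children = [e.path for e in entries
                        if e.is_dir(follow_symlinks=follow_symlinks)]
        for child in sorted(children):
            visit(child, ancestors | {key})

    visit(root, frozenset())
    return cycles
```

With the output shown above, a walk starting at /tmp/mnt would see
/tmp/mnt/home report the same pair as /tmp/mnt itself, so a check of
this kind would flag it as a cycle even though no hard-linked directory
exists.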

Even if the above kernel/nfs bug is fixed, I am becoming more and more
convinced that this varying-device-number problem is something that must
be addressed in the kernel, and not in every single application that
must perform dev/ino checks for security. Thanks for reassigning to the
kernel.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/38

------------------------------------------------------------------------
On 2009-11-04T18:51:58+00:00 Kamil wrote:

(In reply to comment #48)
> $ stat --printf "%d %i %n\n" /tmp/mnt/home /tmp/mnt
> 24 2 /tmp/mnt/home
> 24 2 /tmp/mnt

Good catch! Though I don't think you hit the cause of the original bug
report, this does indeed look broken. The dev/ino pair should be unique
across the whole VFS, or am I wrong?

Jeff, what do you think about the example?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/39

------------------------------------------------------------------------
On 2009-11-04T19:18:08+00:00 Jeff wrote:

I'd have to look at the example more closely, but it's likely that the
kernel code is picking up the inode number of the root inode of the
underlying filesystem.

I think what's happening is that the server sends the inode number of
/tmp/mnt/home and a new fsid, but the client doesn't actually spawn a
new submount there. So the device ID ends up the same. In fact, all of
my ext3/4 filesystems seem to give the root inode st_ino == 2, so that's
probably what's happening.

The trivial workaround here is to probably use stat() instead of lstat()
here (-L option to the stat program), but I imagine that won't be
suitable?

How to fix this? I don't think there is a way to do so without
triggering a submount even when we don't want to follow symlinks.

That's going to be very costly for performance in many cases (if it's
even reasonably doable). Imagine cd'ing into a directory that has a 1000
exported filesystems under it. Simply doing a readdir() in there is
going to make the client spawn 1000 new mounts.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/40

------------------------------------------------------------------------
On 2009-11-04T19:34:32+00:00 Kamil wrote:

(In reply to comment #50)
> The trivial workaround here is to probably use stat() instead of lstat() here
> (-L option to the stat program), but I imagine that won't be suitable?

Yep, this suppresses the bug as well as du -L in the original bug
report. But we get a different result, so it's really not suitable.

> How to fix this? I don't think there is a way to do so without triggering a
> submount even when we don't want to follow symlinks.

I think this *should* be fixed since it breaks one of the basic axioms
about VFS.

> That's going to be very costly for performance in many cases (if it's even
> reasonably doable). Imagine cd'ing into a directory that has a 1000 exported
> filesystems under it. Simply doing a readdir() in there is going to make the
> client spawn 1000 new mounts.

No chance to get unique dev/ino pairs without triggering the mount
first?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/41

------------------------------------------------------------------------
On 2009-11-04T19:50:18+00:00 Peter wrote:

No, sorry, no way to determine what the ino is for the new file system
without talking to the server.

Doing an ls in a directory full of many autofs mounted file systems
should not trigger mounts for all of those file systems. This will
cause a bigger performance problem than the original perceived
problem ever did.

Perhaps the right way to address this is to flag the returned
directory entries to the user level with something which indicates
that the metadata information for that entry will change if the
file system which would be mounted there was actually mounted
there. This would eliminate most of the extra stat calls that Jim
Meyering is worried about.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/42

------------------------------------------------------------------------
On 2009-11-04T20:07:51+00:00 Jim wrote:

FYI, I've (re)raised the issue on LKML:

http://lkml.org/lkml/2009/11/4/451

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/43

------------------------------------------------------------------------
On 2009-11-04T20:13:13+00:00 Jeff wrote:

Minor nit...we get the correct st_ino for the directory. The problem is
that we don't have accurate st_dev info at that point since the mount
hasn't occurred yet.

That said...it would be nice to be able to flag the entries in the way
that Peter suggests. The question is how to do that in a way that's
compatible with POSIX here.

Maybe we could declare a new S_IF* value for st_mode:

S_IFXDEV 020000

That should allow us to leave the S_IFDIR bit set and it employs a bit
that's outside of __S_IFMT. The kernel could set this bit in the statbuf
when it detects that the fsid on the inode is not the same as that of
the parent directory.
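
A quick sanity check on the proposed value may be worthwhile. With the
usual Linux definitions (S_IFMT is 0170000, S_IFCHR is 0020000, S_IFDIR
is 0040000), 020000 appears to sit inside the file-type mask rather than
outside it, so a directory with that bit set would read as a block
device to existing S_IS* tests. The Python sketch below only checks that
arithmetic; PROPOSED_S_IFXDEV is a made-up name for the value quoted
above, not a real kernel or libc constant.

```python
import stat

# Value quoted in the proposal above; hypothetical, not a real constant.
PROPOSED_S_IFXDEV = 0o020000


def overlaps_type_mask(bit):
    """True if `bit` lies inside the file-type field of st_mode."""
    return stat.S_IFMT(bit) != 0


def directory_with_flag_looks_like_blockdev(bit):
    """True if S_IFDIR combined with `bit` passes S_ISBLK and fails S_ISDIR."""
    mode = stat.S_IFDIR | bit
    return stat.S_ISBLK(mode) and not stat.S_ISDIR(mode)
```

If both functions return true for the chosen value, a different bit
would probably be needed for the idea to work as described.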

The big question is whether someone wants to implement this and
then sell it upstream :)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/44

------------------------------------------------------------------------
On 2009-11-05T16:18:53+00:00 Kamil wrote:

Another question is how coreutils will detect that running kernel has
the ability to indicate mount points, thus decide whether to use the
optimization or not.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/45

------------------------------------------------------------------------
On 2009-11-05T16:35:56+00:00 Peter wrote:

If an approach similar to what Jeff has suggested is used, then it won't
matter. If the kernel sets S_IFXDEV, then coreutils can use the
optimization. If it doesn't, then it won't?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/46

------------------------------------------------------------------------
On 2009-11-05T16:50:03+00:00 Kamil wrote:

Nope, if I understand it correctly, the semantics of the S_IFXDEV bit are
exactly the opposite. If the bit is set, we need to call stat again after
opening a directory. But if it's not set and we don't know if the kernel
provides this feature, we can't use the optimization and need to call
stat anyway. Or am I wrong?
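
Spelled out as a truth table, the reasoning above looks like this. It is
a Python sketch of the decision only; the function and parameter names
are invented, and the S_IFXDEV bit itself is still just a proposal.

```python
def must_restat_after_open(xdev_bit_set, kernel_sets_bit_reliably):
    """Decide whether a second stat is needed after opening a directory.

    xdev_bit_set: the (hypothetical) flag was present in st_mode.
    kernel_sets_bit_reliably: we somehow know the running kernel
    implements the flag, so its absence can be trusted.
    """
    if xdev_bit_set:
        # The entry is flagged as a potential mount point: stat again.
        return True
    # A clear bit only allows skipping the stat if the kernel is known
    # to set it whenever it applies.
    return not kernel_sets_bit_reliably
```

The optimization is available in only one of the four cases, which is
why some way of knowing that the kernel supports the flag is needed.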

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/47

------------------------------------------------------------------------
On 2009-11-05T17:06:58+00:00 Peter wrote:

Yes, sorry, I was looking at it the other way around.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/48

------------------------------------------------------------------------
On 2009-11-05T17:21:09+00:00 Kamil wrote:

I think we need either a bit with exactly the inverse value, or another
mechanism indicating that the kernel is able to set the S_IFXDEV bit
reliably.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/49

------------------------------------------------------------------------
On 2009-11-07T11:54:03+00:00 Jim wrote:

(In reply to comment #48)
> Using your reproducer (above, thanks!) let's print one more dev/ino pair
> (this is on F12):
>
> $ stat --printf "%d %i %n\n" /tmp/mnt/home /tmp/mnt
> 24 2 /tmp/mnt/home
> 24 2 /tmp/mnt
>
> That shows a big problem: two distinct directories have the same dev/ino pair,

FYI, I've opened a new BZ to track this separate problem:

https://bugzilla.redhat.com/show_bug.cgi?id=533569

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/51

------------------------------------------------------------------------
On 2010-04-27T14:26:07+00:00 Bug wrote:

This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 11 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora please change the 'version' of this
bug to the applicable version. If you are unable to change the version,
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

The process we are following is described here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/79

------------------------------------------------------------------------
On 2010-06-28T12:38:09+00:00 Bug wrote:

Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/80

------------------------------------------------------------------------
On 2010-06-28T13:59:17+00:00 Jim wrote:

I wish it could be closed...
Still afflicts rawhide.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/81

------------------------------------------------------------------------
On 2010-07-30T10:39:47+00:00 Bug wrote:

This bug appears to have been reported against 'rawhide' during the Fedora 14
development cycle.
Changing version to '14'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/82

------------------------------------------------------------------------
On 2010-11-24T16:57:34+00:00 Jim wrote:

still affects rawhide.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/85

------------------------------------------------------------------------
On 2011-01-11T13:47:33+00:00 Kamil wrote:

(In reply to comment #45)
> # mount -t nfs4 localhost:/ /tmp/mnt \
> && stat --printf "%d\t%i\t%n\n" /tmp/mnt/home \
> && stat --printf "%d\t%i\t%n\n" /tmp/mnt/home/test \
> && stat --printf "%d\t%i\t%n\n" /tmp/mnt/home
>
> 29 2 /tmp/mnt/home
> 30 12 /tmp/mnt/home/test
> 30 2 /tmp/mnt/home

FYI I tried the same example on my RHEL-5 machine and, surprisingly,
there seems to be no such optimization. The first lstat() syscall on
/tmp/mnt/home triggers the mount of /tmp/mnt/home and picks the
final dev/ino pair.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/86

------------------------------------------------------------------------
On 2011-01-11T14:01:37+00:00 Kamil wrote:

... but it is still reproducible with autofs mount points even on
RHEL-5.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/87

------------------------------------------------------------------------
On 2011-01-11T18:55:31+00:00 Jeff wrote:

I concur. I can't reproduce this any more either on nfsv4:

# mount /mnt/dantu \
&& stat --printf "%d\t%i\t%n\n" /mnt/dantu \
&& stat --printf "%d\t%i\t%n\n" /mnt/dantu/ext3 \
&& stat --printf "%d\t%i\t%n\n" /mnt/dantu/ext3/testfile \
&& stat --printf "%d\t%i\t%n\n" /mnt/dantu/ext3
24 2 /mnt/dantu
25 2 /mnt/dantu/ext3
25 49153 /mnt/dantu/ext3/testfile
25 2 /mnt/dantu/ext3

...in my setup the host exports a filesystem and "ext3" is a mounted and
exported filesystem under that. It seems like something has changed and
now lstat() calls are triggering the mount. I'm going back through the
changelogs now to see why it's different now.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/88

------------------------------------------------------------------------
On 2011-01-11T18:59:18+00:00 Jeff wrote:

I should point out that those last results were with my latest RHEL5
test kernels.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/89

------------------------------------------------------------------------
On 2011-01-11T19:13:36+00:00 Kamil wrote:

Jeff, sorry if my comment was confusing, but I think we both have
exactly the same results. This bug (501848) is against Fedora. RHEL-5
didn't repeat the bug with nfsv4 for me, but I am still able to
reproduce it on RHEL-5 with autofs. I wrote the comment here only as an
auxiliary observation while investigating bug 537463, which is against
RHEL-5.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/90

------------------------------------------------------------------------
On 2011-01-11T19:44:42+00:00 Jeff wrote:

No problem. It wasn't confusing. Steve asked me to have a look at this
and I was just surprised that I was unable to reproduce this on recent
RHEL5 kernels with NFSv4. Not sure why that is so far...

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/91

------------------------------------------------------------------------
On 2013-04-03T19:57:42+00:00 Fedora wrote:

This bug appears to have been reported against 'rawhide' during the Fedora 19
development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could affect also
pre-Fedora 19 development cycle bugs. We are very sorry. It will help us
with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/94

------------------------------------------------------------------------
On 2013-04-05T15:52:38+00:00 Justin wrote:

Is this still a problem with 3.9 based F19 kernels?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/95

------------------------------------------------------------------------
On 2013-04-23T17:26:31+00:00 Justin wrote:

This bug is being closed with INSUFFICIENT_DATA as there has not been a
response in 2 weeks. If you are still experiencing this issue,
please reopen and attach the relevant data from the latest kernel you are
running and any data that might have been requested previously.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/96

------------------------------------------------------------------------
On 2013-04-23T22:03:13+00:00 Kamil wrote:

The problem still exists in kernel-3.9.0-0.rc7.git3.1.fc20.x86_64. The
reproducer from comment #45 works for me:

[root@f20 ~]# mount -t nfs4 localhost:/ /tmp/mnt \
&& stat --printf "%d\t%i\t%n\n" /tmp/mnt/boot \
&& stat --printf "%d\t%i\t%n\n" /tmp/mnt/boot/grub2 \
&& stat --printf "%d\t%i\t%n\n" /tmp/mnt/boot
36 2 /tmp/mnt/boot
37 65025 /tmp/mnt/boot/grub2
37 2 /tmp/mnt/boot

Reply at:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/comments/97

** Changed in: coreutils (Fedora)
Status: Unknown => In Progress

** Changed in: coreutils (Fedora)
Importance: Unknown => Medium

** Bug watch added: Red Hat Bugzilla #472778
https://bugzilla.redhat.com/show_bug.cgi?id=472778

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/506798

Title:
du crashes when traversing nfs mounted .snapshot directories

Status in coreutils package in Ubuntu:
Triaged
Status in findutils package in Ubuntu:
Triaged
Status in linux package in Ubuntu:
Confirmed
Status in coreutils package in Fedora:
In Progress
Status in linux package in Fedora:
Won't Fix

Bug description:
Binary package hint: coreutils

I'm getting a problem where du errors (and exits) with "du: fts_read
failed: no such file or directory" when traversing a directory with a
NetApp ".snapshot" directory.

My understanding (clarified by the discussions linked below) is that:

1) The device ID/inode of a directory is recorded before the submount is made.
2) The device ID of the directory changes after the directory has been read
(via readdir which causes the submount)
3) After examining the contents of the directory du goes back up the tree
(via '..') finds the device ID doesn't match what it has recorded and assumes
things have been moved around under it and bails for safety reasons.
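
A minimal sketch of how steps 1 to 3 could be observed from the shell,
assuming GNU stat (the same `stat --printf` used in the Red Hat
comments). The helper names dev_ino and check_dir are hypothetical, and
the script only mimics the before/after comparison; it is not the code
that du or fts actually runs.

```shell
#!/bin/sh
# Hypothetical helper: print "device:inode" for a path.
# GNU stat format: %d = device number, %i = inode number.
dev_ino() {
    stat --printf '%d:%i' "$1"
}

# Hypothetical helper: record the device/inode pair of a directory,
# read the directory, then compare the pair again.
check_dir() {
    dir=$1
    before=$(dev_ino "$dir") || return 2
    # Reading the directory is what is expected to trigger the submount
    # on an affected setup (assumption taken from step 2 above).
    ls -A "$dir" >/dev/null 2>&1
    after=$(dev_ino "$dir") || return 2
    if [ "$before" = "$after" ]; then
        echo "stable $before $dir"
    else
        echo "changed $before -> $after $dir"
        return 1
    fi
}
```

Running `check_dir` on a freshly mounted directory that contains a
".snapshot" (or any automounted) subdirectory should report "changed"
on an affected system and "stable" otherwise. A "changed" result is the
mismatch that makes du give up in step 3.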

I've researched online and this is an upstream bug. We're using
Ubuntu 9.10 so I feel it should also be tracked as a bug in Ubuntu.

The best information I've found is within Red Hat's Bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=501848
https://bugzilla.redhat.com/show_bug.cgi?id=533569

This bug has also been discussed on the bug-gnulib mailing list:

http://lists.gnu.org/archive/html/bug-gnulib/2009-11/msg00027.html
http://lists.gnu.org/archive/html/bug-gnulib/2009-11/msg00032.html

and LKML:

http://lkml.org/lkml/2009/11/4/451

Unfortunately none of these discussions has resulted in a widely
accepted solution.

We use NetApp .snapshots very extensively and can't afford for du to
be unreliable. At the moment we will either have to patch du or
downgrade all of coreutils to an older version.

For comparison we are upgrading from Ubuntu 7.04 which works
perfectly.

There is a similar problem with find, but it has a --without-fts build
option which 'fixes' it.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/506798/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
