Hi Mark,

Very detailed bug report.  Never apologize for that!

The O_TRUNC is there to avoid a race condition where someone has created the file and put data in it between the two open()s, I would guess.

I believe that there have been some recent improvements in O_APPEND support, but Sam or Rob will recall better than I, I'd bet.

The problem with O_APPEND is that technically we're supposed to note if some other process, on some other node, adds to the file. But your code doesn't test that functionality (which I guarantee will not work). We had some problems with how we had implemented O_APPEND, so your bug report isn't a complete surprise.

Murali is probably best qualified to comment on the difference in the kernels, but he is on vacation for the near future.

So we wait for someone better prepared to answer (I'm on vacation actually, or I would wade through the CVS logs and figure it out).

Thanks for the report, and I think the answer is going to be that we have a handle on this one.

Rob

Mark Bartelt wrote:
Hi, PVFS-ers ...  Sorry about the length of this,  but it
seems better to be verbose and complete than to leave out
some pertinent details.  Anyway ...

We're running PVFS2 (1.3.2); metadata servers and storage
nodes are x86 systems; the PVFS clients are Opteron-based
systems from HP, running SuSE SLES9.

Somebody reported problems appending to a file which lived
in the /pvfs hierarchy, using the shell's ">>" redirection
of standard output; the file contents were garbled.

Eventually, we discovered that although the problem existed
for all sh-family shells (sh,ksh,bash), appending worked OK
when csh or tcsh was used.

The fact that ">>" appending works OK for csh-family shells
but not for sh-family ones isn't really so much a function
of the shell itself, but just the fact that the former uses
different arguments to the open() system call when opening
a file for appending from what the latter uses.  That being
the case, it was easy to create a program which exhibits the
(mis)behaviour (or lack of it), totally taking the shell out
of the picture.

Appended at the end of this message is a short C program
which illustrates the problem.  It should output a line
of 60 'x' characters, then a line with 20 'y', and then
one with five 'z'.

When compiled with "-DBAD", it will use the open() args
which sh uses when appending; they're certainly correct.
Compiled without "-DBAD", it will use the same arguments
as csh uses.

(If csh's use of O_TRUNC seems odd for an "append", it's
because whereas sh does a one-step open-for-appending by
using O_WRONLY|O_APPEND|O_CREAT, csh first tries opening
an existing file with O_WRONLY|O_APPEND, and then if that
fails (because the file doesn't exist), it creates a new
file using O_WRONLY|O_CREAT|O_TRUNC.  One can argue that
the O_TRUNC is pointless if you're creating a _new_ file,
and I wouldn't disagree; but then, I've never thought of
csh as having been particularly well written ...)
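
The two strategies described above can be sketched roughly as follows. This is only an illustration built from the description in this message, not code taken from either shell; the function names, APPEND_MODE, and the ENOENT check are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical permission bits for newly created files (rw-r--r--). */
#define APPEND_MODE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)

/* sh-style: a single open() that appends and creates the file if needed. */
int open_append_sh_style(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_APPEND | O_CREAT, APPEND_MODE);
}

/* csh-style: try an existing file first, then fall back to creating it. */
int open_append_csh_style(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0 && errno == ENOENT) {
        /* Assumption: only retry when the file is missing.  A file that
         * someone else created between the two calls is truncated here. */
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, APPEND_MODE);
    }
    return fd;
}

/* Append text to path with either strategy; 0 on success, -1 on failure.
 * A short write is treated as a failure to keep the sketch small. */
int append_text(int csh_style, const char *path, const char *text)
{
    size_t  len = strlen(text);
    ssize_t n;
    int     fd;

    fd = csh_style ? open_append_csh_style(path) : open_append_sh_style(path);
    if (fd < 0)
        return -1;
    n = write(fd, text, len);
    if (close(fd) < 0)
        return -1;
    return (n == (ssize_t)len) ? 0 : -1;
}
```

Note that in the csh-style path the descriptor returned by the fallback open() does not have O_APPEND set at all, which is why a freshly created file never exercises the append code path.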

So here we compile the program (both the "good" and "bad"
versions), then run each of them twice (once creating the
output file in /tmp, then in /pvfs/scratch/test).  After
all that, we look at the results; only the output of the
"bad" executable in /pvfs/scratch/test is different from
what it should be ...

SHC-TEST> cc -o good prog.c
SHC-TEST> cc -DBAD -o bad prog.c
SHC-TEST>
SHC-TEST> rm -f /tmp/good.out /tmp/bad.out /pvfs/scratch/test/good.out /pvfs/scratch/test/bad.out
SHC-TEST> good /tmp/good.out
SHC-TEST> bad /tmp/bad.out
SHC-TEST> good /pvfs/scratch/test/good.out
SHC-TEST> bad /pvfs/scratch/test/bad.out
SHC-TEST>
SHC-TEST> cat /tmp/good.out
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
yyyyyyyyyyyyyyyyyyyy
zzzzz
SHC-TEST>
SHC-TEST> cat /tmp/bad.out
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
yyyyyyyyyyyyyyyyyyyy
zzzzz
SHC-TEST>
SHC-TEST> cat /pvfs/scratch/test/good.out
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
yyyyyyyyyyyyyyyyyyyy
zzzzz
SHC-TEST>
SHC-TEST> cat /pvfs/scratch/test/bad.out
zzzzz
yyyyyyyyyyyyyy
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

So in this last one, it appears that the file pointer gets
set back to the beginning of the file on each write() call
(but only if the file attached to that file descriptor is
in the /pvfs hierarchy).
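
To check that hypothesis against the garbled listing, here is a small in-memory sketch that replays the same three writes, either at end-of-file or always at offset 0. Nothing here touches PVFS; struct mem_file, write_at() and replay_writes() are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FILE_CAP 256

#define TEN_X   "xxxxxxxxxx"
#define TEN_Y   "yyyyyyyyyy"
#define LONG_x  TEN_X TEN_X TEN_X TEN_X TEN_X TEN_X "\n"   /* 60 'x' */
#define MED_y   TEN_Y TEN_Y "\n"                            /* 20 'y' */
#define SHORT_z "zzzzz\n"                                   /*  5 'z' */

/* A tiny in-memory "file": a NUL-terminated buffer plus its length. */
struct mem_file {
    char   data[FILE_CAP];
    size_t len;
};

/* Write n bytes at offset off, extending the file if necessary. */
static void write_at(struct mem_file *f, size_t off, const char *buf, size_t n)
{
    assert(off + n < FILE_CAP);
    memcpy(f->data + off, buf, n);
    if (off + n > f->len)
        f->len = off + n;
    f->data[f->len] = '\0';
}

/* Replay the three writes.  With append != 0 each write lands at the
 * current end-of-file; with append == 0 each write lands at offset 0,
 * which is the suspected misbehaviour. */
void replay_writes(struct mem_file *f, int append)
{
    static const char *chunks[] = { LONG_x, MED_y, SHORT_z };
    size_t i;

    f->len = 0;
    f->data[0] = '\0';
    for (i = 0; i < sizeof chunks / sizeof chunks[0]; i++)
        write_at(f, append ? f->len : 0, chunks[i], strlen(chunks[i]));
}
```

Calling replay_writes(&f, 0) and printing f.data gives five 'z', then 14 'y', then 39 'x', each on its own line, with the file still 61 bytes long. That is consistent with the garbled output above: each shorter write overwrites only the front of what the previous one left behind.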

However ...  Although this problem is solidly reproducible
on systems running the 2.6.5-7.201-smp kernel, it _doesn't_
happen on systems running the 2.6.13-15.7-smp kernel.

So I'm a bit baffled:  This certainly seems like a PVFS bug
(since the problem happens only when the file being opened
in "append" mode is on a PVFS filesystem; everything is OK
on locally-mounted filesystems and NFS-mounted filesystems).

But if it's a PVFS bug, then why does the problem disappear
with the newer kernel?  Are there two independent bugs (one
each in the 2.6.5-7.201-smp kernel and in the PVFS2 client
code), _both_ of which need to be present for this problem
to happen?

In any case, it seemed worth reporting, since although it's
likely we can avoid the problem by using that newer kernel,
I'd feel more comfortable if the PVFS-side bug were found
(and squashed), lest it come back to haunt us again in the
future.

I'd be curious to learn whether anyone else can reproduce
the problem with the C program below on whatever kernel(s)
they're using.

Thanks in advance ...

=============================================

/*
 *  Program to demonstrate PVFS bug; output to files
 *  opened with O_CREAT|O_WRONLY|O_APPEND misbehaves;
 *  the file pointer appears to be reset to the start
 *  of the file before every write() operation.  (But
 *  the problem exists only with certain kernels!)
 *
 *  Takes exactly one command line argument (name
 *  of the file to which data get written).
 */

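#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>     /* exit() */
#include <string.h>     /* strlen() */
#include <unistd.h>     /* write(), close() */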

#define LONG_x  "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
#define MED_y   "yyyyyyyyyyyyyyyyyyyy\n"
#define SHORT_z "zzzzz\n"
#define ERRMSG  "Eh?!?\n"

#ifdef BAD
#define OFLAGS  (O_CREAT|O_WRONLY|O_APPEND)
#else
#define OFLAGS  (O_CREAT|O_WRONLY|O_TRUNC)
#endif

#define MODE    (S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH)

int main(int argc, char **argv)
{
        int     fd;

        if ( argc != 2 || (fd=open(argv[1],OFLAGS,MODE)) < 0 ) {
                write(2,ERRMSG,strlen(ERRMSG));
                exit(1);
        }
        write(fd,LONG_x,strlen(LONG_x));
        write(fd,MED_y,strlen(MED_y));
        write(fd,SHORT_z,strlen(SHORT_z));
        close(fd);
        exit(0);
}
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
