Re: [Ocfs2-devel] [PATCH] Track negative dentries

2010-06-22 Thread Wengang Wang
On 10-06-21 22:38, Sunil Mushran wrote:
 Are you referring to the dentry lock? We encode the parent inode in
 the lock. But the difference is that a dentry having that lock
 cannot be a negative dentry.

Yes, I know that currently dentry->d_fsdata is used for the dentry lock.
In my patch, I make it a union of the dentry lock and the parent ino,
so that I can set the parent ino on a negative dentry.

But as I said, the dentry is unhashed because it's negative in
do_lookup(). So the fsdata, the parent ino, is lost. Thus when a second
such 'ls' comes, we get a new dentry with d_fsdata being NULL, so we
have to hit disk (ocfs2_lookup).
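
Roughly, the shape I have in mind for d_fsdata is something like the
sketch below (names are illustrative only, not the actual patch):

struct ocfs2_dentry_fsdata {
	int is_negative;			/* illustrative discriminator */
	union {
		struct ocfs2_dentry_lock *dl;	/* positive dentry: usual lock */
		u64 parent_blkno;		/* negative dentry: parent ino */
	};
};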

regards,
wengang.

 
 On Jun 21, 2010, at 10:22 PM, Wengang Wang
 wen.gang.w...@oracle.com wrote:
 
 On 10-06-21 20:36, Joel Becker wrote:
 On Tue, Jun 22, 2010 at 11:23:33AM +0800, Wengang Wang wrote:
 Actually I meant two dentries in the two runs of 'ls -l'.
 At the first run, a dentry, dentry A, is created. Because
 fileA doesn't exist, dentry->d_inode is NULL. Then a do_revalidate()
 is run, ocfs2_dentry_revalidate() returns not valid since d_inode is
 NULL. Thus dentry A is unhashed from the cache by d_invalidate().
 At the second run of 'ls -l', since dentry A is unhashed, there is
 no dentry for fileA in the dentry hash, so a new dentry, dentry B,
 is created. The new dentry B doesn't have any info about the parent ino.
 
 I found that when testing my patch for tracking negative dentries.
 Maybe I misunderstand something?
 
When that dentry gets linked into the tree, it will point to its
 d_parent.  So the parent inode is dentryB->d_parent->d_inode.
 
 Yes Joel, I know that link.
 I meant there is no parent inode number stored on dentryB itself,
 thus no way to compare it with dentryB->d_parent->d_inode.
 We set the parent inode number on dentryA (somewhere in d_fsdata).
 But it's lost at the second 'ls -l' because dentryB is used instead
 of dentryA.
 
 regards,
 wengang.

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] Track negative dentries

2010-06-22 Thread Sunil Mushran
Are you referring to the dentry lock? We encode the parent inode in  
the lock. But the difference is that a dentry having that lock cannot  
be a negative dentry.

On Jun 21, 2010, at 10:22 PM, Wengang Wang wen.gang.w...@oracle.com  
wrote:

 On 10-06-21 20:36, Joel Becker wrote:
 On Tue, Jun 22, 2010 at 11:23:33AM +0800, Wengang Wang wrote:
 Actually I meant two dentries in the two runs of 'ls -l'.
 At the first run, a dentry, dentry A, is created. Because fileA
 doesn't exist, dentry->d_inode is NULL. Then a do_revalidate() is run,
 ocfs2_dentry_revalidate() returns not valid since d_inode is NULL.
 Thus dentry A is unhashed from the cache by d_invalidate().
 At the second run of 'ls -l', since dentry A is unhashed, there is
 no dentry for fileA in the dentry hash, so a new dentry, dentry B,
 is created. The new dentry B doesn't have any info about the parent ino.

 I found that when testing my patch for tracking negative dentries.
 Maybe I misunderstand something?

When that dentry gets linked into the tree, it will point to its
 d_parent.  So the parent inode is dentryB->d_parent->d_inode.

 Yes Joel, I know that link.
 I meant there is no parent inode number stored on dentryB itself,
 thus no way to compare it with dentryB->d_parent->d_inode.
 We set the parent inode number on dentryA (somewhere in d_fsdata).
 But it's lost at the second 'ls -l' because dentryB is used instead
 of dentryA.

 regards,
 wengang.

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] Track negative dentries

2010-06-22 Thread Mark Fasheh
On Tue, Jun 22, 2010 at 11:23:33AM +0800, Wengang Wang wrote:
 Actually I meant two dentries in the two runs of 'ls -l'.
 At the first run, a dentry, dentry A, is created. Because fileA doesn't exist,
 dentry->d_inode is NULL.

Ok so you do a lookup on a name which doesn't exist, which results in a
negative dentry for fileA.

 Then a do_revalidate() is run, ocfs2_dentry_revalidate() returns not
 valid since d_inode is NULL. Thus the dentry A is unhashed from cache by
 d_invalidate().

Part of the patch is to change this behavior right here.

ocfs2_dentry_revalidate() is seeing the NULL inode and forcing the dentry to
be invalidated (thus requiring another lookup).


The patch needs to handle this by comparing sequence numbers with the parent
dentry and ONLY if they're different should it force the dentry to be
invalidated. Otherwise it can allow the dentry to stay hashed (and
negative).
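
Something along these lines is what I have in mind -- a sketch only; the
per-parent generation counter and the field names used here (ip_dir_gen,
d_time as the snapshot) are placeholders, not the final patch:

/* Negative-dentry path of ocfs2_dentry_revalidate(), sketched.
 * "ip_dir_gen" is a hypothetical per-directory generation counter that
 * gets bumped when another node changes the directory; the dentry
 * snapshots it (here in d_time) when the negative dentry is created. */
static int ocfs2_revalidate_negative(struct dentry *dentry)
{
	struct inode *parent = dentry->d_parent->d_inode;

	if (dentry->d_time != OCFS2_I(parent)->ip_dir_gen)
		return 0;	/* parent changed: invalidate, force a lookup */

	return 1;		/* unchanged: keep it hashed and negative */
}
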
--Mark

--
Mark Fasheh

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] Fwd: [PATCH 0/2] Fix aio completion vs unwritten extents

2010-06-22 Thread Sunil Mushran

Tristan,

The reflink test is missing aio bits. Could you please add it?
In short, a pread issued after an aio write must return that
data. This should be tested for both direct and buffered I/Os.

Similarly, fill_holes also needs to be enhanced.

Thanks
Sunil

 Original Message 
Subject:[PATCH 0/2] Fix aio completion vs unwritten extents
Date:   Tue, 22 Jun 2010 08:21:44 -0400
From:   Christoph Hellwig h...@infradead.org
To: 	linux-fsde...@vger.kernel.org, x...@oss.sgi.com, 
linux-e...@vger.kernel.org




Some filesystems (XFS and ext4) have support for a concept called
unwritten extents, where we can write data into holes / preallocated
space and only mark them as allocated when the data I/O has finished.

Because the transaction to convert the extent can't be submitted from
I/O completion, which normally happens from IRQ context, it needs to
be deferred to a workqueue.  This is not a problem for buffered I/O
where we keep the data in cache at least until the I/O operation has
finished, but it is an issue for direct I/O.  XFS avoids that problem
for synchronous direct I/O by waiting for all unwritten extent conversions
to finish if we did one during direct I/O, but so far has ignored the
problem for asynchronous I/O.  Unfortunately the race is very easy
to hit by using QEMU with native AIO support on a sparse image, and
the result is filesystem corruption in the guest.

This contains core direct I/O changes to allow the filesystem to delay
AIO completion, as well as a patch to fix XFS.  ext4 also has the same
issue, and from a quick look also doesn't properly complete unwritten
extent conversions for synchronous direct I/O, but I'll leave that
for someone more familiar with it to figure out.

Below is a minimal reproducer for the issue.  Given that we're dealing
with a race condition it doesn't always fail, but on a 2-core laptop
it triggers 100% reproducibly within 20 runs in a loop.

---

#define _GNU_SOURCE

#include <sys/stat.h>
#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <libaio.h>

#define BUF_SIZE	4096
#define IO_PATTERN	0xab

int main(int argc, char *argv[])
{
	struct io_context *ctx = NULL;
	struct io_event ev;
	struct iocb iocb, *iocbs[] = { &iocb };
	void *buf;
	char cmp_buf[BUF_SIZE];
	int fd, err = 0;

	fd = open(argv[1], O_DIRECT | O_CREAT | O_TRUNC | O_RDWR, 0600);
	if (fd == -1) {
		perror("open");
		return 1;
	}

	err = posix_memalign(&buf, BUF_SIZE, BUF_SIZE);
	if (err) {
		fprintf(stderr, "error %s during %s\n",
			strerror(-err),
			"posix_memalign");
		return 1;
	}
	memset(buf, IO_PATTERN, BUF_SIZE);
	memset(cmp_buf, IO_PATTERN, BUF_SIZE);

	/*
	 * Truncate to some random large file size.  Just make sure
	 * it's not smaller than our I/O size.
	 */
	if (ftruncate(fd, 1024 * 1024 * 1024) < 0) {
		perror("ftruncate");
		return 1;
	}


	/*
	 * Do a simple 4k write into a hole using aio.
	 */
	err = io_setup(1, &ctx);
	if (err) {
		fprintf(stderr, "error %s during %s\n",
			strerror(-err),
			"io_setup");
		return 1;
	}

	io_prep_pwrite(&iocb, fd, buf, BUF_SIZE, 0);

	err = io_submit(ctx, 1, iocbs);
	if (err != 1) {
		fprintf(stderr, "error %s during %s\n",
			strerror(-err),
			"io_submit");
		return 1;
	}

	err = io_getevents(ctx, 1, 1, &ev, NULL);
	if (err != 1) {
		fprintf(stderr, "error %s during %s\n",
			strerror(-err),
			"io_getevents");
		return 1;
	}

	/*
	 * And then read it back.
	 *
	 * Using pread to keep it simple, but AIO has the same effect.
	 */
	if (pread(fd, buf, BUF_SIZE, 0) != BUF_SIZE) {
		perror("pread");
		return 1;
	}

	/*
	 * And depending on the machine we'll just get zeroes back quite
	 * often here.  That's because the unwritten extent conversion
	 * hasn't finished.
	 */
	if (memcmp(buf, cmp_buf, BUF_SIZE)) {
		unsigned long long *ubuf = (unsigned long long *)buf;
		int i;

		for (i = 0; i < BUF_SIZE / sizeof(unsigned long long); i++)
			printf("%d: 0x%llx\n", i, ubuf[i]);

		return 1;
	}

	return 0;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist(bug 9094491) - rev3

2010-06-22 Thread Srinivas Eeda
There are two problems in dlm_run_purgelist

1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge
the same lockres instead of moving on to the next lockres.

2. When a lockres is found unused, dlm_run_purgelist releases the lockres
spinlock before setting DLM_LOCK_RES_DROPPING_REF and then calls
dlm_purge_lockres. The spinlock is reacquired, but in this window the lockres
can get reused. This leads to a BUG.

This patch modifies dlm_run_purgelist to skip a lockres if it's in use and
purge the next one. It also sets DLM_LOCK_RES_DROPPING_REF before releasing
the lockres spinlock, protecting the lockres from getting reused.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlm/dlmthread.c |   76 --
 1 files changed, 33 insertions(+), 43 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
index 11a6d1f..cb74689 100644
--- a/fs/ocfs2/dlm/dlmthread.c
+++ b/fs/ocfs2/dlm/dlmthread.c
@@ -152,45 +152,26 @@ void dlm_lockres_calc_usage(struct dlm_ctxt *dlm,
 	spin_unlock(&dlm->spinlock);
 }
 
-static int dlm_purge_lockres(struct dlm_ctxt *dlm,
+static void dlm_purge_lockres(struct dlm_ctxt *dlm,
 			     struct dlm_lock_resource *res)
 {
 	int master;
 	int ret = 0;
 
-	spin_lock(&res->spinlock);
-	if (!__dlm_lockres_unused(res)) {
-		mlog(0, "%s:%.*s: tried to purge but not unused\n",
-		     dlm->name, res->lockname.len, res->lockname.name);
-		__dlm_print_one_lock_resource(res);
-		spin_unlock(&res->spinlock);
-		BUG();
-	}
-
-	if (res->state & DLM_LOCK_RES_MIGRATING) {
-		mlog(0, "%s:%.*s: Delay dropref as this lockres is "
-		     "being remastered\n", dlm->name, res->lockname.len,
-		     res->lockname.name);
-		/* Re-add the lockres to the end of the purge list */
-		if (!list_empty(&res->purge)) {
-			list_del_init(&res->purge);
-			list_add_tail(&res->purge, &dlm->purge_list);
-		}
-		spin_unlock(&res->spinlock);
-		return 0;
-	}
+	assert_spin_locked(&dlm->spinlock);
+	assert_spin_locked(&res->spinlock);
 
 	master = (res->owner == dlm->node_num);
 
 	if (!master)
 		res->state |= DLM_LOCK_RES_DROPPING_REF;
-	spin_unlock(&res->spinlock);
 
 	mlog(0, "purging lockres %.*s, master = %d\n", res->lockname.len,
 	     res->lockname.name, master);
 
 	if (!master) {
 		/* drop spinlock...  retake below */
+		spin_unlock(&res->spinlock);
 		spin_unlock(&dlm->spinlock);
 
 		spin_lock(&res->spinlock);
@@ -208,31 +189,35 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
 		mlog(0, "%s:%.*s: dlm_deref_lockres returned %d\n",
 		     dlm->name, res->lockname.len, res->lockname.name, ret);
 		spin_lock(&dlm->spinlock);
+		spin_lock(&res->spinlock);
 	}
 
-	spin_lock(&res->spinlock);
 	if (!list_empty(&res->purge)) {
 		mlog(0, "removing lockres %.*s:%p from purgelist, "
 		     "master = %d\n", res->lockname.len, res->lockname.name,
 		     res, master);
 		list_del_init(&res->purge);
-		spin_unlock(&res->spinlock);
 		dlm_lockres_put(res);
 		dlm->purge_count--;
-	} else
-		spin_unlock(&res->spinlock);
+	}
 
-	__dlm_unhash_lockres(res);
+	if (__dlm_lockres_unused(res))
+		__dlm_unhash_lockres(res);
+	else {
+		mlog(ML_ERROR, "found lockres %s:%.*s: in use after deref\n",
+		     dlm->name, res->lockname.len, res->lockname.name);
+		__dlm_print_one_lock_resource(res);
+		BUG();
+	}
 
 	/* lockres is not in the hash now.  drop the flag and wake up
 	 * any processes waiting in dlm_get_lock_resource. */
 	if (!master) {
-		spin_lock(&res->spinlock);
 		res->state &= ~DLM_LOCK_RES_DROPPING_REF;
 		spin_unlock(&res->spinlock);
 		wake_up(&res->wq);
-	}
-	return 0;
+	} else
+		spin_unlock(&res->spinlock);
 }
 
 static void dlm_run_purge_list(struct dlm_ctxt *dlm,
@@ -251,17 +236,7 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm,
 		lockres = list_entry(dlm->purge_list.next,
 				     struct dlm_lock_resource, purge);
 
-		/* Status of the lockres *might* change so double
-		 * check. If the lockres is unused, holding the dlm
-		 * spinlock will prevent people from getting and more
-		 * refs on it -- there's no need to keep the lockres
-		 * spinlock. */
 		spin_lock(&lockres->spinlock);
-		unused = __dlm_lockres_unused(lockres);
-   

Re: [Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist(bug 9094491) - rev3

2010-06-22 Thread Sunil Mushran
Comments inlined. Getting there. Some more cleanups.

On 06/22/2010 10:33 AM, Srinivas Eeda wrote:
 There are two problems in dlm_run_purgelist

 1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge
 the same lockres instead of moving on to the next lockres.

 2. When a lockres is found unused, dlm_run_purgelist releases the lockres
 spinlock before setting DLM_LOCK_RES_DROPPING_REF and then calls
 dlm_purge_lockres. The spinlock is reacquired, but in this window the lockres
 can get reused. This leads to a BUG.

 This patch modifies dlm_run_purgelist to skip a lockres if it's in use and
 purge the next one. It also sets DLM_LOCK_RES_DROPPING_REF before releasing
 the lockres spinlock, protecting the lockres from getting reused.

 Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
 ---
  fs/ocfs2/dlm/dlmthread.c |   76 --
   1 files changed, 33 insertions(+), 43 deletions(-)

 diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
 index 11a6d1f..cb74689 100644
 --- a/fs/ocfs2/dlm/dlmthread.c
 +++ b/fs/ocfs2/dlm/dlmthread.c
 @@ -152,45 +152,26 @@ void dlm_lockres_calc_usage(struct dlm_ctxt *dlm,
  	spin_unlock(&dlm->spinlock);
  }
 
 -static int dlm_purge_lockres(struct dlm_ctxt *dlm,
 +static void dlm_purge_lockres(struct dlm_ctxt *dlm,
  			      struct dlm_lock_resource *res)
  {
  	int master;
  	int ret = 0;
 
 -	spin_lock(&res->spinlock);
 -	if (!__dlm_lockres_unused(res)) {
 -		mlog(0, "%s:%.*s: tried to purge but not unused\n",
 -		     dlm->name, res->lockname.len, res->lockname.name);
 -		__dlm_print_one_lock_resource(res);
 -		spin_unlock(&res->spinlock);
 -		BUG();
 -	}
 -
 -	if (res->state & DLM_LOCK_RES_MIGRATING) {
 -		mlog(0, "%s:%.*s: Delay dropref as this lockres is "
 -		     "being remastered\n", dlm->name, res->lockname.len,
 -		     res->lockname.name);
 -		/* Re-add the lockres to the end of the purge list */
 -		if (!list_empty(&res->purge)) {
 -			list_del_init(&res->purge);
 -			list_add_tail(&res->purge, &dlm->purge_list);
 -		}
 -		spin_unlock(&res->spinlock);
 -		return 0;
 -	}
 +	assert_spin_locked(&dlm->spinlock);
 +	assert_spin_locked(&res->spinlock);
 
  	master = (res->owner == dlm->node_num);
 
  	if (!master)
  		res->state |= DLM_LOCK_RES_DROPPING_REF;
 -	spin_unlock(&res->spinlock);
 
  	mlog(0, "purging lockres %.*s, master = %d\n", res->lockname.len,
  	     res->lockname.name, master);
 
  	if (!master) {
  		/* drop spinlock...  retake below */
 +		spin_unlock(&res->spinlock);
  		spin_unlock(&dlm->spinlock);
 
  		spin_lock(&res->spinlock);


This can do with more cleanup.

+	assert_spin_locked(&dlm->spinlock);
+	assert_spin_locked(&res->spinlock);

	master = (res->owner == dlm->node_num);

-	if (!master)
-		res->state |= DLM_LOCK_RES_DROPPING_REF;
-	spin_unlock(&res->spinlock);

	mlog(0, "purging lockres %.*s, master = %d\n", res->lockname.len,
	     res->lockname.name, master);

	if (!master) {
+		res->state |= DLM_LOCK_RES_DROPPING_REF;
-		/* drop spinlock...  retake below */
+		spin_unlock(&res->spinlock);
		spin_unlock(&dlm->spinlock);

		spin_lock(&res->spinlock);


 @@ -208,31 +189,35 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
  		mlog(0, "%s:%.*s: dlm_deref_lockres returned %d\n",
  		     dlm->name, res->lockname.len, res->lockname.name, ret);
  		spin_lock(&dlm->spinlock);
 +		spin_lock(&res->spinlock);
  	}
 
 -	spin_lock(&res->spinlock);
  	if (!list_empty(&res->purge)) {
  		mlog(0, "removing lockres %.*s:%p from purgelist, "
  		     "master = %d\n", res->lockname.len, res->lockname.name,
  		     res, master);
  		list_del_init(&res->purge);
 -		spin_unlock(&res->spinlock);
  		dlm_lockres_put(res);
  		dlm->purge_count--;
 -	} else
 -		spin_unlock(&res->spinlock);
 +	}
 
 -	__dlm_unhash_lockres(res);
 +	if (__dlm_lockres_unused(res))
 +		__dlm_unhash_lockres(res);
 +	else {
 +		mlog(ML_ERROR, "found lockres %s:%.*s: in use after deref\n",
 +		     dlm->name, res->lockname.len, res->lockname.name);
 +		__dlm_print_one_lock_resource(res);
 +		BUG();
 +	}


Blocks that BUG should be separated if possible. Improves readability.

+	if (!__dlm_lockres_unused(res)) {
+		mlog(ML_ERROR, "found lockres %s:%.*s: in use after deref\n",
+		     dlm->name, res->lockname.len, res->lockname.name);
+		__dlm_print_one_lock_resource(res);
+		BUG();
+	}

	__dlm_unhash_lockres(res);




[Ocfs2-devel] Large (> 16TiB) volumes revisited

2010-06-22 Thread Patrick J. LoPresti
I have just submitted the following bug report:

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1266

This is formally reporting the issue originally identified (and fixed)
by Robert Smith back in December:

http://www.mail-archive.com/ocfs2-devel@oss.oracle.com/msg04728.html

Specifically, even the latest OCFS2 produces an error when you attempt
to mount a volume larger than 16 TiB:

ocfs2_initialize_super:2157 ERROR: Volume might try to write to
blocks beyond what jbd can address in 32 bits.

I would like to use large volumes in production later this year or
early next, so I am interested in seeing this issue resolved so I can
begin testing.  I believe this check in fs/ocfs2/super.c is the only
known issue standing in the way of large volume support for OCFS2.  I
want to submit a patch to fix it.

The simplest approach is just to delete the check, like so:

diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 0eaa929..0ba41f3 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -2215,14 +2215,6 @@ static int ocfs2_initialize_super(struct super_block *sb,
 		goto bail;
 	}
 
-	if (ocfs2_clusters_to_blocks(osb->sb, le32_to_cpu(di->i_clusters) - 1)
-	    > (u32)~0UL) {
-		mlog(ML_ERROR, "Volume might try to write to blocks beyond "
-		     "what jbd can address in 32 bits.\n");
-		status = -EINVAL;
-		goto bail;
-	}
-
 	if (ocfs2_setup_osb_uuid(osb, di->id2.i_super.s_uuid,
 				 sizeof(di->id2.i_super.s_uuid))) {
 		mlog(ML_ERROR, "Out of memory trying to setup our uuid.\n");


Questions for the list:

1) Is this patch sufficient?  Or should I try to modify the check to
take into account the cluster size?  Anything else I need to check
here (e.g. inode64 mount option)?

2) Should mkfs.ocfs2 contain a similar check?  (It may already; I have
not looked yet...)

 - Pat

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] Large (> 16TiB) volumes revisited

2010-06-22 Thread Joel Becker
On Tue, Jun 22, 2010 at 05:11:50PM -0700, Patrick J. LoPresti wrote:
 The simplest approach is just to delete the check, like so:
 
 diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
 index 0eaa929..0ba41f3 100644
 --- a/fs/ocfs2/super.c
 +++ b/fs/ocfs2/super.c
 @@ -2215,14 +2215,6 @@ static int ocfs2_initialize_super(struct super_block *sb,
  		goto bail;
  	}
 
 -	if (ocfs2_clusters_to_blocks(osb->sb, le32_to_cpu(di->i_clusters) - 1)
 -	    > (u32)~0UL) {
 -		mlog(ML_ERROR, "Volume might try to write to blocks beyond "
 -		     "what jbd can address in 32 bits.\n");
 -		status = -EINVAL;
 -		goto bail;
 -	}
 -
  	if (ocfs2_setup_osb_uuid(osb, di->id2.i_super.s_uuid,
  				 sizeof(di->id2.i_super.s_uuid))) {
  		mlog(ML_ERROR, "Out of memory trying to setup our uuid.\n");
 
 
 Questions for the list:
 
 1) Is this patch sufficient?  Or should I try to modify the check to
 take into account the cluster size?  Anything else I need to check
 here (e.g. inode64 mount option)?

This patch is not sufficient.  The journal options need to be
checked, as does the block addressing space of the kernel.

Joel

-- 

Gone to plant a weeping willow
 On the bank's green edge it will roll, roll, roll.
 Sing a lulaby beside the waters.
 Lovers come and go, the river roll, roll, rolls.

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] Large (> 16TiB) volumes revisited

2010-06-22 Thread Patrick J. LoPresti
On Tue, Jun 22, 2010 at 5:49 PM, Joel Becker joel.bec...@oracle.com wrote:
 On Tue, Jun 22, 2010 at 05:11:50PM -0700, Patrick J. LoPresti wrote:

        This patch is not sufficient.  The journal options need to be
 checked, as does the block addressing space of the kernel.

OK.  But I am new to this code base and would appreciate some pointers...

Specifically:

Which journal options?  Are they visible at this point in super.c?

By "block addressing space of the kernel", are we just talking
sizeof(some_kernel_type), or something more involved?

Thanks!

 - Pat

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] Large (> 16TiB) volumes revisited

2010-06-22 Thread Joel Becker
On Tue, Jun 22, 2010 at 06:12:01PM -0700, Patrick J. LoPresti wrote:
 On Tue, Jun 22, 2010 at 5:49 PM, Joel Becker joel.bec...@oracle.com wrote:
  On Tue, Jun 22, 2010 at 05:11:50PM -0700, Patrick J. LoPresti wrote:
 
         This patch is not sufficient.  The journal options need to be
  checked, as does the block addressing space of the kernel.
 
 OK.  But I am new to this code base and would appreciate some pointers...
 
 Specifically:
 
 Which journal options?  Are they visible at this point in super.c?

JBD2_FEATURE_INCOMPAT_64BIT.  You'll need to call
jbd2_journal_check_used_features() to see if the journal has it enabled.
Before you can do that, you have to ensure that the filesystem has the
OCFS2_FEATURE_COMPAT_JBD2_SB feature enabled.

 By block addressing space of the kernel, are we just talking
 sizeof(some_kernel_type) or something more involved?

sector_t.  You can copy what ext4 did here.

	if ((ext4_blocks_count(es) >
	     (sector_t)(~0ULL) >> (sb->s_blocksize_bits - 9)) ||
	    (ext4_blocks_count(es) >
	     (pgoff_t)(~0ULL) >> (PAGE_CACHE_SHIFT - sb->s_blocksize_bits))) {
		ext4_msg(sb, KERN_ERR, "filesystem"
			 " too large to mount safely on this system");

blocks_count is, of course, the calculation we already perform
to turn di->i_clusters into how many blocks we have.
We must always fail if we can't address our disk in sector_t.
Once we've gotten past that check, we then need to figure out if our
journal can address our entire disk.  So if our current test for
> (u32)~0UL returns true, one must then check for the 64BIT feature.
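
Untested, but roughly I'd expect the check to take a shape like the
sketch below.  Treat it as a sketch only: the placement inside
ocfs2_initialize_super(), the osb->journal->j_journal handle being
usable at that point, and the exact messages are assumptions.

	u64 blocks = ocfs2_clusters_to_blocks(osb->sb,
					      le32_to_cpu(di->i_clusters) - 1);

	/* We must always be able to address the device in sector_t. */
	if (blocks > ((sector_t)(~0ULL) >> (osb->sb->s_blocksize_bits - 9))) {
		mlog(ML_ERROR, "Volume too large to mount safely on this "
		     "system.\n");
		status = -EINVAL;
		goto bail;
	}

	/* Past 32 bits of blocks the journal must have the 64bit feature. */
	if (blocks > (u32)~0UL &&
	    !jbd2_journal_check_used_features(osb->journal->j_journal, 0, 0,
					      JBD2_FEATURE_INCOMPAT_64BIT)) {
		mlog(ML_ERROR, "Volume too large for a journal without the "
		     "64bit feature.\n");
		status = -EINVAL;
		goto bail;
	}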

Joel

-- 

You don't make the poor richer by making the rich poorer.
- Sir Winston Churchill

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] Fwd: [PATCH 0/2] Fix aio completion vs unwritten extents

2010-06-22 Thread tristan
Sunil Mushran wrote:
 Tristan,

 The reflink test is missing aio bits. Could you please add it?
 In short, a pread issued after an aio write must return that
 data. This should be tested for both direct and buffered I/Os.

 Similarly, fill_holes also needs to be enhanced.

Got it. Verification for O_DIRECT and buffered I/Os has already been
included in the current tests, I guess.

What's more, maybe we can also verify the data from aio_read() after
pwrite() has completed.

Tristan.


 Thanks
 Sunil


Re: [Ocfs2-devel] [PATCH 0/2] Fix aio completion vs unwritten extents

2010-06-22 Thread Sunil Mushran
On Jun 22, 2010, at 6:50 PM, tristan tristan...@oracle.com wrote:

 Sunil Mushran wrote:
 Tristan,

 The reflink test is missing aio bits. Could you please add it?
 In short, a pread issued after an aio write must return that
 data. This should be tested for both direct and buffered I/Os.

 Similarly, fill_holes also needs to be enhanced.

 Got it. Verification for O_DIRECT and buffered I/Os has already been
 included in the current tests, I guess.

I meant aio+direct and aio+buffered.


 What's more, maybe we can also verify the data from aio_read() after  
 pwrite() completed.

Not that interesting. As in, if that fails, then a pread() would also
fail, and we already test that. The aio_write case is interesting
because of the possible race with pread. An aio_read after a pwrite
cannot race with it.


 Tristan.



Re: [Ocfs2-devel] [PATCH] Track negative dentries

2010-06-22 Thread Wengang Wang
Thanks Mark!

The patch works fine.
I went in the wrong direction.

regards,
wengang.
On 10-06-22 09:53, Mark Fasheh wrote:
 On Tue, Jun 22, 2010 at 11:23:33AM +0800, Wengang Wang wrote:
  Actually I meant two dentries in the two runs of 'ls -l'.
  At the first run, a dentry, dentry A, is created. Because fileA doesn't
  exist, dentry->d_inode is NULL.
 
 Ok so you do a lookup on a name which doesn't exist, which results in a
 negative dentry for fileA.
 
  Then a do_revalidate() is run, ocfs2_dentry_revalidate() returns not
  valid since d_inode is NULL. Thus the dentry A is unhashed from cache by
  d_invalidate().
 
 Part of the patch is to change this behavior right here.
 
 ocfs2_dentry_revalidate() is seeing the NULL inode and forcing the dentry to
 be invalidated (thus requiring another lookup).
 
 
 The patch needs to handle this by comparing sequence numbers with the parent
 dentry and ONLY if they're different should it force the dentry to be
 invalidated. Otherwise it can allow the dentry to stay hashed (and
 negative).
   --Mark
 
 --
 Mark Fasheh

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH 1/1] ocfs2 fix o2dlm dlm run purgelist(bug 9094491) - rev3

2010-06-22 Thread Srinivas Eeda
There are two problems in dlm_run_purgelist

1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge
the same lockres instead of moving on to the next lockres.

2. When a lockres is found unused, dlm_run_purgelist releases the lockres
spinlock before setting DLM_LOCK_RES_DROPPING_REF and then calls
dlm_purge_lockres. The spinlock is reacquired, but in this window the lockres
can get reused. This leads to a BUG.

This patch modifies dlm_run_purgelist to skip a lockres if it's in use and
purge the next one. It also sets DLM_LOCK_RES_DROPPING_REF before releasing
the lockres spinlock, protecting the lockres from getting reused.

Signed-off-by: Srinivas Eeda srinivas.e...@oracle.com
---
 fs/ocfs2/dlm/dlmthread.c |   79 +++--
 1 files changed, 33 insertions(+), 46 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
index 11a6d1f..6822f9a 100644
--- a/fs/ocfs2/dlm/dlmthread.c
+++ b/fs/ocfs2/dlm/dlmthread.c
@@ -152,45 +152,25 @@ void dlm_lockres_calc_usage(struct dlm_ctxt *dlm,
 	spin_unlock(&dlm->spinlock);
 }
 
-static int dlm_purge_lockres(struct dlm_ctxt *dlm,
+static void dlm_purge_lockres(struct dlm_ctxt *dlm,
 			     struct dlm_lock_resource *res)
 {
 	int master;
 	int ret = 0;
 
-	spin_lock(&res->spinlock);
-	if (!__dlm_lockres_unused(res)) {
-		mlog(0, "%s:%.*s: tried to purge but not unused\n",
-		     dlm->name, res->lockname.len, res->lockname.name);
-		__dlm_print_one_lock_resource(res);
-		spin_unlock(&res->spinlock);
-		BUG();
-	}
-
-	if (res->state & DLM_LOCK_RES_MIGRATING) {
-		mlog(0, "%s:%.*s: Delay dropref as this lockres is "
-		     "being remastered\n", dlm->name, res->lockname.len,
-		     res->lockname.name);
-		/* Re-add the lockres to the end of the purge list */
-		if (!list_empty(&res->purge)) {
-			list_del_init(&res->purge);
-			list_add_tail(&res->purge, &dlm->purge_list);
-		}
-		spin_unlock(&res->spinlock);
-		return 0;
-	}
+	assert_spin_locked(&dlm->spinlock);
+	assert_spin_locked(&res->spinlock);
 
 	master = (res->owner == dlm->node_num);
 
-	if (!master)
-		res->state |= DLM_LOCK_RES_DROPPING_REF;
-	spin_unlock(&res->spinlock);
 
 	mlog(0, "purging lockres %.*s, master = %d\n", res->lockname.len,
 	     res->lockname.name, master);
 
 	if (!master) {
+		res->state |= DLM_LOCK_RES_DROPPING_REF;
 		/* drop spinlock...  retake below */
+		spin_unlock(&res->spinlock);
 		spin_unlock(&dlm->spinlock);
 
 		spin_lock(&res->spinlock);
@@ -208,31 +188,35 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
 		mlog(0, "%s:%.*s: dlm_deref_lockres returned %d\n",
 		     dlm->name, res->lockname.len, res->lockname.name, ret);
 		spin_lock(&dlm->spinlock);
+		spin_lock(&res->spinlock);
 	}
 
-	spin_lock(&res->spinlock);
 	if (!list_empty(&res->purge)) {
 		mlog(0, "removing lockres %.*s:%p from purgelist, "
 		     "master = %d\n", res->lockname.len, res->lockname.name,
 		     res, master);
 		list_del_init(&res->purge);
-		spin_unlock(&res->spinlock);
 		dlm_lockres_put(res);
 		dlm->purge_count--;
-	} else
-		spin_unlock(&res->spinlock);
+	}
+
+	if (!__dlm_lockres_unused) {
+		mlog(ML_ERROR, "found lockres %s:%.*s: in use after deref\n",
+		     dlm->name, res->lockname.len, res->lockname.name);
+		__dlm_print_one_lock_resource(res);
+		BUG();
+	}
 
 	__dlm_unhash_lockres(res);
 
 	/* lockres is not in the hash now.  drop the flag and wake up
 	 * any processes waiting in dlm_get_lock_resource. */
 	if (!master) {
-		spin_lock(&res->spinlock);
 		res->state &= ~DLM_LOCK_RES_DROPPING_REF;
 		spin_unlock(&res->spinlock);
 		wake_up(&res->wq);
-	}
-	return 0;
+	} else
+		spin_unlock(&res->spinlock);
 }
 
 static void dlm_run_purge_list(struct dlm_ctxt *dlm,
@@ -251,17 +235,7 @@ static void dlm_run_purge_list(struct dlm_ctxt *dlm,
 		lockres = list_entry(dlm->purge_list.next,
 				     struct dlm_lock_resource, purge);
 
-		/* Status of the lockres *might* change so double
-		 * check. If the lockres is unused, holding the dlm
-		 * spinlock will prevent people from getting and more
-		 * refs on it -- there's no need to keep the lockres
-		 * spinlock. */
 		spin_lock(&lockres->spinlock);
-		unused = __dlm_lockres_unused(lockres);
-