Re: [Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme

2014-10-14 Thread Bob Peterson
- Original Message -
  This patch introduces a new block reservation doubling scheme. If we
  Maybe I sent this patch out prematurely. Instead of doubling the
  reservation, maybe I should experiment with making it grow additively.
  IOW, instead of 32-64-128-256-512, I should use:
  32-64-96-128-160-192-224-etc...
  I know other file systems use doubling schemes, but I'm concerned
  about it being too aggressive.
  I tried an additive reservations algorithm. I basically changed the
  previous patch from doubling the reservation to adding 32 blocks.
  In other words, I replaced:
 
  +   ip->i_rsrv_minblks = 1;
  with this:
  +   ip->i_rsrv_minblks += RGRP_RSRV_MINBLKS;
 
  The results were not as good, but still very impressive, and maybe
  acceptable:
(snip)
 I think you are very much along the right lines. The issue is to ensure
 that all the evidence that is available is taken into account in
 figuring out how large a reservation to make. There are various clues,
 such as the time between writes, the size of the writes, whether the
 file gets closed between writes, whether the writes are contiguous and
 so forth.
 
 Some of those things are taken into account already, however we can
 probably do better. We may be able to also take some hints from things
 like calls to fsync (should we drop reservations that are small at this
 point, since it likely signifies a significant point in the file, if
 fsync is called?) or even detect well known non-linear write patterns,
 e.g. backwards stride patterns or large matrix access patterns (by row
 or column).
 
 The struct file is really the best place to store this context
 information, since if there are multiple writers to the same inode, then
 there is a fair chance that they'll have separate struct files. Does
 this happen in your test workload?
 
 The readahead code can already detect some common read patterns, and it
 also turns itself off if the reads are random. The readahead problem is
 actually very much the same problem in that it tries to estimate which
 reads are coming next based on the context that has been seen already,
 so there may well be some lessons to be learned from that too.
 
 I think it's important to look at the statistics of lots of different
 workloads, and to check them off against your candidate algorithm(s), to
 ensure that the widest range of potential access patterns are taken into
 account.
 
 Steve.
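[Editorial sketch: Steve's list of clues (time between writes, write size, contiguity, and so on) could be tracked as a small per-struct-file context. The names and logic below are invented purely for illustration; nothing here is GFS2 code.]

```c
#include <stdint.h>

/* Purely illustrative: a per-struct-file write context capturing one
 * of the clues listed above (whether writes are contiguous). */
struct wr_context {
	uint64_t last_end;      /* file offset where the last write ended */
	unsigned int seq_count; /* consecutive contiguous writes seen */
};

/* Record a write of `len` bytes at `offset`; return nonzero if it
 * continued exactly where the previous write left off. */
static int wr_record(struct wr_context *wc, uint64_t offset, uint64_t len)
{
	int sequential = (wc->last_end != 0 && offset == wc->last_end);

	wc->seq_count = sequential ? wc->seq_count + 1 : 0;
	wc->last_end = offset + len;
	return sequential;
}
```

A reservation-sizing heuristic could then grow the reservation faster once `seq_count` passes some threshold, and shrink it when sequentiality breaks.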

Hi Steve,

Sorry it's taken me a bit to respond. I've been giving this a lot of thought
and doing a lot of experiments and tests.

I see multiple issues/problems, and my patches have been trying to address
them or solve them separately. You make some very good points here, so
I want to address them individually in the light of my latest findings.
I basically see three main performance problems:

1. Inter-node contention for resource groups. In the past, it was solved
   with try locks that ended with a chaos of block assignments.
   In RHEL7 and up, we eliminated them, but the contention came back and
   performance suffers. I posted a patch for this issue that allows each
   node in the cluster to prefer a unique set of resource groups. It
   reduced inter-node contention greatly and improved performance. It
   was called "GFS2: Set of distributed preferences for rgrps",
   posted on October 8.
2. We need to more accurately predict the size of multi-block reservations.
   This is the issue you talk about here, and so far it's one that I
   haven't addressed yet.
3. We need a way to adjust those predictions if they're found to be
   inadequate. That's the problem I was addressing with the reservation
   doubling scheme or additive reservation scheme.

Issues 2 and 3 might possibly be treated as one issue: we could have a
self-adjusting reservation size system, based on a number of factors,
and I'm in the process of reworking how we do it. I've been doing lots of
experiments and running lots of tests against different workloads. You're
right that #2 is necessary, and I've verified that without it, some
workloads get faster while others get slower (although there's an overall
improvement).

Here are some thoughts:

1. Today, reservations are based on write size, which as you say, is
   not a very good predictor. We can do better.
2. My reservation doubling scheme helps, and reduces fragmentation, but
   we need a more sophisticated scheme.
3. I don't think the time between writes should affect the reservation
   because different applications have different dynamics.
4. The size of the writes is already taken into account. However, the
   way we do it now is kind of bogus. With every write, we adjust the
   size hint. But if the application is doing rewrites, it shouldn't
   matter.
   If it's writing backwards or at random locations, it might matter.
   Last night I experimented with a new scheme that basically only
   adjusts the size hint if block allocations are 
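[Editorial sketch: the two reservation-growth policies compared at the top of the thread (doubling vs. adding RGRP_RSRV_MINBLKS per step) can be written out as below. The cap of 512 blocks and the helper names are assumptions made for this sketch, not GFS2 code.]

```c
/* Illustrative constants; RGRP_RSRV_MINBLKS mirrors the patch, the
 * cap value is an assumption for this example. */
#define RGRP_RSRV_MINBLKS 32
#define RGRP_RSRV_MAXBLKS 512

/* Geometric growth: 32, 64, 128, 256, 512 (capped) */
static unsigned int grow_doubling(unsigned int cur)
{
	cur *= 2;
	return cur > RGRP_RSRV_MAXBLKS ? RGRP_RSRV_MAXBLKS : cur;
}

/* Additive growth: 32, 64, 96, 128, 160, ... (capped) */
static unsigned int grow_additive(unsigned int cur)
{
	cur += RGRP_RSRV_MINBLKS;
	return cur > RGRP_RSRV_MAXBLKS ? RGRP_RSRV_MAXBLKS : cur;
}
```

Doubling reaches the cap in far fewer misses (4 steps from 32 vs. 15), which is why it is the more aggressive of the two.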

[Cluster-devel] [GFS2 PATCH] GFS2: Speed up fiemap function by skipping holes

2014-10-14 Thread Bob Peterson
Hi,

This patch detects the new holesize bit in block_map requests. If a
hole is found during fiemap, it figures out the size of the hole
based on the current metapath information. Since the metapath only
represents a section of the file, it can only extrapolate to a
certain size based on the current metapath buffers. Therefore,
fiemap may call block_map several times to get the hole size.
The hole size is determined by a new function.
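[Editorial sketch of the extrapolation described above: at each metadata height, a run of zero pointers stands for factor * zeroptrs data blocks, where factor multiplies by sd_inptrs per level walked toward the root. The function below is a standalone model of that arithmetic; the fixed-array interface is invented for this example and is not the kernel code in the patch.]

```c
#include <stdint.h>

/* zeroptrs[h] = number of consecutive zero pointers seen at height h,
 * counting h = 0 as the leaf (data-pointer) level; inptrs = pointers
 * per indirect block. Returns the hole size in blocks. */
static uint64_t hole_blocks(const unsigned int *zeroptrs, int height,
			    unsigned int inptrs)
{
	uint64_t factor = 1, blocks = 0;
	int h;

	/* Walk from the leaf level toward the root; each zero pointer
	 * at height h covers inptrs^h data blocks. */
	for (h = 0; h < height; h++) {
		blocks += factor * zeroptrs[h];
		factor *= inptrs;
	}
	return blocks;
}
```

For example, with 4 pointers per indirect block, 3 zero leaf pointers plus 2 zero pointers one level up account for 3 + 2*4 = 11 blocks of hole.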

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com 
---
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index f0b945a..edd7ed6 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -587,6 +587,62 @@ static int gfs2_bmap_alloc(struct inode *inode, const sector_t lblock,
 }
 
 /**
+ * hole_size - figure out the size of a hole
+ * @inode: The inode
+ * @lblock: The logical starting block number
+ * @mp: The metapath
+ *
+ * Returns: The hole size in bytes
+ *
+ */
+static u64 hole_size(struct inode *inode, sector_t lblock,
+		     struct metapath *mp)
+{
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_sbd *sdp = GFS2_SB(inode);
+	unsigned int end_of_metadata = ip->i_height - 1;
+	u64 factor = 1;
+	int hgt = end_of_metadata;
+	u64 holesz = 0, holestep;
+	const __be64 *first, *end, *ptr;
+	const struct buffer_head *bh;
+	u64 isize = i_size_read(inode);
+	int zeroptrs;
+	struct metapath mp_eof;
+
+	/* Get a metapath to the very last byte */
+	find_metapath(sdp, (isize - 1) >> inode->i_blkbits, &mp_eof,
+		      ip->i_height);
+	for (hgt = end_of_metadata; hgt >= 0; hgt--) {
+		bh = mp->mp_bh[hgt];
+		if (bh) {
+			zeroptrs = 0;
+			first = metapointer(hgt, mp);
+			end = (const __be64 *)(bh->b_data + bh->b_size);
+
+			for (ptr = first; ptr < end; ptr++) {
+				if (*ptr)
+					break;
+				else
+					zeroptrs++;
+			}
+		} else {
+			zeroptrs = sdp->sd_inptrs;
+		}
+		holestep = min(factor * zeroptrs,
+			       isize - (lblock + (zeroptrs * holesz)));
+		holesz += holestep;
+		if (lblock + holesz >= isize)
+			return holesz << inode->i_blkbits;
+
+		factor *= sdp->sd_inptrs;
+		if (hgt && (mp->mp_list[hgt - 1] < mp_eof.mp_list[hgt - 1]))
+			(mp->mp_list[hgt - 1])++;
+	}
+	return holesz << inode->i_blkbits;
+}
+
+/**
  * gfs2_block_map - Map a block from an inode to a disk block
  * @inode: The inode
  * @lblock: The logical block number
@@ -645,11 +701,17 @@ int gfs2_block_map(struct inode *inode, sector_t lblock,
 	ret = lookup_metapath(ip, &mp);
 	if (ret < 0)
 		goto out;
-	if (ret != ip->i_height)
+	if (ret != ip->i_height) {
+		if (test_clear_buffer_holesize(bh_map))
+			bh_map->b_size = hole_size(inode, lblock, &mp);
 		goto do_alloc;
+	}
 	ptr = metapointer(ip->i_height - 1, &mp);
-	if (*ptr == 0)
+	if (*ptr == 0) {
+		if (test_clear_buffer_holesize(bh_map))
+			bh_map->b_size = hole_size(inode, lblock, &mp);
 		goto do_alloc;
+	}
 	map_bh(bh_map, inode->i_sb, be64_to_cpu(*ptr));
 	bh = mp.mp_bh[ip->i_height - 1];
 	len = gfs2_extent_length(bh->b_data, bh->b_size, ptr, maxlen, &eob);