Re: tiny e2fsprogs fix, bitmaps question

2007-04-18 Thread Theodore Tso
On Wed, Apr 18, 2007 at 12:54:44AM -0600, Andreas Dilger wrote: 

  [ I'm quoting out of order here, and cc'ing the linux-ext4 list with
permission since I think the topics under discussion have a more
general interest.  --Ted]

 Just reading the updated e2fsck.conf.5 man page, and noticed in [scratch 
 files]:
 
 numdirs_threshold:
 s/numbers of directory/number of directories/

Oops, thanks for catching that.  I implemented the in-core memory
reduction patches in somewhat of a hurry because I had a number of
users who had been using BackupPC or other similar hard-link intensive
backup programs, and they were running into massive memory usage
issues.  So this took priority over the extents refactorization work
(which is currently on the top of my e2fsprogs work queue).
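
For reference, the [scratch_files] stanza that man page documents looks
roughly like the following; the directory path and threshold value are
purely illustrative, and e2fsck.conf(5) remains the authoritative list of
options:

	[scratch_files]
		directory = /var/cache/e2fsck
		numdirs_threshold = 50000

With something like that in place, e2fsck keeps using its in-memory
dirinfo/icount structures on filesystems with fewer directories than
numdirs_threshold, and only spills its tracking data into scratch files
under the named directory on larger filesystems.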

 We are also looking to implement something better than raw bitmaps for
 cases where the set bits are expected to be sparse (e.g. block_dup_map,
 inode_bad_map, inode_bb_map, inode_imagic_map), instead of just going
 wholesale to on-disk storage (which is just going to slow things down).

The other type of bitmap implementation which I had been thinking
about asking people to implement is one which works well in the case
where the set bits are expected to be mostly contiguous --- i.e.,
the block allocation map --- so some kind of extent-based data
structure indexed using an in-memory b-tree would be ideal.

Note that block_dup_map, inode_bad_map, inode_bb_map,
inode_imagic_map, et al. are usually not allocated at all, and when
they are allocated they usually contain only a very few set bits, so
a tree-indexed, extent-based data structure should work just fine for
this sort of implementation.  Yes, it's not as efficient as an array
of integers, but it's much more efficient than a full bitmap.
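
To make that concrete, here is a minimal sketch (invented names, not
actual e2fsprogs code) of what a tree-indexed, extent-based bitmap record
and its lookup might look like; a real version would use a balanced tree
such as the red-black tree mentioned below:

	/*
	 * Illustrative only.  Each node records one run of consecutive set
	 * bits, so a mostly-contiguous block allocation map collapses into
	 * a handful of nodes, and a sparse map into one node per set region.
	 */
	struct bmap_extent {
		__u64	start;		/* first set bit in this run */
		__u64	count;		/* number of consecutive set bits */
		struct bmap_extent *left, *right; /* tree links, keyed by start */
	};

	/* Testing a bit is a search for the extent whose range covers it. */
	static int extent_test_bit(struct bmap_extent *node, __u64 bit)
	{
		while (node) {
			if (bit < node->start)
				node = node->left;
			else if (bit >= node->start + node->count)
				node = node->right;
			else
				return 1;	/* inside this run of set bits */
		}
		return 0;
	}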

 The current API doesn't match that of the normal bitmap routines,
 but I think that is desirable.  What do you think?  The other
 thing that is lacking is an efficient and generic bitmap iterator
 instead of just walking [0 ... nbits], because in the sparse case
 the full-range walking can be grossly inefficient.

I agree it's desirable, and I had been planning a bitmap API revision
anyway.  The big changes I had been thinking of for the new interface were:

* 64-bit enabled
* no inline functions
* pluggable back-ends (for multiple implementations: traditional,
  tree-based extents, disk-backed, etc.)
* extent-based set/clear functions (so you can take an extent map
  from an inode and mark the entire extent as allocated in a block
  bitmap)
* backwards ABI compatibility: if the magic number in the first word
  of the bitmap indicates an old-style bitmap, dispatch to the old
  inline-style bitmap operators
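
Purely as an illustration of the pluggable back-end point above (none of
these names exist in e2fsprogs; they are invented for the sketch), the new
interface could be shaped something like this:

	#include <ext2fs/ext2fs.h>

	/* Invented: per-back-end operations, all taking 64-bit arguments. */
	struct ext2fs_bmap64_ops {
		errcode_t (*new_bmap)(ext2_filsys fs, void *priv);
		void	  (*free_bmap)(void *priv);
		int	  (*mark_bmap)(void *priv, __u64 arg);
		int	  (*unmark_bmap)(void *priv, __u64 arg);
		int	  (*test_bmap)(void *priv, __u64 arg);
		/* extent-based set/clear, e.g. mark a whole extent from an
		 * inode's extent map as allocated in a block bitmap */
		void	  (*mark_bmap_extent)(void *priv, __u64 arg, __u64 num);
		void	  (*unmark_bmap_extent)(void *priv, __u64 arg, __u64 num);
	};

	/* Invented: the generic 64-bit bitmap container. */
	struct ext2fs_generic_bmap64 {
		errcode_t	magic;		/* tells old from new bitmaps */
		ext2_filsys	fs;
		__u64		start, end;	/* range covered */
		struct ext2fs_bmap64_ops *ops;	/* traditional, extent-tree,
						 * disk-backed, ... */
		void		*private;	/* back-end specific state */
	};

	/* Invented entry point: dispatch on the magic number in the first
	 * word, so an old-style bitmap still goes through the existing
	 * (32-bit) code paths for ABI compatibility. */
	int ext2fs_test_bmap64(void *bitmap, __u64 arg)
	{
		struct ext2fs_generic_bmap64 *bmap = bitmap;

		if (bmap->magic == EXT2_ET_MAGIC_GENERIC_BITMAP)
			return ext2fs_test_generic_bitmap(bitmap, (blk_t) arg);
		return bmap->ops->test_bmap(bmap->private, arg);
	}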

An iterator makes a lot of sense and I hadn't thought of it, but we
should definitely add it.  It might also be a good idea to add an
extent-based iterator, since that would be even more CPU-efficient
for some callers.
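
To sketch what those iterators might look like (again invented names,
building on the invented ext2fs_generic_bmap64 type above):

	/* Find the next set bit at or after *bit; returns 0 when exhausted.
	 * A sparse back-end can skip whole empty regions instead of testing
	 * every position in [0 ... nbits). */
	extern int ext2fs_find_next_set_bmap64(struct ext2fs_generic_bmap64 *bmap,
					       __u64 *bit);

	/* Extent-based flavour: return the next run of consecutive set bits
	 * as a (start, length) pair, cheaper still for contiguous maps. */
	extern int ext2fs_next_set_extent_bmap64(struct ext2fs_generic_bmap64 *bmap,
						 __u64 *start, __u64 *len);

A caller that today walks [0 ... nbits] testing each bit would then become
something like:

	__u64 start, len;

	while (ext2fs_next_set_extent_bmap64(bmap, &start, &len))
		process_range(start, len);	/* hypothetical consumer */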

 Some things we are targeting with the design:
 - use less RAM for sparsely populated bitmaps
 - be not much worse than bitmaps if they turn out not to be sparse
 - avoid allocating gigantic or many tiny chunks of memory
 - be dynamic in choosing the method for storing set bits

Yep, all good things.  I hadn't really considered the requirement of
dynamically choosing a method, but that's because I figured the
tree-indexed extents data structure would hopefully be general purpose
enough to work with a very large range of filesystems, and dynamism
wasn't something I wanted to try to solve the first time around.

My current thinking favors a design like the io_manager, where one
io_manager (test_io) can provide services while the actual back-end
work is done by another io_manager (i.e., unix_io).  So I could
imagine an "auto" bitmap type which automatically converts bitmap
representations behind the scenes from an in-memory to an on-disk
format, hopefully using a single in-memory format which is generally
applicable to most cases (such as tree-indexed extents); once you go
out to disk, it's all about correctness and completing the task, and
only secondarily about performance.
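
A rough sketch of how that "auto" layering could look (all names and the
threshold are invented, building on the invented structures in the earlier
sketch):

	/* Invented: private state for an io_manager-style stacked back-end
	 * that forwards to an in-memory back-end until it grows too big,
	 * then migrates everything to a disk-backed one. */
	struct auto_bmap_private {
		struct ext2fs_generic_bmap64 *backing;	/* current real back-end */
		__u64 mem_used;				/* bytes used so far */
		__u64 mem_limit;			/* when to spill to disk */
	};

	static void auto_convert_to_diskbacked(struct auto_bmap_private *priv);
						/* hypothetical migration */

	static int auto_mark_bmap(void *priv_arg, __u64 arg)
	{
		struct auto_bmap_private *priv = priv_arg;

		if (priv->mem_used > priv->mem_limit)
			auto_convert_to_diskbacked(priv);

		/* otherwise just forward, exactly as test_io forwards its
		 * requests to unix_io in the io_manager stack */
		return priv->backing->ops->mark_bmap(priv->backing->private, arg);
	}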

Part of my thinking is that while your ebitmap implementation has
desirable properties in terms of scaling from in-memory sparse arrays
to full bitmaps, I suspect a tree-indexed extents implementation has a
wider range of applicability, so I was assuming that we wouldn't have
to get to the dynamic switching part of the program for quite some
time.  (IIRC, all xfs_repair has right now is a simple bit-count
compression scheme, and that seems to have been sufficient for them.)

BTW, if you're interested in implementing an extent-based b-tree,
which will be the next low-hanging fruit in terms of reducing
e2fsprogs's memory usage, note that we already have a red-black tree
implementation in

Re: Performance degradation with FFSB between 2.6.20 and 2.6.21-rc7

2007-04-18 Thread Andrew Morton
 On Wed, 18 Apr 2007 15:54:00 +0200 Valerie Clement [EMAIL PROTECTED] wrote:
 
 Running benchmark tests (FFSB) on an ext4 filesystem, I noticed a 
 performance degradation (about 15-20 percent) in sequential write tests 
 between 2.6.19-rc6 and 2.6.21-rc4 kernels.
 
 I ran the same tests on ext3 and XFS filesystems and I saw the same 
 performance difference between the two kernel versions for these two 
 filesystems.
 
 I have also reproduced it between 2.6.20.7 and 2.6.21-rc7.
 The FFSB tests run 16 threads, each creating 1GB files. The tests were 
 done on the same x86_64 system, with the same kernel configuration and 
 on the same scsi device. Below are the throughput values given by FFSB.
 
    kernel        XFS         ext3
  ------------------------------------
   2.6.20.7     48 MB/sec    44 MB/sec
   2.6.21-rc7   38 MB/sec    37 MB/sec
 
 Did anyone else run across the problem?
 Is there a known issue?
 

That's a new discovery, thanks.

It could be due to I/O scheduler changes.  Which one are you using?  CFQ?

Or it could be that there has been some changed behaviour at the VFS/pagecache
layer: the VFS might be submitting little hunks of lots of files, rather than
large hunks of few files.

Or it could be a block-layer thing: perhaps some driver change has caused
us to be placing less data into the queue.  Which device driver is that machine
using?

Being a simple soul, the first thing I'll try when I get near a test box
will be

for i in $(seq 1 16)
do
	time dd if=/dev/zero of=$i bs=1M count=1024 &
done


Re: Ext3 behavior on power failure

2007-04-18 Thread Bruno Wolff III
On Wed, Mar 28, 2007 at 09:17:27 -0400,
  John Anthony Kazos Jr. [EMAIL PROTECTED] wrote:
  If you fsync() your data, you are guaranteed that your data, too, is
 safely on disk when fsync returns. So what is the question here?
 
 Pardon a newbie's intrusion, but I do know this isn't true. There is a 
 window of possible loss because of the multitude of layers of caching, 
 especially within the drive itself. Unless there is a super_duper_fsync() 
 that is able to actually poll the hardware and get a confirmation that the 
 internal buffers are purged?

That is why you need to disable write caching on the drives, or use cache
flushes via write barriers (if the whole stack of block devices supports
them), if the hardware cache isn't battery backed or the device doesn't
support reporting the status of particular commands.

Of course nothing is perfectly safe.


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-18 Thread Andreas Dilger
On Apr 16, 2007  18:01 +1000, Timothy Shimmin wrote:
 --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger [EMAIL PROTECTED] 
 wrote:
 struct fiemap_extent {
 	__u64 fe_start;		/* starting offset in bytes */
 	__u64 fe_len;		/* length in bytes */
 };
 
 struct fiemap {
 	struct fiemap_extent fm_start;	/* offset, length of desired mapping */
 	__u32 fm_extent_count;		/* number of extents in array */
 	__u32 fm_flags;			/* flags (similar to XFS_IOC_GETBMAP) */
 	__u64 unused;
 	struct fiemap_extent fm_extents[0];
 };
 
 #define FIEMAP_LEN_MASK		0xff
 #define FIEMAP_LEN_HOLE		0x01
 #define FIEMAP_LEN_UNWRITTEN	0x02
 
 All offsets are in bytes to allow for cases where filesystems are not doing
 block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
 returned contains the packed list of allocation extents for the file,
 including entries for holes (which have fe_start == 0, and a flag).
 
 The -fm_extents[] array includes all of the holes in addition to
 allocated extents because this avoids the need to return both the logical
 and physical address for every extent and does not make processing any
 harder.
 
 Well, that's what stood out for me.  I was wondering where the fe_block
 field had gone - the physical address.
 So is your "fe_start /* starting offset */" actually the disk location
 (not a logical file offset), _except_ in the header (struct fiemap) where
 it is the desired logical offset?

Correct.  The fm_extent in the request contains the logical start offset
and length in bytes of the requested fiemap region.  In the returned header
it represents the logical start offset of the extent that contained the
requested start offset, and the logical length of all the returned extents.
I haven't decided whether the returned length should be until EOF, or have
the virtual hole at the end of the file.  I think EOF makes more sense.

The fe_start + fe_len in the fm_extents represent the physical location on
the block device for that extent.  fm_extent[i].fe_start (per Anton) is
undefined if FIEMAP_LEN_HOLE is set, and .fe_len is the length of the hole.

 Okay, looking at your example use below that's what it looks like.
 And when you refer to fm_start below, you mean fm_start.fe_start?
 Sorry, I realise this is just an approximation but this part confused me.

Right, I'll write up a new RFC based on feedback here, and correcting the
various errors in the original proposal.

 So you get rid of all the logical file offsets in the extents because we
 report holes explicitly (and we know everything is contiguous if you
 include the holes).

Correct.  It saves space in the common case.

 Caller works something like:
 
	char buf[4096];
	struct fiemap *fm = (struct fiemap *)buf;
	int count = (sizeof(buf) - sizeof(*fm)) / sizeof(struct fiemap_extent);

	fm->fm_start.fe_start = 0;	/* start of file */
	fm->fm_start.fe_len = -1;	/* end of file */
	fm->fm_extent_count = count;	/* max extents in fm_extents[] array */
	fm->fm_flags = 0;		/* maybe no DMAPI, etc like XFS */

	fd = open(path, O_RDONLY);
	printf("logical\t\tphysical\t\tbytes\n");

	/* The last entry will have fewer extents than the maximum */
	while (fm->fm_extent_count == count) {
		rc = ioctl(fd, FIEMAP, fm);
		if (rc)
			break;

		/* kernel filled in fm_extents[] array, set fm_extent_count
		 * to be actual number of extents returned, leaves
		 * fm_start.fe_start alone (unlike XFS_IOC_GETBMAP). */

		for (i = 0; i < fm->fm_extent_count; i++) {
			__u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
			__u64 fm_next = fm->fm_start.fe_start + len;
			int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
			int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;

			printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
			       fm->fm_start.fe_start, fm_next - 1,
			       hole ? 0 : fm->fm_extents[i].fe_start,
			       hole ? 0 : fm->fm_extents[i].fe_start + len - 1,
			       len, hole ? " (hole)" : "",
			       unwr ? " (unwritten)" : "");

			/* get ready for printing next extent, or next ioctl */
			fm->fm_start.fe_start = fm_next;
		}
	}
 

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-18 Thread Andreas Dilger
On Apr 16, 2007  21:22 +1000, David Chinner wrote:
 On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
  struct fiemap_extent {
  	__u64 fe_start;		/* starting offset in bytes */
  	__u64 fe_len;		/* length in bytes */
  };
  
  struct fiemap {
  	struct fiemap_extent fm_start;	/* offset, length of desired mapping */
  	__u32 fm_extent_count;		/* number of extents in array */
  	__u32 fm_flags;			/* flags (similar to XFS_IOC_GETBMAP) */
  	__u64 unused;
  	struct fiemap_extent fm_extents[0];
  };
  
  #define FIEMAP_LEN_MASK 0xff
  #define FIEMAP_LEN_HOLE 0x01
  #define FIEMAP_LEN_UNWRITTEN0x02
 
 I'm not sure I like stealing bits from the length to use as flags -
 I'd prefer an explicit field per fiemap_extent for this.

Christoph expressed the same concern.  I'm not dead set against having an
extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may
mean the need for 50% more ioctls if the file is large.


Below is an aggregation of the comments in this thread:

struct fiemap_extent {
	__u64 fe_start;		/* starting offset in bytes */
	__u64 fe_len;		/* length in bytes */
	__u32 fe_flags;		/* FIEMAP_EXTENT_* flags for this extent */
	__u32 fe_lun;		/* logical storage device number in array */
};

struct fiemap {
	__u64 fm_start;		/* logical start offset of mapping (in/out) */
	__u64 fm_len;		/* logical length of mapping (in/out) */
	__u32 fm_flags;		/* FIEMAP_FLAG_* flags for request (in/out) */
	__u32 fm_extent_count;	/* number of extents in fm_extents (in/out) */
	__u64 fm_unused;
	struct fiemap_extent fm_extents[0];
};

/* flags for the fiemap request */
#define FIEMAP_FLAG_SYNC0x0001  /* flush delalloc data to disk*/
#define FIEMAP_FLAG_HSM_READ0x0002  /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT0xff00  /* must understand these flags*/

/* flags for the returned extents */
#define FIEMAP_EXTENT_HOLE  0x0001  /* no space allocated */
#define FIEMAP_EXTENT_UNWRITTEN 0x0002  /* uninitialized space */
#define FIEMAP_EXTENT_UNKNOWN   0x0004  /* in use, location unknown */
#define FIEMAP_EXTENT_ERROR 0x0008  /* error mapping space */
#define FIEMAP_EXTENT_NO_DIRECT 0x0010  /* no direct data access */
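
A caller against this revised interface might look roughly like the
following; the FIEMAP ioctl number does not exist yet, error handling and
the loop that advances fm_start for files with more than 32 extents are
elided, so this is illustrative only:

	int fd = open(path, O_RDONLY);
	unsigned int i;
	struct fiemap *fm = calloc(1, sizeof(*fm) +
				   32 * sizeof(struct fiemap_extent));

	fm->fm_start = 0;			/* map from offset 0 ... */
	fm->fm_len = ~0ULL;			/* ... through to EOF */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc, avoid UNKNOWN */
	fm->fm_extent_count = 32;		/* room for 32 extents */

	if (ioctl(fd, FIEMAP, fm) == 0) {
		/* on return, fm_extent_count holds the number filled in */
		for (i = 0; i < fm->fm_extent_count; i++) {
			struct fiemap_extent *fe = &fm->fm_extents[i];

			if (fe->fe_flags & FIEMAP_EXTENT_HOLE)
				printf("hole\t%llu bytes\n", fe->fe_len);
			else
				printf("lun %u\t%llu..%llu%s\n", fe->fe_lun,
				       fe->fe_start,
				       fe->fe_start + fe->fe_len - 1,
				       (fe->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ?
						" (unwritten)" : "");
		}
	}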



SUMMARY OF CHANGES
==================
- use fm_* fields directly in the request instead of making it a fiemap_extent
  (though they are laid out identically)

- separate flags word for fm_flags:
  - FIEMAP_FLAG_SYNC = range should be synced to disk before returning
mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise
  - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified
(this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag)
  - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future
if there is agreement on whether that is desirable to have or if it is
better to call ioctl(FIEMAP) on an XATTR fd.
  - FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel
must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we
don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it

- __u64 fm_unused does not take up extra space for power-of-two buffer
  sizes (it would otherwise sit at the end of the buffer), and may be handy
  in the future.

- add separate fe_flags word with flags from various suggestions:
  - FIEMAP_EXTENT_HOLE = extent has no space allocation
  - FIEMAP_EXTENT_UNWRITTEN = extent has space allocated but contains no data
  - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown
(e.g. HSM, delalloc awaiting sync, etc)
  - FIEMAP_EXTENT_ERROR = error mapping extent.  Should fe_lun == errno?
  - FIEMAP_EXTENT_NO_DIRECT = data cannot be directly accessed (e.g. data
encrypted, compressed, etc), may want separate flags for these?

- add new fe_lun word per extent for filesystems that manage multiple devices
  (e.g. OCFS, GFS, ZFS, Lustre).  This would otherwise have been unused.


 Given that xfs_bmap uses extra information from the filesystem
 (geometry) to display extra (and frequently used) information
 about the alignment of extents. ie:
 
 chook 681% xfs_bmap -vv fred
 fred:
   EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET          TOTAL FLAGS
     0: [0..151]:        288444888..288445039    8 (1696536..1696687)   152 00010
  FLAG Values:
     010000 Unwritten preallocated extent
     001000 Doesn't begin on stripe unit
     000100 Doesn't end   on stripe unit
     000010 Doesn't begin on stripe width
     000001 Doesn't end   on stripe width

Can you clarify the terminology here?  What is a stripe unit and what is
a stripe width?  Are there N * stripe_unit = stripe_width in e.g. a
RAID