Hi,

While working on a new file systems latency profiler, which we call
FSprof, we noticed that generic_file_llseek competes for i_sem if several
processes access the same file.  For example, we observed that contention
happens up to 25% of the time, even with just two processes randomly
reading the same file.  Even worse, i_sem (as well as other semaphores)
accesses require either locking the whole memory bus or at least purging
the same cache line from all other processors which should hurt
performance on SMP systems.  Fortunately, we reviewed the rest of the VFS
and file systems code and it seems that most file systems don't need to
touch i_sem during llseeks of files (but they do for directories).

Here is the generic_file_llseek used as the ->llseek file operation by
most of the file systems (see fs/read_write.c, Linux 2.6.10, lines 30-53):

loff_t generic_file_llseek(struct file *file, loff_t offset, int origin)
{
        long long retval;
        struct inode *inode = file->f_mapping->host;

        down(&inode->i_sem);
        switch (origin) {
                case 2:
                        offset += inode->i_size;
                        break;
                case 1:
                        offset += file->f_pos;
        }
        retval = -EINVAL;
        if (offset>=0 && offset<=inode->i_sb->s_maxbytes) {
                if (offset != file->f_pos) {
                        file->f_pos = offset;
                        file->f_version = 0;
                }
                retval = offset;
        }
        up(&inode->i_sem);
        return retval;
}

An important observation here is that we need to down the i_sem to
atomically update f_pos and f_version.  However, f_version is not
currently used for file objects by any of the file systems supplied with
the kernel.  f_version is only used by ufs, ntfs, hpfs, ext3, ext2, proc,
and affs for readdir operations of their directories and by fifos and
pipes for poll().

One may think that the i_sem protection is necessary to provide
consistency between i_size reads and f_pos reads/writes.  However, these
are not provided in other places of the kernel.  For example, consider the
code of sys_write (see lines 309-311 of fs/read_write.c):

        loff_t pos = file_pos_read(file);
        ret = vfs_write(file, buf, count, &pos);
        file_pos_write(file, pos);

In the above case, f_pos is read and written (using file_pos_read and
file_pos_write) without any provisions for atomicity; hence, we shouldn't
expect atomicity on f_pos.  Therefore, in the case of SEEK_CUR, we can
remove the down and up on i_sem in generic_file_llseek, which would only
be useful if transaction semantics were enforced on f_pos everywhere else
as well.  Clearly the same applies in the SEEK_SET case.

In the case of SEEK_END, the i_sem protection doesn't add any consistency
between i_size and f_pos either.  Some other process can modify i_size
right after the i_sem is released with the same effect as modifying it
right after i_size is read.  This is an existing, well-known inconsistency
that happens when several processes are appending to a file at the same
time and overwrite each other's data.  Note that we need to read i_size
using the i_size_read function to guarantee that the unaligned 64-bit
i_size is read atomically on 32-bit architectures.

Considering all the above, it seems that there is no need to down/up i_sem
for generic_file_llseek on files.  We can do it in two ways:

1) Add a line like:

        if (S_ISDIR(inode->i_mode))

   before the down/up of the i_sem in the generic_file_llseek.

2) Introduce generic_file_llseek_light function (or, pick a function name
   you deem more appropriate) which doesn't touch the i_sem and f_version.
   This approach is less intrusive: FS developers who like it can use it.

We've benchmarked both options.  Both cases result in very similar
performance improvements.  Thus, while testing the new code for case 2 on
our P4 SMP system, we found that removal of the down/up pair reduces
average generic_file_llseek execution time from about 400 cycles to 120
cycles, a 70% reduction even if there is no contention on i_sem.  Most
importantly, however, neither the memory bus is locked two extra times nor
the inode structure is evicted from the caches of all other CPUs which is
significant for SMP systems.  And, of course, now there is no i_sem
contention if several processes access the same file and therefore share
the inode; contention happens about 25% of the time with the original
lseek, even with just two processes randomly reading the same file opened
with O_DIRECT.

The patch below adds a new function, generic_file_llseek_light, and
substitutes it for generic_file_llseek in file seeks in ext2/ext3.  (Seeks
in directories are left unchanged.)  We have used a kernel patched in this
way for more than a month without problems, even under FS-intensive
benchmarks.  We therefore recommend that this patch be considered (in some
form) in 2.6 and that all other file system developers consider
integrating it into their code as well.

Note that this patch only fixes ext2/3, and is here primarily to
illustrate one possible way to fix this issue; there may be other/better
ways to address this contention.  Either way, if you apply it to other
file systems, then you need to use the "*_light" version of llseek
appropriately for those other file systems.

Cheers,
Nikolai Joukov.
------------------------------------------
PhD student.  Filesystems and Storage Lab.
Stony Brook University (SUNY).
Advisor: Erez Zadok
------------------------------------------

diff -Nrup linux-2.6.10-original/fs/read_write.c 
linux-2.6.10-light/fs/read_write.c
--- linux-2.6.10-original/fs/read_write.c       2004-12-24 16:35:00.000000000 
-0500
+++ linux-2.6.10-light/fs/read_write.c  2005-02-08 20:23:21.000000000 -0500
@@ -27,6 +27,27 @@ struct file_operations generic_ro_fops =

 EXPORT_SYMBOL(generic_ro_fops);

+loff_t generic_file_llseek_light(struct file *file, loff_t offset, int origin)
+{
+       long long retval = -EINVAL;
+       struct inode *inode = file->f_mapping->host;
+
+       switch (origin) {
+               case 2:
+                       offset += i_size_read(inode);
+                       break;
+               case 1:
+                       offset += file->f_pos;
+       }
+       if (offset>=0 && offset<=inode->i_sb->s_maxbytes) {
+               file->f_pos = offset;
+               retval = offset;
+       }
+       return retval;
+}
+
+EXPORT_SYMBOL(generic_file_llseek_light);
+
 loff_t generic_file_llseek(struct file *file, loff_t offset, int origin)
 {
        long long retval;
diff -rup linux-2.6.10-original/include/linux/fs.h 
linux-2.6.10-light/include/linux/fs.h
--- linux-2.6.10-original/include/linux/fs.h    2004-12-24 16:34:27.000000000 
-0500
+++ linux-2.6.10-light/include/linux/fs.h       2005-02-08 20:23:54.000000000 
-0500
@@ -1473,6 +1473,7 @@ extern ssize_t generic_file_readv(struct
 ssize_t generic_file_writev(struct file *filp, const struct iovec *iov,
                        unsigned long nr_segs, loff_t *ppos);
 extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
+extern loff_t generic_file_llseek_light(struct file *file, loff_t offset, int 
origin);
 extern loff_t generic_file_llseek(struct file *file, loff_t offset, int 
origin);
 extern loff_t remote_llseek(struct file *file, loff_t offset, int origin);
 extern int generic_file_open(struct inode * inode, struct file * filp);
diff -rup linux-2.6.10-original/fs/ext2/file.c linux-2.6.10-light/fs/ext2/file.c
--- linux-2.6.10-original/fs/ext2/file.c        2004-12-24 16:34:31.000000000 
-0500
+++ linux-2.6.10-light/fs/ext2/file.c   2005-02-08 20:28:05.000000000 -0500
@@ -40,7 +40,7 @@ static int ext2_release_file (struct ino
  * the ext2 filesystem.
  */
 struct file_operations ext2_file_operations = {
-       .llseek         = generic_file_llseek,
+       .llseek         = generic_file_llseek_light,
        .read           = generic_file_read,
        .write          = generic_file_write,
        .aio_read       = generic_file_aio_read,
diff -rup linux-2.6.10-original/fs/ext3/file.c linux-2.6.10-light/fs/ext3/file.c
--- linux-2.6.10-original/fs/ext3/file.c        2004-12-24 16:35:39.000000000 
-0500
+++ linux-2.6.10-light/fs/ext3/file.c   2005-02-08 20:28:18.000000000 -0500
@@ -116,7 +116,7 @@ force_commit:
 }

 struct file_operations ext3_file_operations = {
-       .llseek         = generic_file_llseek,
+       .llseek         = generic_file_llseek_light,
        .read           = do_sync_read,
        .write          = do_sync_write,
        .aio_read       = generic_file_aio_read,
diff -rup linux-2.6.10-original/fs/read_write.c 
linux-2.6.10-light/fs/read_write.c
--- linux-2.6.10-original/fs/read_write.c       2004-12-24 16:35:00.000000000 
-0500
+++ linux-2.6.10-light/fs/read_write.c  2005-02-08 20:23:21.000000000 -0500
@@ -27,6 +27,27 @@ struct file_operations generic_ro_fops =
 
 EXPORT_SYMBOL(generic_ro_fops);
 
+loff_t generic_file_llseek_light(struct file *file, loff_t offset, int origin)
+{
+       long long retval = -EINVAL;
+       struct inode *inode = file->f_mapping->host;
+
+       switch (origin) {
+               case 2:
+                       offset += i_size_read(inode);
+                       break;
+               case 1:
+                       offset += file->f_pos;
+       }
+       if (offset>=0 && offset<=inode->i_sb->s_maxbytes) {
+               file->f_pos = offset;
+               retval = offset;
+       }
+       return retval;
+}
+
+EXPORT_SYMBOL(generic_file_llseek_light);
+
 loff_t generic_file_llseek(struct file *file, loff_t offset, int origin)
 {
        long long retval;
diff -rup linux-2.6.10-original/include/linux/fs.h 
linux-2.6.10-light/include/linux/fs.h
--- linux-2.6.10-original/include/linux/fs.h    2004-12-24 16:34:27.000000000 
-0500
+++ linux-2.6.10-light/include/linux/fs.h       2005-02-08 20:23:54.000000000 
-0500
@@ -1473,6 +1473,7 @@ extern ssize_t generic_file_readv(struct
 ssize_t generic_file_writev(struct file *filp, const struct iovec *iov, 
                        unsigned long nr_segs, loff_t *ppos);
 extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
+extern loff_t generic_file_llseek_light(struct file *file, loff_t offset, int 
origin);
 extern loff_t generic_file_llseek(struct file *file, loff_t offset, int 
origin);
 extern loff_t remote_llseek(struct file *file, loff_t offset, int origin);
 extern int generic_file_open(struct inode * inode, struct file * filp);
diff -rup linux-2.6.10-original/fs/ext2/file.c linux-2.6.10-light/fs/ext2/file.c
--- linux-2.6.10-original/fs/ext2/file.c        2004-12-24 16:34:31.000000000 
-0500
+++ linux-2.6.10-light/fs/ext2/file.c   2005-02-08 20:28:05.000000000 -0500
@@ -40,7 +40,7 @@ static int ext2_release_file (struct ino
  * the ext2 filesystem.
  */
 struct file_operations ext2_file_operations = {
-       .llseek         = generic_file_llseek,
+       .llseek         = generic_file_llseek_light,
        .read           = generic_file_read,
        .write          = generic_file_write,
        .aio_read       = generic_file_aio_read,
diff -rup linux-2.6.10-original/fs/ext3/file.c linux-2.6.10-light/fs/ext3/file.c
--- linux-2.6.10-original/fs/ext3/file.c        2004-12-24 16:35:39.000000000 
-0500
+++ linux-2.6.10-light/fs/ext3/file.c   2005-02-08 20:28:18.000000000 -0500
@@ -116,7 +116,7 @@ force_commit:
 }
 
 struct file_operations ext3_file_operations = {
-       .llseek         = generic_file_llseek,
+       .llseek         = generic_file_llseek_light,
        .read           = do_sync_read,
        .write          = do_sync_write,
        .aio_read       = generic_file_aio_read,

Reply via email to