The starvation occurs when some other process sends large requests to the
same SCSI controller that our journal replay is reading from. Journal replay
sends one-block requests, and those one-block requests starve. RAID resync
is one known instance where this happens. Edward's patch cures that instance.

Hans

Edward Shushkin wrote:
> 
> Philippe Gramoulle wrote:
> >
> > Hi,
> >
> > We've set up a test system: a Linux box with a PERC3/QC AMI RAID card
> > (MegaRAID driver) and 3 disk shelves of 12 x 36 GB disks (RAID5, 2 spare
> > disks on each shelf, 1 terabyte total).
> >
> > First of all, there is some odd message at boot :
> >
> > megaraid: v1.15d (Release Date: Wed May 30 17:30:41 EDT 2001)
> > megaraid: found 0x101e:0x1960:idx 0:bus 2:slot 0:func 0
> > scsi2 : Found a MegaRAID controller at 0xf8902000, IRQ: 20
> > megaraid: [1.57:3.13] detected 2 logical drives
> > scsi2 : AMI MegaRAID 1.57 254 commands 16 targs 4 chans 40 luns
> > Attached scsi disk sdb at scsi2, channel 4, id 0, lun 0
> > Attached scsi disk sdc at scsi2, channel 4, id 0, lun 1
> > SCSI device sdb: 318468096 512-byte hdwr sectors (163056 MB)
> >   sdb: sdb1                                      ^^^^^^^^^^^
> > SCSI device sdc: 2059595776 512-byte hdwr sectors (-44998 MB)
> >                                                   ^^^^^^^^^^^
> > Why is sdc reporting "-44998 MB"?
> >
> > Nevertheless, fdisk'ing /dev/sdc runs fine.
> > mkreiserfs runs fine as well.
> >
> > Mounting the partition for the first time took 32 minutes.
> > Unmounting and remounting it a second time took 1 minute 32 s.
> > Unmounting and remounting it once more took 32 minutes again.
> >
> > There were absolutely no operations done in between.
> >
> > What do you think takes so much time during the mount?
> 
> It looks like you hit the worst case: the system is searching for a valid
> transaction (which happens when the fs has just been created or was not
> cleanly unmounted) and reads all the journal blocks while the raid5 resync
> is generating a large number of I/O requests. If so, there can never be
> more than one journal request in the queue, because of wait_on_buffer, and
> that request cannot be merged with other journal requests.
> You probably want the attached patch against 2.4.7, which reads ahead 32
> journal blocks instead of calling bread() for each one. We have tested it
> a bit, and mount time seems to be reduced.
> Please report your results.
> Thanks,
> Edward.
> 
> >
> > Aren't we hitting a 32-bit issue here? Replacing the 36 GB disks with
> > 73 GB disks would give me "unable to open /dev/sdc" when trying to run
> > fdisk.
> >
> > Has anyone already created volumes larger than 1 terabyte?
> >
> > We're currently trying the same tests with ext2 but mke2fs takes a
> > *long* time compared to mkreiserfs :o). I'll give you the results soon.
> >
> > Thanks,
> >
> > Philippe.
> 
>   --------------------------------------------------------------------------------
> --- linux-2.4.7/fs/reiserfs/journal.c   Mon Aug  6 15:29:31 2001
> +++ linux-2.4.7-new/fs/reiserfs/journal.c       Tue Aug 14 22:57:06 2001
> @@ -81,6 +81,7 @@
>  DECLARE_TASK_QUEUE(reiserfs_commit_thread_tq) ;
> 
>  #define JOURNAL_TRANS_HALF 1018   /* must be correct to keep the desc and commit structs at 4k */
> +#define NBUF 32 /* read ahead */
> 
>  /* cnode stat bits.  Move these into reiserfs_fs.h */
> 
> @@ -1597,7 +1598,9 @@
>    int replay_count = 0 ;
>    int continue_replay = 1 ;
>    int ret ;
> -
> +  int need_read_ahead = 1;
> +  int first_read_ahead = 0;
> +  struct buffer_head * log_blocks[NBUF];
>    cur_dblock = reiserfs_get_journal_block(p_s_sb) ;
>    printk("reiserfs: checking transaction log (device %s) ...\n",
>            kdevname(p_s_sb->s_dev)) ;
> @@ -1653,7 +1656,29 @@
>    ** all the valid transactions, and pick out the oldest.
>    */
>    while(continue_replay && cur_dblock < (reiserfs_get_journal_block(p_s_sb) + JOURNAL_BLOCK_COUNT)) {
> -    d_bh = bread(p_s_sb->s_dev, cur_dblock, p_s_sb->s_blocksize) ;
> +    if (need_read_ahead) {
> +      /* read ahead NBUF buffers */
> +      int i;
> +      first_read_ahead = cur_dblock;
> +      for (i = 0; i < NBUF; i++) {
> +       log_blocks [i] = getblk (p_s_sb->s_dev, first_read_ahead + i,
> +                                p_s_sb->s_blocksize);
> +       if (!log_blocks [i]) {
> +         brelse_array (log_blocks, i);
> +         return -1;
> +       }
> +      }
> +      ll_rw_block (READ, NBUF, log_blocks);
> +      for (i = 0; i < NBUF; i++) {
> +       wait_on_buffer (log_blocks [i]);
> +       if (!buffer_uptodate (log_blocks [i])) {
> +         brelse_array (log_blocks, NBUF);
> +         return -1;
> +       }
> +      }
> +      need_read_ahead = 0;
> +    }
> +    d_bh = log_blocks[cur_dblock - first_read_ahead];
>      ret = journal_transaction_is_valid(p_s_sb, d_bh, &oldest_invalid_trans_id, &newest_mount_id) ;
>      if (ret == 1) {
>        desc = (struct reiserfs_journal_desc *)d_bh->b_data ;
> @@ -1680,12 +1705,17 @@
>                       "newest_mount_id to %d\n", le32_to_cpu(desc->j_mount_id));
>        }
>        cur_dblock += le32_to_cpu(desc->j_len) + 2 ;
> -    }
> -    else {
> +    } else
> +      /* transaction not found */
>        cur_dblock++ ;
> +
> +    /* checking if we have to read new portion of journal blocks */
> +    if (cur_dblock - first_read_ahead >= NBUF) {
> +      brelse_array (log_blocks, NBUF);
> +      need_read_ahead = 1;
>      }
> -    brelse(d_bh) ;
>    }
> +
>    /* step three, starting at the oldest transaction, replay */
>    if (last_flush_start > 0) {
>      oldest_start = last_flush_start ;
