Starvation occurs when some process sends large requests to the same SCSI
controller as our journal replay, which sends one-block requests: the
one-block requests starve. RAID resync is one known instance where this
happens. Edward's patch cures that instance.
Hans
Edward Shushkin wrote:
>
> Philippe Gramoulle wrote:
> >
> > Hi,
> >
> > We've set up a test system: a Linux box with a PERC3/QC AMI RAID card
> > (MegaRAID driver) with 3 disk shelves of 12x36GB (RAID5, 2 spare
> > disks on each shelf, 1 terabyte total).
> >
> > First of all, there is an odd message at boot:
> >
> > megaraid: v1.15d (Release Date: Wed May 30 17:30:41 EDT 2001)
> > megaraid: found 0x101e:0x1960:idx 0:bus 2:slot 0:func 0
> > scsi2 : Found a MegaRAID controller at 0xf8902000, IRQ: 20
> > megaraid: [1.57:3.13] detected 2 logical drives
> > scsi2 : AMI MegaRAID 1.57 254 commands 16 targs 4 chans 40 luns
> > Attached scsi disk sdb at scsi2, channel 4, id 0, lun 0
> > Attached scsi disk sdc at scsi2, channel 4, id 0, lun 1
> > SCSI device sdb: 318468096 512-byte hdwr sectors (163056 MB)
> > sdb: sdb1                                        ^^^^^^^^^^^
> > SCSI device sdc: 2059595776 512-byte hdwr sectors (-44998 MB)
> >                                                   ^^^^^^^^^^^
> > Why is sdc reporting "-44998 MB"?
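
The negative size looks like a signed 32-bit overflow in the boot message
itself rather than a problem with the array. As a rough sketch, assuming the
2.4 sd driver derives the printed size from a signed int roughly as below
(the formula is reproduced from memory, not taken from this thread, and the
function names are invented for illustration), both reported values can be
reproduced in user space:

```c
#include <stdint.h>

/* Hypothetical model of the size calculation behind the boot message.
 * Assumption: the driver scales the sector count by hard_sector/256 into a
 * signed 32-bit int before converting it to megabytes. */
static int32_t reported_mb_32(uint32_t capacity, int hard_sector)
{
    /* Multiply in unsigned arithmetic so the wrap-around is well defined,
     * then reinterpret as signed, as a 32-bit int would on common ABIs. */
    uint32_t wide = capacity * (uint32_t)(hard_sector / 256);
    int32_t sz = (int32_t)wide;

    return (sz / 2 - sz / 1250 + 974) / 1950;
}

/* Same formula carried out in 64 bits, which cannot overflow here. */
static int64_t reported_mb_64(uint32_t capacity, int hard_sector)
{
    int64_t sz = (int64_t)capacity * (hard_sector / 256);

    return (sz / 2 - sz / 1250 + 974) / 1950;
}
```

Under that assumption, reported_mb_32(318468096u, 512) gives 163056 and
reported_mb_32(2059595776u, 512) gives -44998, matching the log above, while
reported_mb_64(2059595776u, 512) gives 1054513, i.e. about 1 terabyte.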
> >
> > Nevertheless, fdisk'ing /dev/sdc runs fine.
> > mkreiserfs runs fine as well.
> >
> > Mounting the partition for the first time took 32 minutes.
> > Unmounting and remounting the partition a second time took 1 minute 32s.
> > Unmounting and remounting it once more took 32 minutes again.
> >
> > There were absolutely no operations done in between.
> >
> > What do you think makes the mount take so long?
>
> It looks like you hit the worst case: the system is searching for a valid
> transaction (which happens when the fs has just been created or was not
> cleanly unmounted) and reads all the journal blocks while the raid5 resync
> is generating a large number of IO requests. If so, there cannot be more
> than one journal request in the queue at a time, because of
> wait_on_buffer, and that request cannot be merged with the other journal
> requests.
> You probably want the attached patch against 2.4.7, which reads ahead 32
> journal blocks at a time instead of calling bread() for each block. We
> have tested it a bit, and mount time seems to be reduced.
> Please report your results.
> Thanks,
> Edward.
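
To make the effect of the read ahead concrete, here is a small user-space
sketch of the windowing logic only (no real I/O). It assumes the worst case
described above, where no valid transaction is found and the scan advances
one block at a time. count_read_batches and its parameters are names
invented for this illustration, and the 8192-block journal size used in the
note below is an assumption about the default layout.

```c
/* Count how many separate submit-and-wait cycles a journal scan needs when
 * blocks are fetched `window` at a time. A window of 1 models one bread()
 * per block; a window of 32 models the NBUF read ahead in the patch. */
static unsigned long count_read_batches(unsigned long journal_blocks,
                                        unsigned long window)
{
    unsigned long cur = 0;        /* plays the role of cur_dblock */
    unsigned long first = 0;      /* plays the role of first_read_ahead */
    unsigned long batches = 0;
    int need_read_ahead = 1;

    if (window == 0)
        return 0;                 /* nothing sensible to model */

    while (cur < journal_blocks) {
        if (need_read_ahead) {
            first = cur;          /* a new window starts here */
            batches++;            /* one batch of reads is submitted */
            need_read_ahead = 0;
        }
        cur++;                    /* worst case: no transaction found */
        if (cur - first >= window)
            need_read_ahead = 1;  /* window exhausted, refill next time */
    }
    return batches;
}
```

For an 8192-block journal, count_read_batches(8192, 1) is 8192 and
count_read_batches(8192, 32) is 256, so the replay has far fewer chances to
sit behind the resync traffic with a single small request.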
>
> >
> > Aren't we hitting a 32-bit issue here? Replacing the 36GB disks with 73GB
> > disks would give me an "unable to open /dev/sdc" error when trying to
> > run fdisk.
> >
> > Has anyone already created volumes larger than 1 terabyte?
> >
> > We're currently trying the same tests with ext2 but mke2fs takes a
> > *long* time compared to mkreiserfs :o). I'll give you the results soon.
> >
> > Thanks,
> >
> > Philippe.
>
> --------------------------------------------------------------------------------
> --- linux-2.4.7/fs/reiserfs/journal.c Mon Aug 6 15:29:31 2001
> +++ linux-2.4.7-new/fs/reiserfs/journal.c Tue Aug 14 22:57:06 2001
> @@ -81,6 +81,7 @@
> DECLARE_TASK_QUEUE(reiserfs_commit_thread_tq) ;
>
> #define JOURNAL_TRANS_HALF 1018 /* must be correct to keep the desc and commit structs at 4k */
> +#define NBUF 32 /* read ahead */
>
> /* cnode stat bits. Move these into reiserfs_fs.h */
>
> @@ -1597,7 +1598,9 @@
> int replay_count = 0 ;
> int continue_replay = 1 ;
> int ret ;
> -
> + int need_read_ahead = 1;
> + int first_read_ahead = 0;
> + struct buffer_head * log_blocks[NBUF];
> cur_dblock = reiserfs_get_journal_block(p_s_sb) ;
> printk("reiserfs: checking transaction log (device %s) ...\n",
> kdevname(p_s_sb->s_dev)) ;
> @@ -1653,7 +1656,29 @@
> ** all the valid transactions, and pick out the oldest.
> */
> while(continue_replay && cur_dblock < (reiserfs_get_journal_block(p_s_sb) + JOURNAL_BLOCK_COUNT)) {
> - d_bh = bread(p_s_sb->s_dev, cur_dblock, p_s_sb->s_blocksize) ;
> + if (need_read_ahead) {
> + /* read ahead NBUF buffers */
> + int i;
> + first_read_ahead = cur_dblock;
> + for (i = 0; i < NBUF; i++) {
> + log_blocks [i] = getblk (p_s_sb->s_dev, first_read_ahead + i,
> + p_s_sb->s_blocksize);
> + if (!log_blocks [i]) {
> + brelse_array (log_blocks, i);
> + return -1;
> + }
> + }
> + ll_rw_block (READ, NBUF, log_blocks);
> + for (i = 0; i < NBUF; i++) {
> + wait_on_buffer (log_blocks [i]);
> + if (!buffer_uptodate (log_blocks [i])) {
> + brelse_array (log_blocks, NBUF);
> + return -1;
> + }
> + }
> + need_read_ahead = 0;
> + }
> + d_bh = log_blocks[cur_dblock - first_read_ahead];
> ret = journal_transaction_is_valid(p_s_sb, d_bh, &oldest_invalid_trans_id, &newest_mount_id) ;
> if (ret == 1) {
> desc = (struct reiserfs_journal_desc *)d_bh->b_data ;
> @@ -1680,12 +1705,17 @@
> "newest_mount_id to %d\n", le32_to_cpu(desc->j_mount_id));
> }
> cur_dblock += le32_to_cpu(desc->j_len) + 2 ;
> - }
> - else {
> + } else
> + /* transaction not found */
> cur_dblock++ ;
> +
> + /* checking if we have to read new portion of journal blocks */
> + if (cur_dblock - first_read_ahead >= NBUF) {
> + brelse_array (log_blocks, NBUF);
> + need_read_ahead = 1;
> }
> - brelse(d_bh) ;
> }
> +
> /* step three, starting at the oldest transaction, replay */
> if (last_flush_start > 0) {
> oldest_start = last_flush_start ;