Re: The SX4 challenge
Jeff Garzik wrote:
> ...
> Thus, the "SX4 challenge" is a challenge to developers to figure out the
> most optimal configuration for this hardware, given the existing MD and
> DM work going on.
> ...

This sort of RAID optimization hardware is not unique to the SX4, so hopefully we can work out a way to take advantage of similar/different RAID throughput features of other chipsets too (eventually).

This could be a good topic for discussion/beer in San Jose next month.

Cheers
The SX4 challenge
Promise just gave permission to post the docs for their PDC20621 (i.e. SX4) hardware:

http://gkernel.sourceforge.net/specs/promise/pdc20621-pguide-1.2.pdf.bz2

joining the existing PDC20621 DIMM and PLL docs:

http://gkernel.sourceforge.net/specs/promise/pdc20621-pguide-dimm-1.6.pdf.bz2
http://gkernel.sourceforge.net/specs/promise/pdc20621-pguide-pll-ata-timing-1.2.pdf.bz2

So, the SX4 is now open. Yay :)

I am hoping to talk Mikael into becoming the sata_sx4 maintainer, and finally integrating my 'new-eh' conversion in libata-dev.git. But now is a good time to remind people how lame the sata_sx4 driver software really is -- and I should know, I wrote it.

The SX4 hardware, simplified, is three pieces: XOR engine (for raid5), host<->board memcpy engine, and several ATA engines (plus some helpful transaction sequencing features). Data for each WRITE command is first copied to the board RAM, then the ATA engines DMA to/from the board RAM. Data for each READ command is copied to board RAM via the ATA engines, then DMA'd across PCI to your host memory.

Therefore, while it is not hardware RAID, the SX4 provides all the pieces necessary to offload RAID1 and RAID5, and to handle other RAID levels optimally. RAID1 and RAID5 copies can be offloaded (provided all copies go to SX4-attached devices, of course). RAID5 XOR generation and checking can be offloaded, allowing the OS to see a single request while the hardware processes a sequence of low-level requests sent in a batch.

This hardware presents an interesting challenge: it does not really fit into the software RAID (i.e. no RAID) /or/ hardware RAID categories. The sata_sx4 driver presents the no-RAID configuration, which is terribly inefficient:

WRITE:
	submit host DMA (copy to board)
	host DMA completion via interrupt
	submit ATA command
	ATA command completion via interrupt

READ:
	submit ATA command
	ATA command completion via interrupt
	submit host DMA (copy from board)
	host DMA completion via interrupt

Thus, the "SX4 challenge" is a challenge to developers to figure out the most optimal configuration for this hardware, given the existing MD and DM work going on.

Now, it must be noted that the SX4 is not current-gen technology. Most vendors have moved towards an "IOP" model, where the hw vendor puts most of their hard work into an ARM/MIPS firmware, running on an embedded chip specially tuned for storage purposes. (ref the "hptiop" and "stex" drivers, both very small SCSI drivers)

I know Dan Williams @ Intel is working on very similar issues on the IOP -- async memcpy, XOR offload, etc. -- and I am hoping that, due to that current work, some of the good ideas can be reused with the SX4.

Anyway... it's open, it's interesting, even if it's not current-gen tech anymore. You can probably find the cards on eBay or in an out-of-the-way computer shop somewhere.

Jeff
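(A quick way to fetch and unpack the three documents listed above; a minimal sketch assuming wget and bzip2 are installed:)

for url in \
  http://gkernel.sourceforge.net/specs/promise/pdc20621-pguide-1.2.pdf.bz2 \
  http://gkernel.sourceforge.net/specs/promise/pdc20621-pguide-dimm-1.6.pdf.bz2 \
  http://gkernel.sourceforge.net/specs/promise/pdc20621-pguide-pll-ata-timing-1.2.pdf.bz2
do
  wget "$url"                     # fetch the compressed PDF
  bunzip2 "$(basename "$url")"    # replaces foo.pdf.bz2 with foo.pdf
done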
Re: How do I get rid of old device?
On Thu, 17 Jan 2008, Neil Brown wrote:

> On Wednesday January 16, [EMAIL PROTECTED] wrote:
> > p34:~# mdadm /dev/md3 --zero-superblock
> > p34:~# mdadm --examine --scan
> > ARRAY /dev/md0 level=raid1 num-devices=2 UUID=f463057c:9a696419:3bcb794a:7aaa12b2
> > ARRAY /dev/md1 level=raid1 num-devices=2 UUID=98e4948c:c6685f82:e082fd95:e7f45529
> > ARRAY /dev/md2 level=raid1 num-devices=2 UUID=330c9879:73af7d3e:57f4c139:f9191788
> > ARRAY /dev/md3 level=raid0 num-devices=10 UUID=6dc12c36:b3517ff9:083fb634:68e9eb49
> > p34:~#
> >
> > I cannot seem to get rid of /dev/md3; it's almost as if there is a piece
> > of it on the root (2) disks or a reference to it?
> >
> > I also dd'd the other 10 disks (non-root) and /dev/md3 persists.
>
> You don't zero the superblock on the array device, because the array
> device does not have a superblock. The component devices have the
> superblock.
>
> So
>   mdadm --zero-superblock /dev/sd*
> or whatever. Maybe
>   mdadm --examine --scan -v
> then get the list of devices it found for the array you want to kill, and
> --zero-superblock that list.
>
> NeilBrown

Thanks, will keep this in mind for the future -- I just checked and the dd's have finished and there is no longer a /dev/md3, but mdadm --zero-superblock /dev/sd[c-l] would have been much easier.

Justin.
Re: How do I get rid of old device?
On Wednesday January 16, [EMAIL PROTECTED] wrote:
> p34:~# mdadm /dev/md3 --zero-superblock
> p34:~# mdadm --examine --scan
> ARRAY /dev/md0 level=raid1 num-devices=2 UUID=f463057c:9a696419:3bcb794a:7aaa12b2
> ARRAY /dev/md1 level=raid1 num-devices=2 UUID=98e4948c:c6685f82:e082fd95:e7f45529
> ARRAY /dev/md2 level=raid1 num-devices=2 UUID=330c9879:73af7d3e:57f4c139:f9191788
> ARRAY /dev/md3 level=raid0 num-devices=10 UUID=6dc12c36:b3517ff9:083fb634:68e9eb49
> p34:~#
>
> I cannot seem to get rid of /dev/md3; it's almost as if there is a piece
> of it on the root (2) disks or a reference to it?
>
> I also dd'd the other 10 disks (non-root) and /dev/md3 persists.

You don't zero the superblock on the array device, because the array device does not have a superblock. The component devices have the superblock.

So
  mdadm --zero-superblock /dev/sd*
or whatever. Maybe
  mdadm --examine --scan -v
then get the list of devices it found for the array you want to kill, and --zero-superblock that list.

NeilBrown
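(A minimal end-to-end sketch of the suggestion above, assuming the array is /dev/md3 and its components are /dev/sdc1 through /dev/sdl1 as in the --create commands elsewhere in this thread; substitute whatever devices --examine actually reports:)

mdadm --stop /dev/md3                  # make sure the array itself is not assembled
mdadm --examine --scan -v              # verbose scan lists each array's component devices
mdadm --zero-superblock /dev/sd[c-l]1  # wipe the md superblock on every component
mdadm --examine --scan                 # no ARRAY line for md3 should remain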
Re: How do I get rid of old device?
On Wed, 16 Jan 2008, Justin Piszcz wrote:

> p34:~# mdadm /dev/md3 --zero-superblock
> p34:~# mdadm --examine --scan
> ARRAY /dev/md0 level=raid1 num-devices=2 UUID=f463057c:9a696419:3bcb794a:7aaa12b2
> ARRAY /dev/md1 level=raid1 num-devices=2 UUID=98e4948c:c6685f82:e082fd95:e7f45529
> ARRAY /dev/md2 level=raid1 num-devices=2 UUID=330c9879:73af7d3e:57f4c139:f9191788
> ARRAY /dev/md3 level=raid0 num-devices=10 UUID=6dc12c36:b3517ff9:083fb634:68e9eb49
> p34:~#
>
> I cannot seem to get rid of /dev/md3; it's almost as if there is a piece
> of it on the root (2) disks or a reference to it?
>
> I also dd'd the other 10 disks (non-root) and /dev/md3 persists.

Hopefully this will clear it out:

p34:~# for i in /dev/sd[c-l]; do /usr/bin/time dd if=/dev/zero of=$i bs=1M & done
[1] 4625
[2] 4626
[3] 4627
[4] 4628
[5] 4629
[6] 4630
[7] 4631
[8] 4632
[9] 4633
[10] 4634
p34:~#

Good aggregate bandwidth at least, writing to all 10 disks:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free    buff  cache  si  so  bi     bo    in   cs  us sy id wa
 1  9      0  46472 7201008 734240   0   0   0 658756  2339 2242   0 22 24 54
 3 10      0  44132 7204680 732920   0   0   0 660040  2335 2276   0 22 19 59
 5  8      0  48196 7201840 737360   0   0   0 652708  2403 1645   0 23 11 66
 2  9      0  45728 7205036 726280   0   0   0 659844  2296 1891   0 23 11 66
 0 11      0  47672 7202992 725640   0   0   0 672856  2327 1616   0 22  7 71
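(A small follow-up sketch, assuming the same shell session is still open: wait for the ten background dd jobs, then confirm the array really is gone:)

wait                     # blocks until all ten background dd jobs have exited
cat /proc/mdstat         # md3 should no longer appear
mdadm --examine --scan   # double-check that no ARRAY line for md3 remains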
How do I get rid of old device?
p34:~# mdadm /dev/md3 --zero-superblock
p34:~# mdadm --examine --scan
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=f463057c:9a696419:3bcb794a:7aaa12b2
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=98e4948c:c6685f82:e082fd95:e7f45529
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=330c9879:73af7d3e:57f4c139:f9191788
ARRAY /dev/md3 level=raid0 num-devices=10 UUID=6dc12c36:b3517ff9:083fb634:68e9eb49
p34:~#

I cannot seem to get rid of /dev/md3; it's almost as if there is a piece of it on the root (2) disks or a reference to it?

I also dd'd the other 10 disks (non-root) and /dev/md3 persists.
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Thu, 17 Jan 2008, Al Boldi wrote:

> Justin Piszcz wrote:
> > On Wed, 16 Jan 2008, Al Boldi wrote:
> > > Also, can you retest using dd with different block-sizes?
> >
> > I can do this, moment..
> >
> > I know about oflag=direct but I choose to use dd with sync and measure
> > the total time it takes.
> >
> > /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero
> > of=/r1/bigfile bs=1M count=10240; sync'
> >
> > So I was asked on the mailing list to test dd with various chunk sizes;
> > here is the length of time it took to write 10 GiB and sync for each
> > chunk size:
> >
> > 4=chunk.txt:0:25.46
> > 8=chunk.txt:0:25.63
> > 16=chunk.txt:0:25.26
> > 32=chunk.txt:0:25.08
> > 64=chunk.txt:0:25.55
> > 128=chunk.txt:0:25.26
> > 256=chunk.txt:0:24.72
> > 512=chunk.txt:0:24.71
> > 1024=chunk.txt:0:25.40
> > 2048=chunk.txt:0:25.71
> > 4096=chunk.txt:0:27.18
> > 8192=chunk.txt:0:29.00
> > 16384=chunk.txt:0:31.43
> > 32768=chunk.txt:0:50.11
> > 65536=chunk.txt:2:20.80
>
> What do you get with bs=512,1k,2k,4k,8k,16k...
>
> Thanks!
>
> --
> Al

Done testing for now, but I did test bs=256k with a 256 KiB chunk, and obviously that got good results, just like bs=1M with a 1 MiB chunk: 460-480 MiB/s.

Justin.
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Thu, 17 Jan 2008, Al Boldi wrote:

> Justin Piszcz wrote:
> > On Wed, 16 Jan 2008, Al Boldi wrote:
> > > Also, can you retest using dd with different block-sizes?
> >
> > I can do this, moment..
> >
> > I know about oflag=direct but I choose to use dd with sync and measure
> > the total time it takes.
> >
> > /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero
> > of=/r1/bigfile bs=1M count=10240; sync'
> >
> > So I was asked on the mailing list to test dd with various chunk sizes;
> > here is the length of time it took to write 10 GiB and sync for each
> > chunk size:
> >
> > 4=chunk.txt:0:25.46
> > 8=chunk.txt:0:25.63
> > 16=chunk.txt:0:25.26
> > 32=chunk.txt:0:25.08
> > 64=chunk.txt:0:25.55
> > 128=chunk.txt:0:25.26
> > 256=chunk.txt:0:24.72
> > 512=chunk.txt:0:24.71
> > 1024=chunk.txt:0:25.40
> > 2048=chunk.txt:0:25.71
> > 4096=chunk.txt:0:27.18
> > 8192=chunk.txt:0:29.00
> > 16384=chunk.txt:0:31.43
> > 32768=chunk.txt:0:50.11
> > 65536=chunk.txt:2:20.80
>
> What do you get with bs=512,1k,2k,4k,8k,16k...
>
> Thanks!
>
> --
> Al

root      4621  0.0  0.0  12404   760 pts/2    D+   17:53   0:00 mdadm -S /dev/md3
root      4664  0.0  0.0   4264   728 pts/5    S+   17:54   0:00 grep D

Tried to stop it while it was re-syncing, DEADLOCK :(

[  305.464904] md: md3 still in use.
[  314.595281] md: md_do_sync() got signal ... exiting

Anyhow, done testing; time to move the data back on if I can kill the resync process without deadlocking.

Justin.
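(One way to avoid wedging mdadm -S against an in-progress resync is to ask md to pause the sync first via its sysfs interface; a hedged sketch -- whether this sidesteps the particular deadlock above is not certain:)

cat /sys/block/md3/md/sync_action           # reports e.g. "resync" while the sync runs
echo idle > /sys/block/md3/md/sync_action   # request that the resync stop
mdadm -S /dev/md3                           # then try stopping the array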
Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
On Tuesday January 15, [EMAIL PROTECTED] wrote:
> On Wed, 16 Jan 2008 00:09:31 -0700 "Dan Williams" <[EMAIL PROTECTED]> wrote:
> > > heheh.
> > >
> > > it's really easy to reproduce the hang without the patch -- i could
> > > hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
> > > i'll try with ext3... Dan's experiences suggest it won't happen with
> > > ext3 (or is even more rare), which would explain why this is overall
> > > a rare problem.
> >
> > Hmmm... how rare?
> >
> > http://marc.info/?l=linux-kernel&m=119461747005776&w=2
> >
> > There is nothing specific that prevents other filesystems from hitting
> > it; perhaps XFS is just better at submitting large i/o's. -stable
> > should get some kind of treatment. I'll take altered performance over
> > a hung system.
>
> We can always target 2.6.25-rc1 then 2.6.24.1 if Neil is still feeling
> wimpy.

I am feeling wimpy. There have been a few too many raid5 breakages recently, and it is very hard to really judge the performance impact of this change. I even have a small uncertainty about correctness - could it still hang in some other way? I don't think so, but this is complex code...

If it were really common I would have expected more noise on the mailing list. Sure, there has been some, but not much. Then again, maybe people are searching the archives, finding the "increase stripe cache size" trick, and not reporting anything... though that seems unlikely.

How about we queue it for 2.6.25-rc1, and then about when -rc2 comes out, we queue it for 2.6.24.y? Anyone (or any distro) that really needs it can of course grab the patch themselves... ??

NeilBrown
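(For reference, the "increase stripe cache size" workaround mentioned above is presumably the per-array sysfs knob that also appears in the benchmark threads below; the value is in pages per device:)

cat /sys/block/md3/md/stripe_cache_size            # default is typically 256
echo 16384 > /sys/block/md3/md/stripe_cache_size   # enlarge the raid5 stripe cache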
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
Justin Piszcz wrote:
> On Wed, 16 Jan 2008, Al Boldi wrote:
> > Also, can you retest using dd with different block-sizes?
>
> I can do this, moment..
>
> I know about oflag=direct but I choose to use dd with sync and measure
> the total time it takes.
>
> /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero
> of=/r1/bigfile bs=1M count=10240; sync'
>
> So I was asked on the mailing list to test dd with various chunk sizes;
> here is the length of time it took to write 10 GiB and sync for each
> chunk size:
>
> 4=chunk.txt:0:25.46
> 8=chunk.txt:0:25.63
> 16=chunk.txt:0:25.26
> 32=chunk.txt:0:25.08
> 64=chunk.txt:0:25.55
> 128=chunk.txt:0:25.26
> 256=chunk.txt:0:24.72
> 512=chunk.txt:0:24.71
> 1024=chunk.txt:0:25.40
> 2048=chunk.txt:0:25.71
> 4096=chunk.txt:0:27.18
> 8192=chunk.txt:0:29.00
> 16384=chunk.txt:0:31.43
> 32768=chunk.txt:0:50.11
> 65536=chunk.txt:2:20.80

What do you get with bs=512,1k,2k,4k,8k,16k...

Thanks!

--
Al
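(A minimal sketch of the requested block-size sweep, keeping the total write at 10 GiB for every block size; assumes the array is mounted at /r1 as in Justin's runs:)

for bs in 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576
do
  count=$((10737418240 / bs))   # 10 GiB total, regardless of block size
  /usr/bin/time -f %E -o ~/$bs=bs.txt bash -c \
    "dd if=/dev/zero of=/r1/bigfile bs=$bs count=$count; sync"
done
grep : ~/*=bs.txt | sort -n     # collect the timings, smallest block size first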
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Wed, 16 Jan 2008, Greg Cormier wrote:

> What sort of tools are you using to get these benchmarks, and can I use
> them for ext3? Very interested in running this on my server.
>
> Thanks,
> Greg

You can use whatever suits you, such as untarring a kernel source tree, copying files, or untarring backups; you should benchmark specifically what *your* workload is.

Here is the skeleton, using bash (don't forget to turn off the cron daemon):

for i in 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536
do
  cd /
  umount /r1
  mdadm -S /dev/md3
  mdadm --create --assume-clean --verbose /dev/md3 --level=5 \
    --raid-devices=10 --chunk=$i --run /dev/sd[c-l]1
  /etc/init.d/oraid.sh   # to optimize my raid stuff
  mkfs.xfs -f /dev/md3
  mount /dev/md3 /r1 -o logbufs=8,logbsize=262144
  # then simply add what you do often here
  # everyone's workload is different
  /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c \
    'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync'
done

Then just run:

grep : /root/*chunk* | sort -n

to get the results in the same format.

Justin.
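(For the ext3 question specifically: the same skeleton should work with the mkfs/mount lines swapped out; a hedged sketch using mke2fs's -E stride option to align the filesystem to the RAID chunk -- stride is the chunk size divided by the 4 KiB block size:)

mkfs.ext3 -b 4096 -E stride=$((i/4)) /dev/md3   # $i is the chunk size in KiB
mount /dev/md3 /r1 -o noatime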
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
What sort of tools are you using to get these benchmarks, and can I use them for ext3? Very interested in running this on my server.

Thanks,
Greg

On Jan 16, 2008 11:13 AM, Justin Piszcz <[EMAIL PROTECTED]> wrote:
> For these benchmarks I timed how long it takes to extract a standard 4.4
> GiB DVD:
>
> Settings: Software RAID 5 with the following settings (until I change
> those too):
>
> Base setup:
> blockdev --setra 65536 /dev/md3
> echo 16384 > /sys/block/md3/md/stripe_cache_size
> echo "Disabling NCQ on all disks..."
> for i in $DISKS
> do
>    echo "Disabling NCQ on $i"
>    echo 1 > /sys/block/"$i"/device/queue_depth
> done
>
> p34:~# grep : *chunk* | sort -n
> 4-chunk.txt:0:45.31
> 8-chunk.txt:0:44.32
> 16-chunk.txt:0:41.02
> 32-chunk.txt:0:40.50
> 64-chunk.txt:0:40.88
> 128-chunk.txt:0:40.21
> 256-chunk.txt:0:40.14***
> 512-chunk.txt:0:40.35
> 1024-chunk.txt:0:41.11
> 2048-chunk.txt:0:43.89
> 4096-chunk.txt:0:47.34
> 8192-chunk.txt:0:57.86
> 16384-chunk.txt:1:09.39
> 32768-chunk.txt:1:26.61
>
> It would appear a 256 KiB chunk-size is optimal.
>
> So what about NCQ?
>
> 1=ncq_depth.txt:0:40.86***
> 2=ncq_depth.txt:0:40.99
> 4=ncq_depth.txt:0:42.52
> 8=ncq_depth.txt:0:43.57
> 16=ncq_depth.txt:0:42.54
> 31=ncq_depth.txt:0:42.51
>
> Keeping it off seems best.
>
> 1=stripe_and_read_ahead.txt:0:40.86
> 2=stripe_and_read_ahead.txt:0:40.99
> 4=stripe_and_read_ahead.txt:0:42.52
> 8=stripe_and_read_ahead.txt:0:43.57
> 16=stripe_and_read_ahead.txt:0:42.54
> 31=stripe_and_read_ahead.txt:0:42.51
> 256=stripe_and_read_ahead.txt:1:44.16
> 1024=stripe_and_read_ahead.txt:1:07.01
> 2048=stripe_and_read_ahead.txt:0:53.59
> 4096=stripe_and_read_ahead.txt:0:45.66
> 8192=stripe_and_read_ahead.txt:0:40.73
> 16384=stripe_and_read_ahead.txt:0:38.99**
> 16384=stripe_and_65536_read_ahead.txt:0:38.67
> 16384=stripe_and_65536_read_ahead.txt:0:38.69 (again, this is what I use
> from earlier benchmarks)
> 32768=stripe_and_read_ahead.txt:0:38.84
>
> What about logbufs?
>
> 2=logbufs.txt:0:39.21
> 4=logbufs.txt:0:39.24
> 8=logbufs.txt:0:38.71
>
> (again)
>
> 2=logbufs.txt:0:42.16
> 4=logbufs.txt:0:38.79
> 8=logbufs.txt:0:38.71** (yes)
>
> What about logbsize?
>
> 16k=logbsize.txt:1:09.22
> 32k=logbsize.txt:0:38.70
> 64k=logbsize.txt:0:39.04
> 128k=logbsize.txt:0:39.06
> 256k=logbsize.txt:0:38.59** (best)
>
> What about allocsize? (default=1024k)
>
> 4k=allocsize.txt:0:39.35
> 8k=allocsize.txt:0:38.95
> 16k=allocsize.txt:0:38.79
> 32k=allocsize.txt:0:39.71
> 64k=allocsize.txt:1:09.67
> 128k=allocsize.txt:0:39.04
> 256k=allocsize.txt:0:39.11
> 512k=allocsize.txt:0:39.01
> 1024k=allocsize.txt:0:38.75** (default)
> 2048k=allocsize.txt:0:39.07
> 4096k=allocsize.txt:0:39.15
> 8192k=allocsize.txt:0:39.40
> 16384k=allocsize.txt:0:39.36
>
> What about the agcount?
>
> 2=agcount.txt:0:37.53
> 4=agcount.txt:0:38.56
> 8=agcount.txt:0:40.86
> 16=agcount.txt:0:39.05
> 32=agcount.txt:0:39.07** (default)
> 64=agcount.txt:0:39.29
> 128=agcount.txt:0:39.42
> 256=agcount.txt:0:38.76
> 512=agcount.txt:0:38.27
> 1024=agcount.txt:0:38.29
> 2048=agcount.txt:1:08.55
> 4096=agcount.txt:0:52.65
> 8192=agcount.txt:1:06.96
> 16384=agcount.txt:1:31.21
> 32768=agcount.txt:1:09.06
> 65536=agcount.txt:1:54.96
>
> So far I have:
>
> p34:~# mkfs.xfs -f -l lazy-count=1,version=2,size=128m -i attr=2 /dev/md3
> meta-data=/dev/md3        isize=256    agcount=32, agsize=10302272 blks
>          =                sectsz=4096  attr=2
> data     =                bsize=4096   blocks=329671296, imaxpct=25
>          =                sunit=64     swidth=576 blks, unwritten=1
> naming   =version 2       bsize=4096
> log      =internal log    bsize=4096   blocks=32768, version=2
>          =                sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none            extsz=2359296 blocks=0, rtextents=0
>
> p34:~# grep /dev/md3 /etc/fstab
> /dev/md3  /r1  xfs  noatime,nodiratime,logbufs=8,logbsize=262144  0  1
>
> Notice how mkfs.xfs 'knows' the sunit and swidth, and in the correct
> units too, because this is software raid and it pulls the information
> from that layer, unlike HW raid, which will not have a clue of what is
> underneath and will say sunit=0,swidth=0.
>
> However, in earlier testing I actually made them both 0 and it actually
> made performance better:
>
> http://home.comcast.net/~jpiszcz/sunit-swidth/results.html
>
> In any case, I am re-running bonnie++ once more with a 256 KiB chunk and
> will compare to those values in a bit.
>
> Justin.
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Wed, 16 Jan 2008, Al Boldi wrote:

> Justin Piszcz wrote:
> > For these benchmarks I timed how long it takes to extract a standard 4.4
> > GiB DVD:
> >
> > Settings: Software RAID 5 with the following settings (until I change
> > those too):
> >
> > Base setup:
> > blockdev --setra 65536 /dev/md3
> > echo 16384 > /sys/block/md3/md/stripe_cache_size
> > echo "Disabling NCQ on all disks..."
> > for i in $DISKS
> > do
> >    echo "Disabling NCQ on $i"
> >    echo 1 > /sys/block/"$i"/device/queue_depth
> > done
> >
> > p34:~# grep : *chunk* | sort -n
> > 4-chunk.txt:0:45.31
> > 8-chunk.txt:0:44.32
> > 16-chunk.txt:0:41.02
> > 32-chunk.txt:0:40.50
> > 64-chunk.txt:0:40.88
> > 128-chunk.txt:0:40.21
> > 256-chunk.txt:0:40.14***
> > 512-chunk.txt:0:40.35
> > 1024-chunk.txt:0:41.11
> > 2048-chunk.txt:0:43.89
> > 4096-chunk.txt:0:47.34
> > 8192-chunk.txt:0:57.86
> > 16384-chunk.txt:1:09.39
> > 32768-chunk.txt:1:26.61
> >
> > It would appear a 256 KiB chunk-size is optimal.
>
> Can you retest with different max_sectors_kb on both md and sd?

Remember this is SW RAID, so max_sectors_kb will only affect the individual disks underneath the SW RAID. I have benchmarked this in the past; the defaults chosen by the kernel are optimal, and changing them did not make any noticeable improvements.

> Also, can you retest using dd with different block-sizes?

I can do this, moment..

I know about oflag=direct but I choose to use dd with sync and measure the total time it takes.

/usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync'

So I was asked on the mailing list to test dd with various chunk sizes; here is the length of time it took to write 10 GiB and sync for each chunk size:

4=chunk.txt:0:25.46
8=chunk.txt:0:25.63
16=chunk.txt:0:25.26
32=chunk.txt:0:25.08
64=chunk.txt:0:25.55
128=chunk.txt:0:25.26
256=chunk.txt:0:24.72
512=chunk.txt:0:24.71
1024=chunk.txt:0:25.40
2048=chunk.txt:0:25.71
4096=chunk.txt:0:27.18
8192=chunk.txt:0:29.00
16384=chunk.txt:0:31.43
32768=chunk.txt:0:50.11
65536=chunk.txt:2:20.80

Justin.
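(For anyone who wants to experiment anyway, the knob lives in sysfs per block device; a minimal sketch assuming the same sdc through sdl member disks -- the value 128 here is purely illustrative:)

for d in sd{c..l}; do
  cat /sys/block/$d/queue/max_sectors_kb          # current per-disk value
  echo 128 > /sys/block/$d/queue/max_sectors_kb   # try a smaller maximum request size
done
cat /sys/block/md3/queue/max_sectors_kb           # the md device exposes the same knob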
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
Justin Piszcz wrote:
> For these benchmarks I timed how long it takes to extract a standard 4.4
> GiB DVD:
>
> Settings: Software RAID 5 with the following settings (until I change
> those too):
>
> Base setup:
> blockdev --setra 65536 /dev/md3
> echo 16384 > /sys/block/md3/md/stripe_cache_size
> echo "Disabling NCQ on all disks..."
> for i in $DISKS
> do
>    echo "Disabling NCQ on $i"
>    echo 1 > /sys/block/"$i"/device/queue_depth
> done
>
> p34:~# grep : *chunk* | sort -n
> 4-chunk.txt:0:45.31
> 8-chunk.txt:0:44.32
> 16-chunk.txt:0:41.02
> 32-chunk.txt:0:40.50
> 64-chunk.txt:0:40.88
> 128-chunk.txt:0:40.21
> 256-chunk.txt:0:40.14***
> 512-chunk.txt:0:40.35
> 1024-chunk.txt:0:41.11
> 2048-chunk.txt:0:43.89
> 4096-chunk.txt:0:47.34
> 8192-chunk.txt:0:57.86
> 16384-chunk.txt:1:09.39
> 32768-chunk.txt:1:26.61
>
> It would appear a 256 KiB chunk-size is optimal.

Can you retest with different max_sectors_kb on both md and sd?

Also, can you retest using dd with different block-sizes?

Thanks!

--
Al
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Wed, 16 Jan 2008, Justin Piszcz wrote:

> For these benchmarks I timed how long it takes to extract a standard 4.4
> GiB DVD:
>
> Settings: Software RAID 5 with the following settings (until I change
> those too):

http://home.comcast.net/~jpiszcz/sunit-swidth/newresults.html

Any idea why an sunit and swidth of 0 (and -d agcount=4) is faster, at least for sequential input/output, than the proper sunit/swidth values? It does not make sense.

Justin.
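(For anyone wanting to reproduce the comparison, the geometry can be forced at mkfs time; a hedged sketch -- sunit/swidth are given to mkfs.xfs in 512-byte sectors, so a 256 KiB chunk across 9 data disks would correspond to sunit=512, swidth=4608, and the second command forces the unaligned case from the results linked above:)

# aligned: let mkfs.xfs pull sunit/swidth from the md layer
mkfs.xfs -f /dev/md3

# unaligned: force sunit=0/swidth=0 (and agcount=4), as in the linked results
mkfs.xfs -f -d sunit=0,swidth=0,agcount=4 /dev/md3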
Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
For these benchmarks I timed how long it takes to extract a standard 4.4 GiB DVD:

Settings: Software RAID 5 with the following settings (until I change those too):

Base setup:

blockdev --setra 65536 /dev/md3
echo 16384 > /sys/block/md3/md/stripe_cache_size
echo "Disabling NCQ on all disks..."
for i in $DISKS
do
   echo "Disabling NCQ on $i"
   echo 1 > /sys/block/"$i"/device/queue_depth
done

p34:~# grep : *chunk* | sort -n
4-chunk.txt:0:45.31
8-chunk.txt:0:44.32
16-chunk.txt:0:41.02
32-chunk.txt:0:40.50
64-chunk.txt:0:40.88
128-chunk.txt:0:40.21
256-chunk.txt:0:40.14***
512-chunk.txt:0:40.35
1024-chunk.txt:0:41.11
2048-chunk.txt:0:43.89
4096-chunk.txt:0:47.34
8192-chunk.txt:0:57.86
16384-chunk.txt:1:09.39
32768-chunk.txt:1:26.61

It would appear a 256 KiB chunk-size is optimal.

So what about NCQ?

1=ncq_depth.txt:0:40.86***
2=ncq_depth.txt:0:40.99
4=ncq_depth.txt:0:42.52
8=ncq_depth.txt:0:43.57
16=ncq_depth.txt:0:42.54
31=ncq_depth.txt:0:42.51

Keeping it off seems best.

1=stripe_and_read_ahead.txt:0:40.86
2=stripe_and_read_ahead.txt:0:40.99
4=stripe_and_read_ahead.txt:0:42.52
8=stripe_and_read_ahead.txt:0:43.57
16=stripe_and_read_ahead.txt:0:42.54
31=stripe_and_read_ahead.txt:0:42.51
256=stripe_and_read_ahead.txt:1:44.16
1024=stripe_and_read_ahead.txt:1:07.01
2048=stripe_and_read_ahead.txt:0:53.59
4096=stripe_and_read_ahead.txt:0:45.66
8192=stripe_and_read_ahead.txt:0:40.73
16384=stripe_and_read_ahead.txt:0:38.99**
16384=stripe_and_65536_read_ahead.txt:0:38.67
16384=stripe_and_65536_read_ahead.txt:0:38.69 (again, this is what I use from earlier benchmarks)
32768=stripe_and_read_ahead.txt:0:38.84

What about logbufs?

2=logbufs.txt:0:39.21
4=logbufs.txt:0:39.24
8=logbufs.txt:0:38.71

(again)

2=logbufs.txt:0:42.16
4=logbufs.txt:0:38.79
8=logbufs.txt:0:38.71** (yes)

What about logbsize?

16k=logbsize.txt:1:09.22
32k=logbsize.txt:0:38.70
64k=logbsize.txt:0:39.04
128k=logbsize.txt:0:39.06
256k=logbsize.txt:0:38.59** (best)

What about allocsize? (default=1024k)

4k=allocsize.txt:0:39.35
8k=allocsize.txt:0:38.95
16k=allocsize.txt:0:38.79
32k=allocsize.txt:0:39.71
64k=allocsize.txt:1:09.67
128k=allocsize.txt:0:39.04
256k=allocsize.txt:0:39.11
512k=allocsize.txt:0:39.01
1024k=allocsize.txt:0:38.75** (default)
2048k=allocsize.txt:0:39.07
4096k=allocsize.txt:0:39.15
8192k=allocsize.txt:0:39.40
16384k=allocsize.txt:0:39.36

What about the agcount?

2=agcount.txt:0:37.53
4=agcount.txt:0:38.56
8=agcount.txt:0:40.86
16=agcount.txt:0:39.05
32=agcount.txt:0:39.07** (default)
64=agcount.txt:0:39.29
128=agcount.txt:0:39.42
256=agcount.txt:0:38.76
512=agcount.txt:0:38.27
1024=agcount.txt:0:38.29
2048=agcount.txt:1:08.55
4096=agcount.txt:0:52.65
8192=agcount.txt:1:06.96
16384=agcount.txt:1:31.21
32768=agcount.txt:1:09.06
65536=agcount.txt:1:54.96

So far I have:

p34:~# mkfs.xfs -f -l lazy-count=1,version=2,size=128m -i attr=2 /dev/md3
meta-data=/dev/md3        isize=256    agcount=32, agsize=10302272 blks
         =                sectsz=4096  attr=2
data     =                bsize=4096   blocks=329671296, imaxpct=25
         =                sunit=64     swidth=576 blks, unwritten=1
naming   =version 2       bsize=4096
log      =internal log    bsize=4096   blocks=32768, version=2
         =                sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none            extsz=2359296 blocks=0, rtextents=0

p34:~# grep /dev/md3 /etc/fstab
/dev/md3  /r1  xfs  noatime,nodiratime,logbufs=8,logbsize=262144  0  1

Notice how mkfs.xfs 'knows' the sunit and swidth, and in the correct units too, because this is software raid and mkfs.xfs pulls the information from that layer, unlike HW raid, which will not have a clue of what is underneath and will say sunit=0,swidth=0.
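(A sketch of how one of these single-knob sweeps can be driven, using logbsize as the example; the tar command is a hypothetical stand-in for whatever actually extracts the 4.4 GiB DVD image in these tests:)

for lb in 16384 32768 65536 131072 262144
do
  umount /r1
  mount /dev/md3 /r1 -o logbufs=8,logbsize=$lb
  /usr/bin/time -f %E -o ~/$lb=logbsize.txt bash -c \
    'tar xf /path/to/dvd.tar -C /r1; sync'   # placeholder for the DVD extraction
done
grep : ~/*logbsize* | sort -n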
However, in earlier testing I made sunit and swidth both 0, and it actually made performance better:

http://home.comcast.net/~jpiszcz/sunit-swidth/results.html

In any case, I am re-running bonnie++ once more with a 256 KiB chunk and will compare to those values in a bit.

Justin.