Re: Recommendations for servers running SATA drives [hot-swap]

2008-10-26 Thread Miroslav Lachman

Jeremy Chadwick wrote:

On Fri, Oct 17, 2008 at 04:09:17PM +0200, Miroslav Lachman wrote:


Jeremy Chadwick wrote:


On Fri, Oct 17, 2008 at 01:50:38PM +0200, Miroslav Lachman wrote:


Jeremy Chadwick wrote:


On Thu, Oct 16, 2008 at 09:30:20PM +0200, Miroslav Lachman wrote:

Today I was replacing disk in one Sun Fire X2100 M2 so I tried
hot-swapping. It was as you said: atacontrol detach ata3, replace 
the HDD, atacontrol attach ata3 and new disk is in the system. I 
tried it 3  times to be sure that it was not coincidence - no 
panic was produced ;o)
So in this case, hot-swapping on Sun Fire X2100 M2 with FreeBSD 
7.0 i386  works.



That's excellent news.  So it seems possibly the problem I was seeing
was with reinit causing some sort of chaos.  I'll have to check things
on my testbox here at home to see how I caused the panic last time.

Thanks for providing feedback, as usual!  :-)


Unfortunately there is one problem - I see a lot of interrupts after  
disk swapping (about 193k of atapci1)


Interrupts
197k total
   ohci0 21
   ehci0 22
193k atapci1 23
2001 cpu0: time
 1 bge1 273
2001 cpu1: time



Okay, so it looks like the interrupt rate on atapci1 after swapping is
going crazy.  What you're showing there looks like heavily modified
vmstat -i output.


The output shown is manually cropped from systat -vm, I'll try vmstat -i next
time. ;)




Full output of systat -vm 2 is attached.

It is shown in top as 50% interrupt (CPU state) and load 1 until I   
rebooted the machine (I can provide MRTG graphs). The system was not 
in  production load, but almost idle. (I will put it in production 
tomorrow).

After reboot, everything is OK.



And this box is running the ATA patch Andrey provided, yes?


It is a clean install of FreeBSD 7.0-RELEASE-p5 amd64 without patches.


Can somebody test hot-swapping with SATA drives and confirm this   
behavior? (I can't test it now, because machine is in datacenter)



I can test it on my P4SCE box.

I'll check the interrupt rates after each step of the hot-swap to see
if/when the problem starts.


I'll check the interrupts next time too and will post results to this  
thread.



As promised, here are notes from my testing:


First thing to note is that the BIOS on my P4SCE had the ICH5 SATA mode
set to Auto, which was causing PATA emulation to happen on the SATA
controller, e.g.  disk #0 == ata0-master, disk #1 == ata0-slave.

I changed the BIOS option from Auto to SATA Enhanced, and now the
disks show up on their own channels, e.g.  disk #0 == ata2-master, disk
#1 == ata3-master.

Here's the applicable data.  Note that this kernel ***DOES*** include
Andrey's ATA patch:

FreeBSD testbox.home.lan 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #0: Thu Oct 16 
10:56:42 PDT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/TESTBOX  i386

atapci1: Intel ICH5 SATA150 controller port 
0xc000-0xc007,0xc400-0xc403,0xc800-0xc807,0xcc00-0xcc03,0xd000-0xd00f irq 18 at 
device 31.2 on pci0
atapci1: [ITHREAD]
ata2: ATA channel 0 on atapci1
ata2: [ITHREAD]
ata3: ATA channel 1 on atapci1
ata3: [ITHREAD]

SATA controller is on IRQ 18.

ad4: 114473MB Seagate ST3120026AS 3.05 at ata2-master SATA150
ad6: 238474MB WDC WD2500KS-00MJB0 02.01C03 at ata3-master SATA150

ATA channel 2:
Master:  ad4 ST3120026AS/3.05 Serial ATA v1.0
Slave:   no device present
ATA channel 3:
Master:  ad6 WDC WD2500KS-00MJB0/02.01C03 Serial ATA II
Slave:   no device present

testbox# df -k
Filesystem  1024-blocks    Used     Avail Capacity  Mounted on
/dev/ad4s1a      507630  230182    236838      49%  /
devfs                 1       1         0     100%  /dev
/dev/ad4s1e      507630      12    467008       0%  /tmp
/dev/ad4s1f   108498334 2944826  96873642       3%  /usr
/dev/ad4s1d     2008622   32360   1815574       2%  /var
/dev/ad6s1d   236511738       4 217590796       0%  /hotswap

testbox# vmstat -i
interrupt            total   rate
irq4: sio0            1398     34
irq6: fdc0              10      0
irq15: ata1             58      1
irq18: atapci1         945     23
irq23: em1               8      0
cpu0: timer          80033   1952
cpu1: timer          79808   1946
Total               162260   3957

testbox# umount /hotswap
testbox# atacontrol detach ata3
subdisk6: detached
ad6: detached
testbox# vmstat -i | grep atapci1
irq18: atapci1  2671 11

At this point I wanted to see what happened if I just reattached without
any physical changes to the SATA bus.

testbox# atacontrol attach ata3
ata3: [ITHREAD]
ad6: 238474MB WDC WD2500KS-00MJB0 02.01C03 at ata3-master SATA150
Master:  ad6 WDC WD2500KS-00MJB0/02.01C03 Serial ATA II
Slave:   no device present

testbox# vmstat -i | grep atapci1
irq18: atapci1  2764  9
testbox# mount /dev/ad6s1d /hotswap
testbox# vmstat -i | grep atapci1

Re: Recommendations for servers running SATA drives [hot-swap]

2008-10-26 Thread Jeremy Chadwick
On Sun, Oct 26, 2008 at 01:41:58PM +0100, Miroslav Lachman wrote:
 Jeremy Chadwick wrote:
 On Fri, Oct 17, 2008 at 04:09:17PM +0200, Miroslav Lachman wrote:

 Jeremy Chadwick wrote:

 On Fri, Oct 17, 2008 at 01:50:38PM +0200, Miroslav Lachman wrote:

 Jeremy Chadwick wrote:

 On Thu, Oct 16, 2008 at 09:30:20PM +0200, Miroslav Lachman wrote:

 Today I was replacing disk in one Sun Fire X2100 M2 so I 
 tried hot-swapping. It was as you said: atacontrol detach 
 ata3, replace the HDD, atacontrol attach ata3 and new disk is 
 in the system. I tried it 3  times to be sure that it was not 
 coincidence - no panic was produced ;o)
 So in this case, hot-swapping on Sun Fire X2100 M2 with 
 FreeBSD 7.0 i386  works.


 That's excellent news.  So it seems possibly the problem I was seeing
 was with reinit causing some sort of chaos.  I'll have to check things
 on my testbox here at home to see how I caused the panic last time.

 Thanks for providing feedback, as usual!  :-)

 Unfortunately there is one problem - I see a lot of interrupts 
 after  disk swapping (about 193k of atapci1)

 Interrupts
 197k total
ohci0 21
ehci0 22
 193k atapci1 23
 2001 cpu0: time
  1 bge1 273
 2001 cpu1: time


 Okay, so it looks like the interrupt rate on atapci1 after swapping is
 going crazy.  What you're showing there looks like heavily modified
 vmstat -i output.

 The output shown is manually cropped from systat -vm, I'll try vmstat -i 
 next  time. ;)


 Full output of systat -vm 2 is attached.

 It is shown in top as 50% interrupt (CPU state) and load 1 until 
 I   rebooted the machine (I can provide MRTG graphs). The system 
 was not in  production load, but almost idle. (I will put it in 
 production tomorrow).
 After reboot, everything is OK.


 And this box is running the ATA patch Andrey provided, yes?

 It is a clean install of FreeBSD 7.0-RELEASE-p5 amd64 without patches.


 Can somebody test hot-swapping with SATA drives and confirm this  
  behavior? (I can't test it now, because machine is in 
 datacenter)


 I can test it on my P4SCE box.

 I'll check the interrupt rates after each step of the hot-swap to see
 if/when the problem starts.

 I'll check the interrupts next time too and will post results to this 
  thread.


 As promised, here are notes from my testing:


 First thing to note is that the BIOS on my P4SCE had the ICH5 SATA mode
 set to Auto, which was causing PATA emulation to happen on the SATA
 controller, e.g.  disk #0 == ata0-master, disk #1 == ata0-slave.

 I changed the BIOS option from Auto to SATA Enhanced, and now the
 disks show up on their own channels, e.g.  disk #0 == ata2-master, disk
 #1 == ata3-master.

 Here's the applicable data.  Note that this kernel ***DOES*** include
 Andrey's ATA patch:

 FreeBSD testbox.home.lan 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #0: Thu Oct 
 16 10:56:42 PDT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/TESTBOX  i386

 atapci1: Intel ICH5 SATA150 controller port 
 0xc000-0xc007,0xc400-0xc403,0xc800-0xc807,0xcc00-0xcc03,0xd000-0xd00f irq 18 
 at device 31.2 on pci0
 atapci1: [ITHREAD]
 ata2: ATA channel 0 on atapci1
 ata2: [ITHREAD]
 ata3: ATA channel 1 on atapci1
 ata3: [ITHREAD]

 SATA controller is on IRQ 18.

 ad4: 114473MB Seagate ST3120026AS 3.05 at ata2-master SATA150
 ad6: 238474MB WDC WD2500KS-00MJB0 02.01C03 at ata3-master SATA150

 ATA channel 2:
 Master:  ad4 ST3120026AS/3.05 Serial ATA v1.0
 Slave:   no device present
 ATA channel 3:
 Master:  ad6 WDC WD2500KS-00MJB0/02.01C03 Serial ATA II
 Slave:   no device present

 testbox# df -k
 Filesystem  1024-blocks    Used     Avail Capacity  Mounted on
 /dev/ad4s1a      507630  230182    236838      49%  /
 devfs                 1       1         0     100%  /dev
 /dev/ad4s1e      507630      12    467008       0%  /tmp
 /dev/ad4s1f   108498334 2944826  96873642       3%  /usr
 /dev/ad4s1d     2008622   32360   1815574       2%  /var
 /dev/ad6s1d   236511738       4 217590796       0%  /hotswap

 testbox# vmstat -i
 interrupt            total   rate
 irq4: sio0            1398     34
 irq6: fdc0              10      0
 irq15: ata1             58      1
 irq18: atapci1         945     23
 irq23: em1               8      0
 cpu0: timer          80033   1952
 cpu1: timer          79808   1946
 Total               162260   3957

 testbox# umount /hotswap
 testbox# atacontrol detach ata3
 subdisk6: detached
 ad6: detached
 testbox# vmstat -i | grep atapci1
 irq18: atapci1  2671 11

 At this point I wanted to see what happened if I just reattached without
 any physical changes to the SATA bus.

 testbox# atacontrol attach ata3
 ata3: [ITHREAD]
 ad6: 238474MB WDC WD2500KS-00MJB0 02.01C03 at ata3-master SATA150
 Master:  ad6 WDC WD2500KS-00MJB0/02.01C03 Serial ATA II
 Slave:   no device present

 

Re: Recommendations for servers running SATA drives [hot-swap]

2008-10-17 Thread Miroslav Lachman

Jeremy Chadwick wrote:

On Thu, Oct 16, 2008 at 09:30:20PM +0200, Miroslav Lachman wrote:

Today I was replacing disk in one Sun Fire X2100 M2 so I tried  
hot-swapping. It was as you said: atacontrol detach ata3, replace the  
HDD, atacontrol attach ata3 and new disk is in the system. I tried it 3  
times to be sure that it was not coincidence - no panic was produced ;o)
So in this case, hot-swapping on Sun Fire X2100 M2 with FreeBSD 7.0 i386  
works.



That's excellent news.  So it seems possibly the problem I was seeing
was with reinit causing some sort of chaos.  I'll have to check things
on my testbox here at home to see how I caused the panic last time.

Thanks for providing feedback, as usual!  :-)


Unfortunately there is one problem - I see a lot of interrupts after 
disk swapping (about 193k of atapci1)


Interrupts
197k total
 ohci0 21
 ehci0 22
193k atapci1 23
2001 cpu0: time
   1 bge1 273
2001 cpu1: time

Full output of systat -vm 2 is attached.

It is shown in top as 50% interrupt (CPU state) and load 1 until I 
rebooted the machine (I can provide MRTG graphs). The system was not in 
production load, but almost idle. (I will put it in production tomorrow).

After reboot, everything is OK.
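
To tell a real interrupt storm from a counter that is merely large, it can help to sample vmstat -i twice and look at the difference between the two totals. A minimal sketch, assuming vmstat -i prints lines of the form "irq23: atapci1 <total> <rate>"; the helper names irq_total and irq_delta_rate and the sample numbers are hypothetical and are not taken from the machine above:

```shell
#!/bin/sh
# Print the cumulative interrupt total for one source (e.g. atapci1)
# from `vmstat -i` style text read on stdin.
# Expected line shape: "irq18: atapci1   945   23" (name, total, rate).
irq_total() {
    awk -v src="$1" '$2 == src { print $(NF-1) }'
}

# Interrupts per second between two totals taken some seconds apart.
# $1 = earlier total, $2 = later total, $3 = seconds between samples
irq_delta_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Made-up samples, two seconds apart (illustration only).
before='irq23: atapci1   1000   10'
after='irq23: atapci1   387000   900'
t0=$(printf '%s\n' "$before" | irq_total atapci1)
t1=$(printf '%s\n' "$after" | irq_total atapci1)
echo "atapci1: $(irq_delta_rate "$t0" "$t1" 2) interrupts/s"
# prints: atapci1: 193000 interrupts/s
```

On a live system the two samples would come from running vmstat -i, sleeping for a known interval, and running it again.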

Can somebody test hot-swapping with SATA drives and confirm this 
behavior? (I can't test it now, because machine is in datacenter)


Miroslav Lachman
2 users    Load  1.00  1.00  0.99    Oct 17 00:25

Mem:KBREALVIRTUAL   VN PAGER   SWAP PAGER
Tot   Share  TotShareFree   in   out in   out
Act   400326212   118412 9352  509928  count
All   702007884  43716700  pages
Proc:Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Fltcow197k total
  3  45  387k6   75 193k  187 zfodohci0 21
  ozfod   ehci0 22
 0.7%Sys  45.9%Intr  0.0%User  0.0%Nice 53.4%Idle%ozfod  193k atapci1 23
|||||||||||   daefr  2001 cpu0: time
+++   prcfr 1 bge1 273
10 dtbuf  totfr  2001 cpu1: time
Namei Name-cache   Dir-cache 68955 desvn  react
   Callshits   %hits   % 58041 numvn  pdwak
 17234 frevn  pdpgs
  intrn
Disks   ad4   ad6  191128 wire
KB/t   0.00  0.00   59664 act
tps   0 0  242588 inact
MB/s   0.00  0.00   46108 cache
%busy 0 0  463820 free
   113488 buf
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: Recommendations for servers running SATA drives [hot-swap]

2008-10-17 Thread Jeremy Chadwick
On Fri, Oct 17, 2008 at 01:50:38PM +0200, Miroslav Lachman wrote:
 Jeremy Chadwick wrote:
 On Thu, Oct 16, 2008 at 09:30:20PM +0200, Miroslav Lachman wrote:

 Today I was replacing disk in one Sun Fire X2100 M2 so I tried   
 hot-swapping. It was as you said: atacontrol detach ata3, replace the 
  HDD, atacontrol attach ata3 and new disk is in the system. I tried 
 it 3  times to be sure that it was not coincidence - no panic was 
 produced ;o)
 So in this case, hot-swapping on Sun Fire X2100 M2 with FreeBSD 7.0 
 i386  works.


 That's excellent news.  So it seems possibly the problem I was seeing
 was with reinit causing some sort of chaos.  I'll have to check things
 on my testbox here at home to see how I caused the panic last time.

 Thanks for providing feedback, as usual!  :-)

 Unfortunately there is one problem - I see a lot of interrupts after  
 disk swapping (about 193k of atapci1)

 Interrupts
 197k total
  ohci0 21
  ehci0 22
 193k atapci1 23
 2001 cpu0: time
1 bge1 273
 2001 cpu1: time

Okay, so it looks like the interrupt rate on atapci1 after swapping is
going crazy.  What you're showing there looks like heavily modified
vmstat -i output.

 Full output of systat -vm 2 is attached.

 It is shown in top as 50% interrupt (CPU state) and load 1 until I  
 rebooted the machine (I can provide MRTG graphs). The system was not in  
 production load, but almost idle. (I will put it in production tomorrow).
 After reboot, everything is OK.

And this box is running the ATA patch Andrey provided, yes?

 Can somebody test hot-swapping with SATA drives and confirm this  
 behavior? (I can't test it now, because machine is in datacenter)

I can test it on my P4SCE box.

I'll check the interrupt rates after each step of the hot-swap to see
if/when the problem starts.
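
Checking the counter after each step is easy to script. The following is only a sketch: run_step, VMSTAT_CMD and IRQ_SOURCE are hypothetical names, and it assumes vmstat -i output lines of the form "irq18: atapci1 <total> <rate>".

```shell
#!/bin/sh
# Command used to read the interrupt counters; overridable for rehearsal.
: "${VMSTAT_CMD:=vmstat -i}"
# Interrupt source to watch (assumption: the SATA controller is atapci1).
IRQ_SOURCE=${IRQ_SOURCE:-atapci1}

# Run one step of the procedure, then print the counter line for
# IRQ_SOURCE prefixed with the step label, so the log shows at which
# step the interrupt count starts to climb.
# usage: run_step <label> <command> [args...]
run_step() {
    label=$1
    shift
    "$@" || { echo "$label: command failed" >&2; return 1; }
    # VMSTAT_CMD is left unquoted on purpose so "vmstat -i" splits in two.
    $VMSTAT_CMD | awk -v src="$IRQ_SOURCE" -v l="$label" \
        '$2 == src { print l ": total=" $(NF-1) " rate=" $NF }'
}

# Intended use (not executed here):
#   run_step umount  umount /hotswap
#   run_step detach  atacontrol detach ata3
#   run_step attach  atacontrol attach ata3
#   run_step mount   mount /dev/ad6s1d /hotswap
```

Keeping the label next to each counter makes it simple to compare runs with and without a physical disk swap.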

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



Re: Recommendations for servers running SATA drives [hot-swap]

2008-10-17 Thread Miroslav Lachman

Jeremy Chadwick wrote:

On Fri, Oct 17, 2008 at 01:50:38PM +0200, Miroslav Lachman wrote:


Jeremy Chadwick wrote:


On Thu, Oct 16, 2008 at 09:30:20PM +0200, Miroslav Lachman wrote:


Today I was replacing disk in one Sun Fire X2100 M2 so I tried   
hot-swapping. It was as you said: atacontrol detach ata3, replace the 
HDD, atacontrol attach ata3 and new disk is in the system. I tried 
it 3  times to be sure that it was not coincidence - no panic was 
produced ;o)
So in this case, hot-swapping on Sun Fire X2100 M2 with FreeBSD 7.0 
i386  works.



That's excellent news.  So it seems possibly the problem I was seeing
was with reinit causing some sort of chaos.  I'll have to check things
on my testbox here at home to see how I caused the panic last time.

Thanks for providing feedback, as usual!  :-)


Unfortunately there is one problem - I see a lot of interrupts after  
disk swapping (about 193k of atapci1)


Interrupts
197k total
ohci0 21
ehci0 22
193k atapci1 23
2001 cpu0: time
  1 bge1 273
2001 cpu1: time



Okay, so it looks like the interrupt rate on atapci1 after swapping is
going crazy.  What you're showing there looks like heavily modified
vmstat -i output.


The output shown is manually cropped from systat -vm, I'll try vmstat -i next 
time. ;)



Full output of systat -vm 2 is attached.

It is shown in top as 50% interrupt (CPU state) and load 1 until I  
rebooted the machine (I can provide MRTG graphs). The system was not in  
production load, but almost idle. (I will put it in production tomorrow).

After reboot, everything is OK.



And this box is running the ATA patch Andrey provided, yes?


It is a clean install of FreeBSD 7.0-RELEASE-p5 amd64 without patches.

Can somebody test hot-swapping with SATA drives and confirm this  
behavior? (I can't test it now, because machine is in datacenter)



I can test it on my P4SCE box.

I'll check the interrupt rates after each step of the hot-swap to see
if/when the problem starts.


I'll check the interrupts next time too and will post results to this 
thread.


Miroslav Lachman


Re: Recommendations for servers running SATA drives [hot-swap]

2008-10-17 Thread Jeremy Chadwick
On Fri, Oct 17, 2008 at 04:09:17PM +0200, Miroslav Lachman wrote:
 Jeremy Chadwick wrote:
 On Fri, Oct 17, 2008 at 01:50:38PM +0200, Miroslav Lachman wrote:
 Jeremy Chadwick wrote:
 On Thu, Oct 16, 2008 at 09:30:20PM +0200, Miroslav Lachman wrote:


 Today I was replacing disk in one Sun Fire X2100 M2 so I tried
 hot-swapping. It was as you said: atacontrol detach ata3, replace 
 the HDD, atacontrol attach ata3 and new disk is in the system. I 
 tried it 3  times to be sure that it was not coincidence - no 
 panic was produced ;o)
 So in this case, hot-swapping on Sun Fire X2100 M2 with FreeBSD 
 7.0 i386  works.


 That's excellent news.  So it seems possibly the problem I was seeing
 was with reinit causing some sort of chaos.  I'll have to check things
 on my testbox here at home to see how I caused the panic last time.

 Thanks for providing feedback, as usual!  :-)

 Unfortunately there is one problem - I see a lot of interrupts after  
 disk swapping (about 193k of atapci1)

 Interrupts
 197k total
 ohci0 21
 ehci0 22
 193k atapci1 23
 2001 cpu0: time
   1 bge1 273
 2001 cpu1: time


 Okay, so it looks like the interrupt rate on atapci1 after swapping is
 going crazy.  What you're showing there looks like heavily modified
 vmstat -i output.

 The output shown is manually cropped from systat -vm, I'll try vmstat -i next  
 time. ;)

 Full output of systat -vm 2 is attached.

 It is shown in top as 50% interrupt (CPU state) and load 1 until I   
 rebooted the machine (I can provide MRTG graphs). The system was not 
 in  production load, but almost idle. (I will put it in production 
 tomorrow).
 After reboot, everything is OK.


 And this box is running the ATA patch Andrey provided, yes?

 It is a clean install of FreeBSD 7.0-RELEASE-p5 amd64 without patches.

 Can somebody test hot-swapping with SATA drives and confirm this   
 behavior? (I can't test it now, because machine is in datacenter)


 I can test it on my P4SCE box.

 I'll check the interrupt rates after each step of the hot-swap to see
 if/when the problem starts.

 I'll check the interrupts next time too and will post results to this  
 thread.

As promised, here are notes from my testing:


First thing to note is that the BIOS on my P4SCE had the ICH5 SATA mode
set to Auto, which was causing PATA emulation to happen on the SATA
controller, e.g.  disk #0 == ata0-master, disk #1 == ata0-slave.

I changed the BIOS option from Auto to SATA Enhanced, and now the
disks show up on their own channels, e.g.  disk #0 == ata2-master, disk
#1 == ata3-master.

Here's the applicable data.  Note that this kernel ***DOES*** include
Andrey's ATA patch:

FreeBSD testbox.home.lan 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #0: Thu Oct 16 
10:56:42 PDT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/TESTBOX  i386

atapci1: Intel ICH5 SATA150 controller port 
0xc000-0xc007,0xc400-0xc403,0xc800-0xc807,0xcc00-0xcc03,0xd000-0xd00f irq 18 at 
device 31.2 on pci0
atapci1: [ITHREAD]
ata2: ATA channel 0 on atapci1
ata2: [ITHREAD]
ata3: ATA channel 1 on atapci1
ata3: [ITHREAD]

SATA controller is on IRQ 18.

ad4: 114473MB Seagate ST3120026AS 3.05 at ata2-master SATA150
ad6: 238474MB WDC WD2500KS-00MJB0 02.01C03 at ata3-master SATA150

ATA channel 2:
Master:  ad4 ST3120026AS/3.05 Serial ATA v1.0
Slave:   no device present
ATA channel 3:
Master:  ad6 WDC WD2500KS-00MJB0/02.01C03 Serial ATA II
Slave:   no device present
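
The channel-to-disk mapping can be pulled out of atacontrol list output mechanically, which is handy when double-checking which channel to detach. A small sketch, assuming the output layout shown above; the function name list_ata_devices is hypothetical and the sample text is copied from this message:

```shell
#!/bin/sh
# Read `atacontrol list` style text on stdin and print one line per
# populated slot, e.g. "ata2-master ad4". Empty slots are skipped.
list_ata_devices() {
    awk '
        /^ATA channel/ { ch = $3; sub(/:$/, "", ch) }
        /^ *(Master|Slave):/ && $2 != "no" {
            role = tolower($1); sub(/:$/, "", role)
            print "ata" ch "-" role, $2
        }
    '
}

# Sample input taken from the output above.
list_ata_devices <<'EOF'
ATA channel 2:
Master:  ad4 ST3120026AS/3.05 Serial ATA v1.0
Slave:   no device present
ATA channel 3:
Master:  ad6 WDC WD2500KS-00MJB0/02.01C03 Serial ATA II
Slave:   no device present
EOF
# prints:
#   ata2-master ad4
#   ata3-master ad6
```

On a live system the same function would read from atacontrol list directly.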

testbox# df -k
Filesystem  1024-blocks    Used     Avail Capacity  Mounted on
/dev/ad4s1a      507630  230182    236838      49%  /
devfs                 1       1         0     100%  /dev
/dev/ad4s1e      507630      12    467008       0%  /tmp
/dev/ad4s1f   108498334 2944826  96873642       3%  /usr
/dev/ad4s1d     2008622   32360   1815574       2%  /var
/dev/ad6s1d   236511738       4 217590796       0%  /hotswap

testbox# vmstat -i
interrupt            total   rate
irq4: sio0            1398     34
irq6: fdc0              10      0
irq15: ata1             58      1
irq18: atapci1         945     23
irq23: em1               8      0
cpu0: timer          80033   1952
cpu1: timer          79808   1946
Total               162260   3957

testbox# umount /hotswap
testbox# atacontrol detach ata3
subdisk6: detached
ad6: detached
testbox# vmstat -i | grep atapci1
irq18: atapci1  2671 11

At this point I wanted to see what happened if I just reattached without
any physical changes to the SATA bus.

testbox# atacontrol attach ata3
ata3: [ITHREAD]
ad6: 238474MB WDC WD2500KS-00MJB0 02.01C03 at ata3-master SATA150
Master:  ad6 WDC WD2500KS-00MJB0/02.01C03 Serial ATA II
Slave:   no device present

testbox# vmstat -i | grep atapci1
irq18: atapci1  2764  9
testbox# mount /dev/ad6s1d /hotswap
testbox# vmstat -i | grep atapci1
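
A note on reading these numbers: the rate column of vmstat -i is the total divided by uptime, so it is an average since boot and keeps falling on an idle box. A rough worked check on the two samples above (2671 at rate 11 after the detach, 2764 at rate 9 after the reattach); the helper name between_samples is hypothetical, and the uptime estimate is coarse because the rate is printed as an integer:

```shell
#!/bin/sh
# usage: between_samples total0 rate0 total1 rate1
# Estimates uptime at each sample as total/rate (integer arithmetic,
# so only approximate), then reports how many interrupts arrived
# between the two samples and over roughly how many seconds.
between_samples() {
    t0=$(( $1 / $2 ))
    t1=$(( $3 / $4 ))
    echo "$(( $3 - $1 )) interrupts in roughly $(( t1 - t0 ))s"
}

between_samples 2671 11 2764 9
# prints: 93 interrupts in roughly 65s
```

That works out to a few interrupts per second at most on irq18, which suggests that a plain detach/attach cycle without a physical swap did not start a storm on this box.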

Re: Recommendations for servers running SATA drives [hot-swap]

2008-10-17 Thread Miroslav Lachman

Jeremy Chadwick wrote:

On Fri, Oct 17, 2008 at 04:09:17PM +0200, Miroslav Lachman wrote:


Jeremy Chadwick wrote:


On Fri, Oct 17, 2008 at 01:50:38PM +0200, Miroslav Lachman wrote:


Jeremy Chadwick wrote:


On Thu, Oct 16, 2008 at 09:30:20PM +0200, Miroslav Lachman wrote:



Today I was replacing disk in one Sun Fire X2100 M2 so I tried
hot-swapping. It was as you said: atacontrol detach ata3, replace 
the HDD, atacontrol attach ata3 and new disk is in the system. I 
tried it 3  times to be sure that it was not coincidence - no 
panic was produced ;o)
So in this case, hot-swapping on Sun Fire X2100 M2 with FreeBSD 
7.0 i386  works.



That's excellent news.  So it seems possibly the problem I was seeing
was with reinit causing some sort of chaos.  I'll have to check things
on my testbox here at home to see how I caused the panic last time.

Thanks for providing feedback, as usual!  :-)


Unfortunately there is one problem - I see a lot of interrupts after  
disk swapping (about 193k of atapci1)


Interrupts
197k total
   ohci0 21
   ehci0 22
193k atapci1 23
2001 cpu0: time
 1 bge1 273
2001 cpu1: time



Okay, so it looks like the interrupt rate on atapci1 after swapping is
going crazy.  What you're showing there looks like heavily modified
vmstat -i output.


The output shown is manually cropped from systat -vm, I'll try vmstat -i next  
time. ;)




Full output of systat -vm 2 is attached.

It is shown in top as 50% interrupt (CPU state) and load 1 until I   
rebooted the machine (I can provide MRTG graphs). The system was not 
in  production load, but almost idle. (I will put it in production 
tomorrow).

After reboot, everything is OK.



And this box is running the ATA patch Andrey provided, yes?


It is a clean install of FreeBSD 7.0-RELEASE-p5 amd64 without patches.


Can somebody test hot-swapping with SATA drives and confirm this   
behavior? (I can't test it now, because machine is in datacenter)



I can test it on my P4SCE box.

I'll check the interrupt rates after each step of the hot-swap to see
if/when the problem starts.


I'll check the interrupts next time too and will post results to this  
thread.



As promised, here are notes from my testing:


First thing to note is that the BIOS on my P4SCE had the ICH5 SATA mode
set to Auto, which was causing PATA emulation to happen on the SATA
controller, e.g.  disk #0 == ata0-master, disk #1 == ata0-slave.

I changed the BIOS option from Auto to SATA Enhanced, and now the
disks show up on their own channels, e.g.  disk #0 == ata2-master, disk
#1 == ata3-master.

Here's the applicable data.  Note that this kernel ***DOES*** include
Andrey's ATA patch:

FreeBSD testbox.home.lan 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #0: Thu Oct 16 
10:56:42 PDT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/TESTBOX  i386

atapci1: Intel ICH5 SATA150 controller port 
0xc000-0xc007,0xc400-0xc403,0xc800-0xc807,0xcc00-0xcc03,0xd000-0xd00f irq 18 at 
device 31.2 on pci0
atapci1: [ITHREAD]
ata2: ATA channel 0 on atapci1
ata2: [ITHREAD]
ata3: ATA channel 1 on atapci1
ata3: [ITHREAD]

SATA controller is on IRQ 18.

ad4: 114473MB Seagate ST3120026AS 3.05 at ata2-master SATA150
ad6: 238474MB WDC WD2500KS-00MJB0 02.01C03 at ata3-master SATA150

ATA channel 2:
Master:  ad4 ST3120026AS/3.05 Serial ATA v1.0
Slave:   no device present
ATA channel 3:
Master:  ad6 WDC WD2500KS-00MJB0/02.01C03 Serial ATA II
Slave:   no device present

testbox# df -k
Filesystem  1024-blocks    Used     Avail Capacity  Mounted on
/dev/ad4s1a      507630  230182    236838      49%  /
devfs                 1       1         0     100%  /dev
/dev/ad4s1e      507630      12    467008       0%  /tmp
/dev/ad4s1f   108498334 2944826  96873642       3%  /usr
/dev/ad4s1d     2008622   32360   1815574       2%  /var
/dev/ad6s1d   236511738       4 217590796       0%  /hotswap

testbox# vmstat -i
interrupt            total   rate
irq4: sio0            1398     34
irq6: fdc0              10      0
irq15: ata1             58      1
irq18: atapci1         945     23
irq23: em1               8      0
cpu0: timer          80033   1952
cpu1: timer          79808   1946
Total               162260   3957

testbox# umount /hotswap
testbox# atacontrol detach ata3
subdisk6: detached
ad6: detached
testbox# vmstat -i | grep atapci1
irq18: atapci1  2671 11

At this point I wanted to see what happened if I just reattached without
any physical changes to the SATA bus.

testbox# atacontrol attach ata3
ata3: [ITHREAD]
ad6: 238474MB WDC WD2500KS-00MJB0 02.01C03 at ata3-master SATA150
Master:  ad6 WDC WD2500KS-00MJB0/02.01C03 Serial ATA II
Slave:   no device present

testbox# vmstat -i | grep atapci1
irq18: atapci1  2764  9
testbox# mount /dev/ad6s1d /hotswap
testbox# vmstat -i | grep 

Re: Recommendations for servers running SATA drives [hot-swap]

2008-10-16 Thread Miroslav Lachman

Jeremy Chadwick wrote:

On Mon, Sep 29, 2008 at 05:25:32PM +0200, Miroslav Lachman wrote:

It was about a year ago with Asus and Sun Fire X2100. I don't have Asus
servers now (all returned under warranty claims). Now I am running one X2100
and about ten X2100 M2. I have one spare X2100 M2, so if somebody has the
exact order of commands used to hot-swap the disk, I can test it in a
few days.



I believe the correct order of operation is to do a detach on the
channel before physically removing the disk, insert the new disk, then
do attach on the same channel.  list should be done afterwards to
ensure the new disk shows up.
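
The order described above can be written down as two small helpers, one for each side of the physical swap. This is only a sketch: hotswap_detach, hotswap_attach and the ATACONTROL variable are hypothetical names, and the only atacontrol subcommands used are the ones named in this thread (detach, attach, list).

```shell
#!/bin/sh
# ATACONTROL can be set to "echo atacontrol" to rehearse the sequence
# without touching any hardware.
ATACONTROL=${ATACONTROL:-atacontrol}

# Step 1: detach the channel BEFORE pulling the old disk.
# $1 = channel name, e.g. ata3
hotswap_detach() {
    # ATACONTROL is unquoted on purpose so a rehearsal value can split.
    $ATACONTROL detach "$1"
}

# Step 2 (after the new disk is seated): attach the same channel, then
# list all channels to confirm the new disk's model string shows up.
# $1 = channel name, e.g. ata3
hotswap_attach() {
    $ATACONTROL attach "$1" && $ATACONTROL list
}

# Intended use (not executed here):
#   hotswap_detach ata3
#   ... remove the old disk, insert the new one ...
#   hotswap_attach ata3
```

Splitting the procedure in two keeps the physical swap as an explicit manual step between the commands.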

If you want me to verify for certain, I have a test box built in the
other room which has a SATA hot-swap backplane on it.

I've also seen cases where the attach works, but upon doing list,
the old disk ID/string is still shown.  In this case I had to do a
detach, remove the disk, insert the new disk, reinit, then an
attach for things to work.

Finally, I've also seen the kernel panic or hard-lock after running
reinit, but this may have had something to do with Intel MatrixRAID.


Today I was replacing disk in one Sun Fire X2100 M2 so I tried 
hot-swapping. It was as you said: atacontrol detach ata3, replace the 
HDD, atacontrol attach ata3 and new disk is in the system. I tried it 3 
times to be sure that it was not coincidence - no panic was produced ;o)
So in this case, hot-swapping on Sun Fire X2100 M2 with FreeBSD 7.0 i386 
works.


Miroslav Lachman


# atacontrol list
ATA channel 0:
Master:  no device present
Slave:   no device present
ATA channel 1:
Master:  no device present
Slave:   no device present
ATA channel 2:
Master:  ad4 Hitachi HDP725050GLA360/GM4OA52A Serial ATA II
Slave:   no device present
ATA channel 3:
Master:  ad6 Hitachi HDP725050GLA360/GM4OA52A Serial ATA II
Slave:   no device present

# atacontrol detach ata3
subdisk6: detached
ad6: detached
GEOM_MIRROR: Device gm0: provider ad6 disconnected

# atacontrol list
ATA channel 0:
Master:  no device present
Slave:   no device present
ATA channel 1:
Master:  no device present
Slave:   no device present
ATA channel 2:
Master:  ad4 Hitachi HDP725050GLA360/GM4OA52A Serial ATA II
Slave:   no device present
ATA channel 3:
Master:  no device present
Slave:   no device present

## [old disk was physically removed]

## [new disk was physically inserted]

# atacontrol attach ata3
ata3: [ITHREAD]
ad6: 953869MB SAMSUNG HD103UJ 1AA01113 at ata3-master SATA300
Master:  ad6 SAMSUBF HD103UJ/1AA01113 Serial ATA II
Slave:   no device present

# atacontrol list
ATA channel 0:
Master:  no device present
Slave:   no device present
ATA channel 1:
Master:  no device present
Slave:   no device present
ATA channel 2:
Master:  ad4 Hitachi HDP725050GLA360/GM4OA52A Serial ATA II
Slave:   no device present
ATA channel 3:
Master:  ad6 SAMSUNG HD103UJ/1AA01113 Serial ATA II
Slave:   no device present
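
Since the detach above disconnected ad6 from the gm0 mirror, the new disk still has to be added back before the mirror is redundant again. A hedged sketch of one way to do that with gmirror; mirror_readd and the GMIRROR variable are hypothetical names, and whether the forget step is needed depends on how the old component was dropped:

```shell
#!/bin/sh
# GMIRROR can be set to "echo gmirror" to rehearse the commands.
GMIRROR=${GMIRROR:-gmirror}

# Re-add a replaced disk to a mirror.
# $1 = mirror name (gm0 in the log above), $2 = new provider (ad6)
mirror_readd() {
    # Drop components that are no longer connected, insert the new
    # provider, then show the status so the rebuild can be watched.
    $GMIRROR forget "$1" &&
    $GMIRROR insert "$1" "$2" &&
    $GMIRROR status "$1"
}

# Intended use (not executed here):
#   mirror_readd gm0 ad6
```

The new provider has to be at least as large as the mirror for the insert to succeed.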



Re: Recommendations for servers running SATA drives [hot-swap]

2008-10-16 Thread Jeremy Chadwick
On Thu, Oct 16, 2008 at 09:30:20PM +0200, Miroslav Lachman wrote:
 Today I was replacing disk in one Sun Fire X2100 M2 so I tried  
 hot-swapping. It was as you said: atacontrol detach ata3, replace the  
 HDD, atacontrol attach ata3 and new disk is in the system. I tried it 3  
 times to be sure that it was not coincidence - no panic was produced ;o)
 So in this case, hot-swapping on Sun Fire X2100 M2 with FreeBSD 7.0 i386  
 works.

That's excellent news.  So it seems possibly the problem I was seeing
was with reinit causing some sort of chaos.  I'll have to check things
on my testbox here at home to see how I caused the panic last time.

Thanks for providing feedback, as usual!  :-)

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



Re: Recommendations for servers running SATA drives

2008-09-29 Thread Miroslav Lachman

Jeremy Chadwick wrote:


On Sat, Sep 27, 2008 at 03:16:11PM -0400, Charles Sprickman wrote:


On Fri, 26 Sep 2008, Jeremy Chadwick wrote:

[...]

This also leads me a little off-topic -- when it comes to disk
replacements, administrators want to be able to do this without taking
the system down.  There are problems with this, but it often depends
greatly on hardware and BIOS configuration.

I've successfully done a hot-swap (hardware: SATA hot-swap backplane,
AHCI in use, SATA2 disks), but it required me to issue atacontrol
detach first (I am very curious to know what would've happened had I
just yanked the disk).  Upon inserting the new disk, one has to be
*very* careful about the order of atacontrol commands given -- there
are cases where attach will cause the system to panic or SATA bus to
lock up, but it seems to depend upon what commands were executed
previously (such as reinit).

Sorry if this is off-topic, but I wanted to mention it.


Hot-swapping is totally unpredictable on FreeBSD (in my experience). I
tried it many times on Asus 1U servers and on Sun Fire X2100 / X2100 M2
with FreeBSD 6.2 and 7.0 (both i386). It sometimes panics on atacontrol
detach, but never panics if the disk was marked as failed by gmirror and
detached by the system itself, then just removed from the running machine.
It sometimes panics immediately after re-insertion of the disk, sometimes
after atacontrol attach. Sometimes it detects and attaches the disk without
my intervention, so I can easily insert the disk into gmirror.
Then I stopped playing with hot-swapping and now always power off
before swapping disks.


Miroslav Lachman


Re: Recommendations for servers running SATA drives

2008-09-29 Thread Gavin Atkinson
On Mon, 2008-09-29 at 15:43 +0200, Miroslav Lachman wrote:
 Jeremy Chadwick wrote:
 
  On Sat, Sep 27, 2008 at 03:16:11PM -0400, Charles Sprickman wrote:
  
  I've successfully done a hot-swap (hardware: SATA hot-swap backplane,
  AHCI in use, SATA2 disks), but it required me to issue atacontrol
  detach first (I am very curious to know what would've happened had I
  just yanked the disk).  Upon inserting the new disk, one has to be
  *very* careful about the order of atacontrol commands given -- there
  are cases where attach will cause the system to panic or SATA bus to
  lock up, but it seems to depend upon what commands were executed
  previously (such as reinit).
  
  Sorry if this is off-topic, but I wanted to mention it.
 
 Hot-swapping is totally unpredictable on FreeBSD (from my experiences). I
 tried it many times on Asus 1U servers and on Sun Fire X2100 / X2100 M2 
 with FreeBSD 6.2 and 7.0 (both i386).

I can't speak for the Dell, but I can say that at least on the
X2100, not even Solaris supports either hot-swapping or the built-in
software RAID.  When they were first released the advertising said that
they had these, but those claims were quietly removed from the website
some weeks after release.  Short answer: give up on hot-swap on the X2100.

As for the X2100 M2, that is supposed to support it, and I believe it
works fine for us under Solaris.  I'm not sure if I've got any spare
M2's here, if so I'll have a play.

Gavin


Re: Recommendations for servers running SATA drives [hot-swap]

2008-09-29 Thread Miroslav Lachman

Gavin Atkinson wrote:

On Mon, 2008-09-29 at 15:43 +0200, Miroslav Lachman wrote:


Jeremy Chadwick wrote:



On Sat, Sep 27, 2008 at 03:16:11PM -0400, Charles Sprickman wrote:

I've successfully done a hot-swap (hardware: SATA hot-swap backplane,
AHCI in use, SATA2 disks), but it required me to issue atacontrol
detach first (I am very curious to know what would've happened had I
just yanked the disk).  Upon inserting the new disk, one has to be
*very* careful about the order of atacontrol commands given -- there
are cases where attach will cause the system to panic or SATA bus to
lock up, but it seems to depend upon what commands were executed
previously (such as reinit).

Sorry if this is off-topic, but I wanted to mention it.


Hot-swapping is totally unpredictable on FreeBSD (from my experiences). I
tried it many times on Asus 1U servers and on Sun Fire X2100 / X2100 M2 
with FreeBSD 6.2 and 7.0 (both i386).



I can't speak for the Dell, but I can say that at least on the
X2100, not even Solaris supports either hot-swapping or the built-in
software RAID.  When they were first released the advertising said that
they had these, but those claims were quietly removed from the website
some weeks after release.  Short answer: give up on hot-swap on the X2100.

As for the X2100 M2, that is supposed to support it, and I believe it
works fine for us under Solaris.  I'm not sure if I've got any spare
M2's here, if so I'll have a play.


That was about a year ago, with the Asus servers and a Sun Fire X2100. I
don't have the Asus servers now (all returned as warranty claims). Now I
am running one X2100 and about ten X2100 M2. I have one spare X2100 M2, so
if somebody has the exact order of commands used to hot-swap the disk, I
can test it in a few days.


Miroslav Lachman


Re: Recommendations for servers running SATA drives [hot-swap]

2008-09-29 Thread Jeremy Chadwick
On Mon, Sep 29, 2008 at 05:25:32PM +0200, Miroslav Lachman wrote:
 That was about a year ago, with the Asus servers and a Sun Fire X2100. I
 don't have the Asus servers now (all returned as warranty claims). Now I
 am running one X2100 and about ten X2100 M2. I have one spare X2100 M2, so
 if somebody has the exact order of commands used to hot-swap the disk, I
 can test it in a few days.

I believe the correct order of operations is to run atacontrol detach on
the channel before physically removing the disk, insert the new disk, then
run atacontrol attach on the same channel.  atacontrol list should be run
afterwards to ensure the new disk shows up.
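
A minimal sketch of that sequence as a script, in case it helps with
testing.  Only the atacontrol detach, attach and list subcommands come
from the description above; the channel name ata3 is just an example, and
the DRY_RUN wrapper and function names are hypothetical additions so the
commands can be reviewed before anything touches the bus.

```shell
#!/bin/sh
# Hot-swap sketch: detach the channel, swap the disk by hand, attach, list.
# DRY_RUN defaults to 1, so by default this only prints what it would do.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Step 1: run before pulling the old disk.
swap_detach() {
    run atacontrol detach "$1"
}

# Step 2: run after the new disk is seated in the bay.
swap_attach() {
    run atacontrol attach "$1" && run atacontrol list
}

# Example (assumed channel name):
#   swap_detach ata3    # then physically replace the disk
#   swap_attach ata3
```

Set DRY_RUN=0 only once the printed commands match the channel you
actually intend to touch.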

If you want me to verify for certain, I have a test box built in the
other room which has a SATA hot-swap backplane on it.

I've also seen cases where the attach works, but upon doing list,
the old disk ID/string is still shown.  In this case I had to do a
detach, remove the disk, insert the new disk, reinit, then an
attach for things to work.
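
For the stale-ID case, the fallback order could be sketched like this.
Again the channel name, the DRY_RUN wrapper and the function names are
assumptions for illustration, and reinit is known to be risky on some
hardware, so this is something to try on a test box first.

```shell
#!/bin/sh
# Fallback sketch for when list still shows the old disk after a swap:
# detach, swap the disk by hand, reinit, attach, list.
# DRY_RUN defaults to 1, so by default this only prints what it would do.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Step 1: run before pulling the disk.
stale_detach() {
    run atacontrol detach "$1"
}

# Step 2: run after the new disk is inserted.
# Caution: reinit has been seen to panic or hard-lock some systems.
stale_reinit_attach() {
    run atacontrol reinit "$1" &&
    run atacontrol attach "$1" &&
    run atacontrol list
}

# Example (assumed channel name):
#   stale_detach ata3          # then physically replace the disk
#   stale_reinit_attach ata3
```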

Finally, I've also seen the kernel panic or hard-lock after running
reinit, but this may have had something to do with Intel MatrixRAID.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



Recommendations for servers running SATA drives

2008-09-27 Thread Charles Sprickman
I'm forking the thread on fsck/soft-updates in hopes of getting some 
practical advice based on the discussion here of background fsck, 
softupdates and write-caching on SATA drives.


On Fri, 26 Sep 2008, Jeremy Chadwick wrote:


Let's be realistic.  We're talking about ATA and SATA hard disks, hooked
up to on-board controllers -- these are the majority of users.  Those
with ATA/SATA RAID controllers (not on-board RAID either; most/all of
those do not let you disable drive write caching) *might* have a RAID
BIOS menu item for disabling said feature.


While I would love to deploy every server with SAS, that's not practical 
in many cases, especially for light-duty servers that are not being pushed 
very hard.  I am taking my chances with multiple affordable drives and 
gmirror where I cannot throw in a 3Ware card.  I imagine that many 
non-desktop FreeBSD users are doing the same considering you can fetch a 
decent 1U box with plenty of storage for not much more than $1K.  I assume 
many here are in agreement on this point -- just making it clear that the 
bargain crowd is not some weird edge case in the userbase...



Regardless of all of this, end-users should, in no way shape or form,
be expected to go to great lengths to disable their disk's write cache.
They will not, I can assure you.  Thus, we must assume: write caching
on a disk will be enabled, period.  If a filesystem is engineered with
that fact ignored, then the filesystem is either 1) worthless, or 2)
serves a very niche purpose and should not be the default filesystem.


Arguments about defaults aside, this is my first question.  If I've got a
server with multiple SATA drives mirrored with gmirror, is turning on 
write-caching a good idea?  What kind of performance impact should I 
expect?  What is the relationship between caching, soft-updates, and 
either NCQ or TCQ?


Here's an example of a Seagate, trimmed for brevity:

Protocol              Serial ATA v1.0
device model          ST3160811AS

Feature                      Support  Enable    Value           Vendor
write cache                    yes      yes
read ahead                     yes      yes
Native Command Queuing (NCQ)   yes       -      31/0x1F
Tagged Command Queuing (TCQ)   no       no      31/0x1F

TCQ is clearly not supported, NCQ seems to be supported, but I don't know 
how to tell if it's actually enabled or not.  Write-caching is currently 
on.
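
For anyone wanting to check the same thing on their own box, here is a
rough sketch that pulls the Support and Enable columns for the write
cache out of atacontrol cap output.  The function name and the device
name ad4 are assumptions, and the column layout is taken from the output
shown above, so it may need adjusting on other releases.

```shell
#!/bin/sh
# Print the Support and Enable columns of the "write cache" line from
# atacontrol cap output supplied on stdin.
write_cache_state() {
    awk '/^[[:space:]]*write cache/ { print "support=" $3, "enable=" $4 }'
}

# On a live system (device name is an assumption):
#   atacontrol cap ad4 | write_cache_state

# Demo using two feature lines in the layout shown above:
write_cache_state <<'EOF'
write cache                    yes      yes
read ahead                     yes      yes
EOF
```

As far as I know, the ata(4) loader tunable hw.ata.wc (set to 0 in
/boot/loader.conf) is the usual way to turn write caching off at boot,
but that is worth verifying against the man page for your release.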


The tradeoff is apparently performance vs. more reliable recovery should 
the machine lose power, smoke itself, etc., but all I've seen is anecdotal 
evidence of how bad performance gets.


FWIW, this machine in particular had its mainboard go up in smoke last
week.  One drive was too far gone for gmirror to rebuild it without doing
a forget and insert.  The remaining drive was too screwy for
background fsck, but a manual check in single-user left me with no real
surprises or problems.
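
The forget and insert steps looked roughly like the sketch below.  The
mirror name gm0 and provider ad6 are placeholders, not the real names
from this machine, and the DRY_RUN wrapper and function name are
hypothetical additions so the commands can be checked before running.

```shell
#!/bin/sh
# Sketch: drop the missing component from a gmirror, then add a disk back.
# DRY_RUN defaults to 1, so by default this only prints what it would do.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# $1 = mirror name, $2 = provider (disk) to insert
mirror_readd() {
    run gmirror forget "$1" &&
    run gmirror insert "$1" "$2" &&
    run gmirror status "$1"
}

# Example (placeholder names):
#   mirror_readd gm0 ad6
```

The final status call is only there to watch the rebuild progress.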



The system is already up and the filesystems mounted.  If the error in
question is of such severity that it would impact a user's ability to
reliably use the filesystem, how do you expect constant screaming on
the console will help?  A user won't know what it means; there is
already evidence of this happening (re: mysterious ATA DMA errors which
still cannot be figured out[6]).

IMHO, a dirty filesystem should not be mounted until it's been fully
analysed/scanned by fsck.  So again, people are putting faith into
UFS2+SU despite actual evidence proving that it doesn't handle all
scenarios.


I'll ask, but it seems like the consensus here is that background fsck, 
while the default, is best left disabled.  The cases where it might make 
sense are:


-desktop systems
-servers that have incredibly huge filesystems (and even there being able 
to selectively background fsck filesystems might be helpful)


The first example is obvious, people want a fast-booting desktop.  The 
second is trading long fsck times in single-user for some uncertainty.
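
For reference, disabling background fsck system-wide is a one-line
change.  This is a sketch assuming the stock rc.conf knob; it applies to
all filesystems and does nothing for the selective case mentioned above.

```shell
# /etc/rc.conf fragment: run fsck in the foreground at boot instead of
# in the background after the system is already up.
background_fsck="NO"
```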



The problem here is that when it was created, it was sort of an
experiment.  Now, when someone installs FreeBSD, UFS2 is the default
filesystem used, and SU are enabled on every filesystem except the root
fs.  Thus, we have now put ourselves into a situation where said
feature ***must*** be reliable in all cases.

You're also forgetting a huge focus of SU -- snapshots[1].  However, there
are more than enough facts on the table at this point concluding that
snapshots are causing more problems[7] than previously expected.  And
there's further evidence filesystem snapshots shouldn't even be used in
this way[8].


...


Filesystems have to be reliable; data integrity is focus #1, and cannot
be sacrificed.  Users and administrators *expect* a filesystem to be
reliable.  No one is going to keep using a filesystem if it has
disadvantages which can result in data loss or waste of administrative
time (which I believe is what's occurring here).


The softupdates question seems tied quite closely to the 

Re: Recommendations for servers running SATA drives

2008-09-27 Thread Jeremy Chadwick
On Sat, Sep 27, 2008 at 03:16:11PM -0400, Charles Sprickman wrote:
 On Fri, 26 Sep 2008, Jeremy Chadwick wrote:
 Let's be realistic.  We're talking about ATA and SATA hard disks, hooked
 up to on-board controllers -- these are the majority of users.  Those
 with ATA/SATA RAID controllers (not on-board RAID either; most/all of
 those do not let you disable drive write caching) *might* have a RAID
 BIOS menu item for disabling said feature.

 While I would love to deploy every server with SAS, that's not practical  
 in many cases, especially for light-duty servers that are not being 
 pushed very hard.  I am taking my chances with multiple affordable drives 
 and gmirror where I cannot throw in a 3Ware card.  I imagine that many  
 non-desktop FreeBSD users are doing the same considering you can fetch a  
 decent 1U box with plenty of storage for not much more than $1K.  I 
 assume many here are in agreement on this point -- just making it clear 
 that the bargain crowd is not some weird edge case in the userbase...

I'm in full agreement here.  As much as I love SCSI (and I sincerely do)
it's (IMHO unjustifiably) overpriced, simply because it can be.  You'd
expect the price of SCSI to decrease over the years, but it hasn't; it's
become part of a niche market, primarily intended for large businesses
with cash to blow.  As I said, I love SCSI, the protocol is excellent,
and it's very well-supported all over the place -- and though I have
no personal experience with SAS, it appears to be equally excellent,
yet the price is comparable to SCSI.

Even at my place of work we use SATA disks in our filers.  I suppose this
is justified in the sense that a disk failure there will be less painful
than it would be in a single or dual-disk server, so saving money is
legitimate since RAID-5 (or whatever) is in use.  But with regards to
our server boxes, either single or dual SATA disks are now being used,
rather than SCSI.  I haven't asked our datacenter and engineering folks
why we've switched, but gut feeling says saving money

 Regardless of all of this, end-users should, in no way shape or form,
 be expected to go to great lengths to disable their disk's write cache.
 They will not, I can assure you.  Thus, we must assume: write caching
 on a disk will be enabled, period.  If a filesystem is engineered with
 that fact ignored, then the filesystem is either 1) worthless, or 2)
 serves a very niche purpose and should not be the default filesystem.

 Arguments about defaults aside, this is my first question.  If I've got
 a server with multiple SATA drives mirrored with gmirror, is turning on  
 write-caching a good idea?  What kind of performance impact should I  
 expect?  What is the relationship between caching, soft-updates, and  
 either NCQ or TCQ?

 Here's an example of a Seagate, trimmed for brevity:

 Protocol              Serial ATA v1.0
 device model          ST3160811AS

 Feature                      Support  Enable    Value           Vendor
 write cache                    yes      yes
 read ahead                     yes      yes
 Native Command Queuing (NCQ)   yes       -      31/0x1F
 Tagged Command Queuing (TCQ)   no       no      31/0x1F

 TCQ is clearly not supported, NCQ seems to be supported, but I don't know 
 how to tell if it's actually enabled or not.  Write-caching is currently  
 on.

Actually, no -- FreeBSD ata(4) does not support NCQ.  I believe there
are some unofficial patches (or even a PR) floating around which are for
testing, but out of the box, it lacks support.  The hyphen you see under
the Enable column is supposed to signify that (I feel it's badly placed;
it should say "notsupp" or "unsupp" or something like that.  A hyphen is
too vague).

The NCQ support patches might require AHCI as well, I forget.  It's been
a while.

 The tradeoff is apparently performance vs. more reliable recovery should  
 the machine lose power, smoke itself, etc., but all I've seen is 
 anecdotal evidence of how bad performance gets.

 FWIW, this machine in particular had its mainboard go up in smoke last
 week.  One drive was too far gone for gmirror to rebuild it without doing
 a forget and insert.  The remaining drive was too screwy for
 background fsck, but a manual check in single-user left me with no real
 surprises or problems.

As long as the array rebuilt fine, I believe small quirks are
acceptable.  Scenarios where the array *doesn't* rebuild properly when a
new disk is added are of great concern (and in the case of some features
such as Intel MatrixRAID, the FreeBSD bugs are so severe that you are
liable to lose data in such scenarios.  MatrixRAID != gmirror, of
course).

This also leads me a little off-topic -- when it comes to disk
replacements, administrators want to be able to do this without taking
the system down.  There are problems with this, but it often depends
greatly on hardware and BIOS configuration.

I've successfully done a hot-swap (hardware: SATA hot-swap backplane,