Re: disk flipped - a known problem?

2013-01-21 Thread Fabian Keil
Andriy Gapon a...@freebsd.org wrote:

 Today something unusual happened on one of my machines:
 kernel: (ada0:ahcich0:0:0:0): lost device
 kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
 kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
 kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
 kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
 kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
 kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
 kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected
 flags 0x18 refcount 1
 kernel: adaasync: Unable to attach to new device due to status 0x6

I believe I saw something similar when trying to forcefully
end the cam lockups reported in:
http://lists.freebsd.org/pipermail/freebsd-current/2012-October/037413.html

Detaching the disc drive caused /dev/cd0 to disappear as expected,
but reinserting the drive didn't bring cd0 back.

 It looks like the disk disappeared from the bus and then re-appeared on the
 bus, but not to the OS.
 
 One of the partitions that the disk hosted was a swap partition and it seems
 to be the cause of some of the following consequences.
 
 The consequences:
[...]
 * geom_event thread started consuming 100% of CPU in g_wither_washer()

This sounds familiar as well:
http://www.freebsd.org/cgi/query-pr.cgi?pr=171865

Fabian




Re: disk flipped - a known problem?

2013-01-21 Thread Christian Gusenbauer
Hi!

On Sunday 20 January 2013 20:00:15 Andriy Gapon wrote:
 Today something unusual happened on one of my machines:
 kernel: (ada0:ahcich0:0:0:0): lost device
 kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
 kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
 kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
 kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
 kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
 kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
 kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected
 flags 0x18 refcount 1
 kernel: adaasync: Unable to attach to new device due to status 0x6
 
 It looks like the disk disappeared from the bus and then re-appeared on the
 bus, but not to the OS.
 
 One of the partitions that the disk hosted was a swap partition and it
 seems to be the cause of some of the following consequences.
 
 The consequences:
 
 * ZFS properly noticed disappearance of the disk, but its diagnostic was a
 little bit misleading:
 
   pool: pond
  state: DEGRADED
 status: One or more devices has been removed by the administrator.
 Sufficient replicas exist for the pool to continue functioning in a
 degraded state.
 action: Online the device using 'zpool online' or replace the device with
 'zpool replace'.
   scan: scrub repaired 0 in 8h55m with 0 errors on Sat Dec 22 12:06:30 2012
 config:
 
 NAME                                            STATE     READ WRITE CKSUM
 pond                                            DEGRADED     0     0     0
   mirror-0                                      DEGRADED     0     0     0
     12725235722288301230                        REMOVED      0     0     0  was /dev/gptid/fcf3558b-493b-11de-a8b9-001cc08221ff
     gptid/48782c6e-8fbd-11de-b3e1-00241d20d446  ONLINE       0     0     0
 
 Yes, I agree that the disk got removed/lost, but disagree that the
 administrator did it.
 
 * geom_event thread started consuming 100% of CPU in g_wither_washer()
 
 * /dev/ada0 disappeared but camcontrol devlist still reported ada0:
 <ST3500410AS CC34>  at scbus0 target 0 lun 0 (pass0,ada0)
 
 * As seen in the system messages, CAM layer refused to re-attach the disk
 
 * gpart command would just crash
 
 
 So, I can explain the behavior of the geom_event thread - apparently
 swapgeom_orphan doesn't do anything that is really meaningful to GEOM and
 so g_wither_washer is stuck waiting until the swap consumer goes away
 (drops its access bits).
 
 (Another sad thing about this state is that I couldn't swapoff the device,
 because there was no device entry.)
 
 I am not sure if the attempt to re-allocate valid device failure was
 caused by this, but it could be, if something in CAM layer was waiting for
 GEOM layer to be done with the disk.
 
 It would be nice if the swap code properly supported disappearance of the
 underlying disks.  Especially in this case, where the swap was actually
 never used / touched at all (a few hours after reboot on a completely idle
 system).

I don't know if it's related, but my new 2 TB WD Green hard disk has also
vanished three times during the last couple of weeks. Some people over at
hackers@ told me that this might be due to bad blocks on the disk, but
unfortunately (or luckily?) none of the SMART tests found any errors :-(.
So I wonder whether it's a hardware or a software problem. It happened on
9.1-STABLE while I was copying data from/to that hard disk (UFS).

Ciao,
Christian.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


disk flipped - a known problem?

2013-01-20 Thread Andriy Gapon

Today something unusual happened on one of my machines:
kernel: (ada0:ahcich0:0:0:0): lost device
kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected
flags 0x18 refcount 1
kernel: adaasync: Unable to attach to new device due to status 0x6

It looks like the disk disappeared from the bus and then re-appeared on the bus,
but not to the OS.
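
A rough way to spot this pattern in a saved log is sketched below in Python. The regular expressions are assumptions fitted only to the messages quoted above, not a description of every CAM message format, and the helper name find_stuck_devices is made up for illustration.

```python
import re

# Patterns fitted to the log lines quoted above (an assumption, not a CAM spec).
LOST = re.compile(r"\((\w+):\w+:\d+:\d+:\d+\): lost device")
REJECTED = re.compile(r"attempt to re-allocate valid device (\w+) rejected")


def find_stuck_devices(lines):
    """Return devices that were reported lost and later refused re-allocation."""
    lost = set()
    stuck = []
    for line in lines:
        m = LOST.search(line)
        if m:
            lost.add(m.group(1))
            continue
        m = REJECTED.search(line)
        if m and m.group(1) in lost and m.group(1) not in stuck:
            stuck.append(m.group(1))
    return stuck


sample = [
    "kernel: (ada0:ahcich0:0:0:0): lost device",
    "kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout",
    "kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected",
    "kernel: adaasync: Unable to attach to new device due to status 0x6",
]
print(find_stuck_devices(sample))
```

A device that shows up in the result was dropped and then refused on re-probe, which is the "re-appeared on the bus, but not to the OS" situation.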

One of the partitions that the disk hosted was a swap partition and it seems to
be the cause of some of the following consequences.

The consequences:

* ZFS properly noticed disappearance of the disk, but its diagnostic was a
little bit misleading:

  pool: pond
 state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
  scan: scrub repaired 0 in 8h55m with 0 errors on Sat Dec 22 12:06:30 2012
config:

NAME                                            STATE     READ WRITE CKSUM
pond                                            DEGRADED     0     0     0
  mirror-0                                      DEGRADED     0     0     0
    12725235722288301230                        REMOVED      0     0     0  was /dev/gptid/fcf3558b-493b-11de-a8b9-001cc08221ff
    gptid/48782c6e-8fbd-11de-b3e1-00241d20d446  ONLINE       0     0     0

Yes, I agree that the disk got removed/lost, but disagree that the
administrator did it.

* geom_event thread started consuming 100% of CPU in g_wither_washer()

* /dev/ada0 disappeared but camcontrol devlist still reported ada0:
<ST3500410AS CC34>  at scbus0 target 0 lun 0 (pass0,ada0)

* As seen in the system messages, CAM layer refused to re-attach the disk

* gpart command would just crash


So, I can explain the behavior of the geom_event thread - apparently
swapgeom_orphan doesn't do anything that is really meaningful to GEOM and so
g_wither_washer is stuck waiting until the swap consumer goes away (drops its
access bits).
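
The stuck state described above can be pictured with a toy model in Python. This is only a sketch of the reasoning, not the GEOM code: the Consumer class, the wither() function and the max_passes limit are all invented for illustration, and the real g_wither_washer works on kernel structures that are not modelled here.

```python
class Consumer:
    """Toy stand-in for a GEOM consumer that holds a provider open."""

    def __init__(self, name, handles_orphan):
        self.name = name
        self.handles_orphan = handles_orphan
        self.access = 1  # non-zero means the provider is still held open

    def orphan(self):
        # A cooperating consumer drops its access counts when orphaned;
        # one that ignores the event keeps holding the provider.
        if self.handles_orphan:
            self.access = 0


def wither(consumers, max_passes=5):
    """Return the pass on which the provider could be freed, or None."""
    for c in consumers:
        c.orphan()
    for passes in range(1, max_passes + 1):
        # Each pass re-checks whether every consumer has let go.
        if all(c.access == 0 for c in consumers):
            return passes
    return None  # still held: the real thread would keep retrying


print(wither([Consumer("zfs", True)]))
print(wither([Consumer("zfs", True), Consumer("swap", False)]))
```

In this model a single consumer that never drops its access count is enough to keep the retry loop going indefinitely, which matches the geom_event thread spinning at 100% CPU.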

(Another sad thing about this state is that I couldn't swapoff the device,
because there was no device entry.)

I am not sure if the attempt to re-allocate valid device failure was caused by
this, but it could be, if something in CAM layer was waiting for GEOM layer to
be done with the disk.

It would be nice if the swap code properly supported disappearance of the
underlying disks.  Especially in this case, where the swap was actually never
used / touched at all (a few hours after reboot on a completely idle system).

-- 
Andriy Gapon