Re: disk flipped - a known problem?
Andriy Gapon a...@freebsd.org wrote:

> Today something unusual happened on one of my machines:
> kernel: (ada0:ahcich0:0:0:0): lost device
> kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
> kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
> kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
> kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
> kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected flags 0x18 refcount 1
> kernel: adaasync: Unable to attach to new device due to status 0x6

I believe I saw something similar when trying to forcefully end
the cam lockups reported in:
http://lists.freebsd.org/pipermail/freebsd-current/2012-October/037413.html

Detaching the disc drive caused /dev/cd0 to disappear as expected,
but reinserting the drive didn't bring cd0 back.

> It looks like the disk disappeared from the bus and then re-appeared on the
> bus, but not to the OS.
> One of the partitions that the disk hosted was a swap partition and it seems
> to be the cause of some of the following consequences.
>
> The consequences:
[...]
> * geom_event thread started consuming 100% of CPU in g_wither_washer()

This sounds familiar as well:
http://www.freebsd.org/cgi/query-pr.cgi?pr=171865

Fabian
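To compare logs from different machines for the same sequence (a device reported lost, then its re-allocation rejected), a short script can scan the kernel messages. This is only a minimal sketch: the two patterns are copied from the messages quoted in this thread, while the function name `find_stuck_devices` and the `SAMPLE` list are made up for illustration, and other driver versions may word these messages differently.

```python
import re

# Patterns taken from the kernel messages quoted in this thread.
LOST = re.compile(r"\((\w+):\w+:\d+:\d+:\d+\): lost device")
REJECTED = re.compile(
    r"cam_periph_alloc: attempt to re-allocate valid device (\w+) rejected"
)


def find_stuck_devices(lines):
    """Return devices that were reported lost and later refused re-allocation."""
    lost = set()
    stuck = []
    for line in lines:
        m = LOST.search(line)
        if m:
            lost.add(m.group(1))
            continue
        m = REJECTED.search(line)
        if m and m.group(1) in lost and m.group(1) not in stuck:
            stuck.append(m.group(1))
    return stuck


# Hypothetical sample input: a subset of the lines quoted above.
SAMPLE = [
    "kernel: (ada0:ahcich0:0:0:0): lost device",
    "kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout",
    "kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 "
    "rejected flags 0x18 refcount 1",
    "kernel: adaasync: Unable to attach to new device due to status 0x6",
]

if __name__ == "__main__":
    print(find_stuck_devices(SAMPLE))
```

The same function can be fed the lines of a saved log file instead of `SAMPLE`. A match only says that the log shows the same message sequence, not that the underlying cause is the same.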
Re: disk flipped - a known problem?
Hi!

On Sunday 20 January 2013 20:00:15 Andriy Gapon wrote:

> Today something unusual happened on one of my machines:
> kernel: (ada0:ahcich0:0:0:0): lost device
> kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
> kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
> kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
> kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
> kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected flags 0x18 refcount 1
> kernel: adaasync: Unable to attach to new device due to status 0x6
>
> It looks like the disk disappeared from the bus and then re-appeared on the
> bus, but not to the OS.
> One of the partitions that the disk hosted was a swap partition and it seems
> to be the cause of some of the following consequences.
>
> The consequences:
> * ZFS properly noticed disappearance of the disk, but its diagnostic was a
>   little bit misleading:
>
>     pool: pond
>    state: DEGRADED
>   status: One or more devices has been removed by the administrator.
>           Sufficient replicas exist for the pool to continue functioning in a
>           degraded state.
>   action: Online the device using 'zpool online' or replace the device with
>           'zpool replace'.
>     scan: scrub repaired 0 in 8h55m with 0 errors on Sat Dec 22 12:06:30 2012
>   config:
>
>           NAME                                            STATE     READ WRITE CKSUM
>           pond                                            DEGRADED     0     0     0
>             mirror-0                                      DEGRADED     0     0     0
>               12725235722288301230                        REMOVED      0     0     0  was /dev/gptid/fcf3558b-493b-11de-a8b9-001cc08221ff
>               gptid/48782c6e-8fbd-11de-b3e1-00241d20d446  ONLINE       0     0     0
>
>   Yes, I agree that the disk got removed/lost, but disagree that the
>   administrator did it.
>
> * geom_event thread started consuming 100% of CPU in g_wither_washer()
> * /dev/ada0 disappeared but camcontrol devlist still reported ada0:
>   <ST3500410AS CC34>  at scbus0 target 0 lun 0 (pass0,ada0)
> * As seen in the system messages, CAM layer refused to re-attach the disk
> * gpart command would just crash
>
> So, I can explain the behavior of the geom_event thread - apparently
> swapgeom_orphan doesn't do anything that is really meaningful to GEOM and so
> g_wither_washer is stuck waiting until the swap consumer goes away (drops its
> access bits).
> (Another sad thing about this state is that I couldn't swapoff the device,
> because there was no device entry.)
> I am not sure if the "attempt to re-allocate valid device" failure was caused
> by this, but it could be, if something in CAM layer was waiting for GEOM layer
> to be done with the disk.
>
> It would be nice if the swap code properly supported disappearance of the
> underlying disks. Especially in this case where the swap was actually never
> used / touched at all (few hours after reboot and completely idle system).

I don't know if it's related, but my new 2 TB WD Green hard disk vanished
three times during the last couple of weeks, too. Some people over on hackers@
told me that this might be due to bad blocks on the disk, but unfortunately
(or luckily?) none of the SMART tests found any errors :-(. So I wonder
whether this is a hardware or a software problem. It happened on 9.1-STABLE
while I was copying data from/to that hard disk (UFS).

Ciao,
Christian.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
disk flipped - a known problem?
Today something unusual happened on one of my machines:
kernel: (ada0:ahcich0:0:0:0): lost device
kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected flags 0x18 refcount 1
kernel: adaasync: Unable to attach to new device due to status 0x6

It looks like the disk disappeared from the bus and then re-appeared on the
bus, but not to the OS.
One of the partitions that the disk hosted was a swap partition and it seems
to be the cause of some of the following consequences.

The consequences:
* ZFS properly noticed disappearance of the disk, but its diagnostic was a
  little bit misleading:

    pool: pond
   state: DEGRADED
  status: One or more devices has been removed by the administrator.
          Sufficient replicas exist for the pool to continue functioning in a
          degraded state.
  action: Online the device using 'zpool online' or replace the device with
          'zpool replace'.
    scan: scrub repaired 0 in 8h55m with 0 errors on Sat Dec 22 12:06:30 2012
  config:

          NAME                                            STATE     READ WRITE CKSUM
          pond                                            DEGRADED     0     0     0
            mirror-0                                      DEGRADED     0     0     0
              12725235722288301230                        REMOVED      0     0     0  was /dev/gptid/fcf3558b-493b-11de-a8b9-001cc08221ff
              gptid/48782c6e-8fbd-11de-b3e1-00241d20d446  ONLINE       0     0     0

  Yes, I agree that the disk got removed/lost, but disagree that the
  administrator did it.

* geom_event thread started consuming 100% of CPU in g_wither_washer()
* /dev/ada0 disappeared but camcontrol devlist still reported ada0:
  <ST3500410AS CC34>  at scbus0 target 0 lun 0 (pass0,ada0)
* As seen in the system messages, CAM layer refused to re-attach the disk
* gpart command would just crash

So, I can explain the behavior of the geom_event thread - apparently
swapgeom_orphan doesn't do anything that is really meaningful to GEOM and so
g_wither_washer is stuck waiting until the swap consumer goes away (drops its
access bits).
(Another sad thing about this state is that I couldn't swapoff the device,
because there was no device entry.)
I am not sure if the "attempt to re-allocate valid device" failure was caused
by this, but it could be, if something in CAM layer was waiting for GEOM layer
to be done with the disk.

It would be nice if the swap code properly supported disappearance of the
underlying disks. Especially in this case where the swap was actually never
used / touched at all (few hours after reboot and completely idle system).

-- 
Andriy Gapon
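The explanation above (a teardown that keeps retrying because one consumer never drops its access bits) can be illustrated with a toy model. This is a sketch under stated assumptions, not FreeBSD kernel code: the `Consumer` class, `orphan()` and `wither()` are hypothetical names invented here, and the access counts used in the example are illustrative values only.

```python
class Consumer:
    """Hypothetical stand-in for a GEOM consumer holding access counts."""

    def __init__(self, name, acr=0, acw=0, ace=0):
        self.name = name
        self.acr = acr  # read access count
        self.acw = acw  # write access count
        self.ace = ace  # exclusive access count

    def holds_access(self):
        return (self.acr, self.acw, self.ace) != (0, 0, 0)

    def orphan(self, release):
        # A consumer that handles orphaning drops its access counts;
        # one that ignores the event keeps them (the case described above).
        if release:
            self.acr = self.acw = self.ace = 0


def wither(consumers, max_passes=5):
    """Retry the teardown; return the pass on which it succeeded, else None.

    The real event thread has no pass limit, which is why it can spin.
    """
    for passes in range(1, max_passes + 1):
        if not any(c.holds_access() for c in consumers):
            return passes
    return None


if __name__ == "__main__":
    swap = Consumer("swap", acr=1, acw=1)  # illustrative counts
    swap.orphan(release=False)             # orphan event effectively ignored
    print(wither([swap]))                  # teardown never completes

    swap.orphan(release=True)              # consumer drops its access counts
    print(wither([swap]))                  # teardown completes on the first pass
```

In this model nothing changes between passes, so a consumer that ignores the orphan event keeps the loop busy indefinitely, which is consistent with the 100% CPU usage reported for the geom_event thread.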