smartctl question
Hi all,

one disk in a ZFS mirror failed permanently, throwing errors like

kernel: (ada5:ata10:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 84 (ICRC ABRT )

and the like. The pool itself continued working in a degraded state. smartctl showed a very high 199 UDMA_CRC_Error_Count value, which to my knowledge may indicate a broken cable. In this case a cable replacement did indeed solve the problem: the pool resilvered and all is fine.

Still, smartctl -a displays a value for 199 UDMA_CRC_Error_Count that I reckon to be way too high (3900). So does this value now include the errors from the previous broken cable? In other words, when, if at all, is the cache that smartmontools reads from flushed, so that the values reflect the status after fixing a hardware problem without swapping the disk?

Can someone please share some insight?

Thanks

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: smartctl question
Hi,

On 2012.11.09 12:18, H. Ingow wrote:
> Hi all,
> one single disk in a zfs mirror failed permanently throwing errors like
> kernel: (ada5:ata10:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 84 (ICRC ABRT )
> and alike. The pool itself continued working degraded, smartctl showed a
> very high 199 UDMA_CRC_Error_Count value, which to my knowledge may indicate
> a broken cable, in this case indeed a cable replacement solved the problem,
> the pool resilvered and all is fine.
> Still smartctl -a displays a value of 199 UDMA_CRC_Error_Count I reckon to be
> way too high, though (3900). So is this value now including errors from
> previous broken cable ?

I'm pretty sure it is. I don't think SMART attributes can vary in value both up and down; they seem to me like counters that can only be incremented.

> In other words, when, if at all, is the cache smartmontools read from flushed
> and values are to be taken as of the status after fixing a hardware problem
> but not swapping the disk ?

So, in my opinion, no.
Re: buildworld fails on recent stable
Thanks, looking forward to MFC!
Re: buildworld fails on recent stable
On Fri, Nov 09, 2012 at 12:28:11AM +0100, Dimitry Andric wrote:
> ...
> I have also looked at merging the snapshot of 3.2 we now have in head to
> stable/9, but it is also quite some work, so I found a better solution:
> I managed to shrink boot2 by enough bytes to make it fit again.
>
> I committed the change to head in r242804, and I will MFC it in 3 days,
> if there are no regressions reported. Meanwhile, please apply the
> attached patch.

Works for me -- I'm now running:

FreeBSD g1-227.catwhisker.org 9.1-PRERELEASE FreeBSD 9.1-PRERELEASE #292 242822M: Fri Nov 9 04:26:20 PST 2012 r...@g1-227.catwhisker.org:/usr/obj/usr/src/sys/CANARY i386

built with clang.

Peace,
david
-- 
David H. Wolfskill da...@catwhisker.org
Taliban: Evil men with guns afraid of truth from a 14-year old girl.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.
Re: mfi panic on recursed on non-recursive mutex MFI I/O lock
----- Original Message -----
From: Steven Hartland
...
> I've just had another panic, trace below, but it doesn't seem to be related
> to my changes, so I'd appreciate your feedback on them as they are for now.
>
> While the lock patch fixes the problems I've seen, it's not clear to me
> why mfi_tbolt_reset is acquiring the lock and hence requiring
> mfi_process_fw_state_chg_isr to jump through hoops to ensure locking
> around queue manipulation is done correctly. Given what it's doing
> (resetting the entire adapter) I wouldn't be surprised if it should
> really be acquiring the config lock.
>
> Other things I've noticed / questions
> * Should mfi_abort sleep even if its call to mfi_mapcmd fails?
> * Should mfi_get_controller_info really ignore the error from mfi_mapcmd?
> * Do these controllers not support non-512-byte requests? Currently
> all syspd requests are done assuming 512-byte sectors, which the disk may
> not have. This will either reduce performance or potentially break totally
> if the firmware isn't translating it under the surface correctly.
>
> Anyway, the new panic, manually transcribed, is:
> panic: Bad link elm 0xff0069b0fc0 next->prev != elm
> ...
> mfi_tbolt_get_cmd()
> mfi_build_mpt_pass_thru()
> mfi_tbolt_build_mpt_cmd()
> mfi_tbolt_send_frame()
> bus_dmamap_load()
> mfi_mapcmd()
> mfi_startio()
> mfi_syspd_strategy()
> g_disk_start()
> g_io_schedule_down()
> g_down_proc_body()
> fork_exit()
> fork_trampoline()
>
> Looks like mfi_cmd_tbolt_tqh has become corrupt somehow, but as far as I
> can tell all manipulation is done using the TAILQ macros and under
> mfi_io_lock, so it's not obvious to me at this time why this is. Any ideas?

I've gone through looking for the possible cause of this, and while there's nothing directly connected to the manipulation of this queue, I've found and fixed quite a large number of additional problems which may have been indirectly causing this problem.

The biggest change is to use mfi_max_cmds to limit the value stored in sc->mfi_max_fw_cmds. This value is used extensively throughout the driver for allocation and range checks, so having it set inconsistently opened up a large number of possible overrun errors.

The new patch attached documents all the changes in detail.

I've managed to do one test run so far, which failed to reproduce any panics, so definitely moving in the right direction :)

The machine has now been collected for repair by the supplier, but I'm going to try and get them to put it online for more testing over the weekend.

Given the failure rate so far, if I can do another 4 runs with no panics I'd be happy that the majority of error conditions are working as expected.

Regards
Steve

Attachment: zz-mfi-queue.patch
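For readers unfamiliar with the "next->prev != elm" panic text quoted above: it describes a consistency check on a doubly linked queue, where each element's successor must point back to it. The following is only a simplified Python model of that invariant, not the kernel's TAILQ macros (which differ in detail) and not anything taken from the mfi driver; the names Node, link and first_bad_link are hypothetical.

```python
class Node:
    """One element of a simplified doubly linked queue (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.prev = None
        self.next = None


def link(nodes):
    """Chain the given nodes together in order and return the head."""
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
        b.prev = a
    return nodes[0]


def first_bad_link(head):
    """Return the first node whose successor does not point back to it.

    Returns None when every next/prev pair is consistent.
    """
    node = head
    while node is not None and node.next is not None:
        if node.next.prev is not node:
            return node
        node = node.next
    return None


a, b, c = Node("a"), Node("b"), Node("c")
head = link([a, b, c])
assert first_bad_link(head) is None

# Simulate corruption: c's back pointer is overwritten behind the queue's back.
c.prev = a
bad = first_bad_link(head)
print(bad.name)
```

In this model the check flags the element whose successor no longer points back at it, which is the kind of state a stray write or an unlocked queue update could leave behind.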
Re: mfi panic on recursed on non-recursive mutex MFI I/O lock
On Fri, Nov 09, 2012 at 05:06:03PM, Steven Hartland wrote:
|
| ----- Original Message -----
| From: Steven Hartland
| ...
| I've just had another panic, trace below, but it doesn't seem to be related
| to my changes so I'd appreciate your feedback on them as they are for now.
|
| While the lock patch fixes the problems I've seen, its not clear to me
| why mfi_tbolt_reset is acquiring the lock and hence requiring
| mfi_process_fw_state_chg_isr to jump through hoops to ensure locking
| around queue manipulation is done correctly. Given what its doing
| (resetting the entire adapter) I wouldn't be surprised if it should
| really be acquiring the config lock.
|
| Other things I've noticed / questions
| * Should mfi_abort sleep even if its call to mfi_mapcmd fails?
| * Should mfi_get_controller_info really ignore the error from mfi_mapcmd?
| * Do these controllers not support non-512-byte requests? Currently
| all syspd requests are done assuming 512 byte sectors which the disk may
| not be. This will both reduce performance or potentially break totally
| if the firmware isn't translating it under the surface correctly.
|
| Anyway the new panic manually transcribed is:-
| panic: Bad link elm 0xff0069b0fc0 next->prev != elm
| ...
| mfi_tbolt_get_cmd()
| mfi_build_mpt_pass_thru()
| mfi_tbolt_build_mpt_cmd()
| mfi_tbolt_send_frame()
| bus_dmamap_load()
| mfi_mapcmd()
| mfi_startio()
| mfi_syspd_strategy()
| g_disk_start()
| g_io_schedule_down()
| g_down_proc_body()
| fork_exit()
| fork_trampoline()
|
| Looks like mfi_cmd_tbolt_tqh has become corrupt some how, but as far as I
| can tell all manip is done using the TAILQ macros and under mfi_io_lock
| so its not obvious to me at this time why this is, any ideas?
|
| I've gone through looking for the possible cause of this and while there's
| nothing directly connected to the manip of this queue I've found and fixed
| quite a large number of additional problems which may have been indirectly
| causing this problem.
|
| The biggest change is to use mfi_max_cmds to limit the value stored in
| sc->mfi_max_fw_cmds as this is used extensively throughout the driver
| for allocation and range checks so having this inconsistently set opened up
| a large number of possible overrun errors.
|
| The new patch attached documents all the changes in detail.
|
| I've managed to do one test run so far which failed to reproduce any panics,
| so definitely moving in the right direction :)
|
| The machine has now been collected for repair by the supplier but I'm going
| to try and get them to put it online for more testing over the weekend.
|
| Given the failure rate so far if I can do another 4 runs with no panics I'd
| be happy that the majority of error conditions are working as expected.

Sounds like you have made some good progress. I looked at your prior locking changes and they look good. I haven't had time to go through the queue changes yet.

Thanks,

Doug A.
Re: buildworld fails on recent stable
Thanks Dimitry. Confirmed successful build here, too.

-- James.
Re: smartctl question
On Fri, Nov 9, 2012 at 3:47 AM, Lucas B. Cohen <l...@bnrlabs.com> wrote:
> Hi,
>
> On 2012.11.09 12:18, H. Ingow wrote:
>> Hi all,
>> one single disk in a zfs mirror failed permanently throwing errors like
>> kernel: (ada5:ata10:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 84 (ICRC ABRT )
>> and alike. The pool itself continued working degraded, smartctl showed a
>> very high 199 UDMA_CRC_Error_Count value, which to my knowledge may indicate
>> a broken cable, in this case indeed a cable replacement solved the problem,
>> the pool resilvered and all is fine.
>> Still smartctl -a displays a value of 199 UDMA_CRC_Error_Count I reckon to be
>> way too high, though (3900). So is this value now including errors from
>> previous broken cable ?
>
> I'm pretty sure it is. I don't think SMART attributes can vary in value
> both up and down ; they seem to me like they're counters that can only get
> incremented.
>
>> In other words, when, if at all, is the cache smartmontools read from flushed
>> and values are to be taken as of the status after fixing a hardware problem
>> but not swapping the disk ?
>
> So, in my opinion no.

This is a problem with S.M.A.R.T. All stats are stored by the drive, in the drive, and the assumption is that all of the errors are caused by problems in the drive (and they usually are). But when they come from a cable problem, the drive never sees the problem as gone, so the counters never reset.

As long as you remember that you had a cable problem with that drive and that the count was 3900, you can discount it, or recognize a problem down the road if it starts increasing. I'd put it on a label stuck to the drive as a reminder that the count is off by 3900.

By the way, I believe that some stats do go up and down, but not counters. As in SNMP, counters are never supposed to be reset or resettable.
-- 
R. Kevin Oberman, Network Engineer
E-mail: kob6...@gmail.com
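The advice above, to record the counter once the cable is fixed and watch only for further increases, could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the SAMPLE text is made-up illustrative output in the column layout that smartctl -A prints for ATA drives, and the helper names raw_value and crc_errors_since are hypothetical, not part of smartmontools.

```python
import re

# Hypothetical sample of `smartctl -A` output; the values are made up
# for illustration and only the column layout matters here.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
194 Temperature_Celsius     0x0022   112   098   000    Old_age   Always       -       38
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3900
"""


def raw_value(smartctl_output, attr_id):
    """Return the integer RAW_VALUE of the attribute with this ID, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with the numeric ID.
        if len(fields) >= 10 and fields[0] == str(attr_id):
            match = re.match(r"\d+", fields[9])
            if match:
                return int(match.group())
    return None


def crc_errors_since(baseline, smartctl_output):
    """Number of CRC errors logged since the recorded baseline."""
    current = raw_value(smartctl_output, 199)
    if current is None:
        raise ValueError("attribute 199 not found in smartctl output")
    return current - baseline


# Baseline of 3900 is the count reported in this thread after the cable swap.
print(crc_errors_since(3900, SAMPLE))
```

On a live system the text would come from running smartctl -A against the device (for example /dev/ada5, as root) instead of SAMPLE; a result that stays at zero means the counter has stopped growing since the cable was replaced.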
Re: smartctl question
Hi!

> Still smartctl -a displays a value of 199 UDMA_CRC_Error_Count I reckon to be
> way too high, though (3900). So is this value now including errors from
> previous broken cable ?
> In other words, when, if at all, is the cache smartmontools read from flushed
> and values are to be taken as of the status after fixing a hardware problem
> but not swapping the disk ?

SMART values are stored in the drive, not in some cache on the system. The bad cable caused the drive to see errors. There is no way to reset the counters in the drive. So the error counter will stay at that value, but as long as it no longer increases, you're fine.

-- 
p...@opsec.eu    +49 171 3101372    8 years to go !
Re: smartctl question
If memory serves me right, Kevin Oberman wrote:
> By the way, I believe that some stats do go up and down, but not
> counters. Like in snmp, counters are never supposed to be reset or
> resettable.

Examples of values that go up and down (actually the only examples I can think of) are the drive temperature and airflow temperature. But AFAIK you're right about the counter values.

Bruce.