smartctl question

2012-11-09 Thread H. Ingow

Hi all,

One disk in a ZFS mirror failed permanently, throwing errors such as
kernel: (ada5:ata10:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 84
(ICRC ABRT ).
 
The pool itself kept working in a degraded state. smartctl showed a very
high 199 UDMA_CRC_Error_Count value, which to my knowledge may indicate a
broken cable. In this case a cable replacement did indeed solve the
problem; the pool resilvered and all is fine.

Still, smartctl -a displays a 199 UDMA_CRC_Error_Count value (3900) that
I reckon is way too high.
So does this value still include the errors from the previous broken cable?

In other words, when, if at all, is the cache that smartmontools reads
from flushed, so that the values reflect the status after fixing a
hardware problem without swapping the disk?
Can someone please share some insight?

thanks 

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: smartctl question

2012-11-09 Thread Lucas B. Cohen
Hi,

On 2012.11.09 12:18, H. Ingow wrote:
 
> Hi all,
>
> one single disk in a zfs mirror failed permanently throwing errors like
> kernel: (ada5:ata10:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 84
> (ICRC ABRT ) and alike.
>
> The pool itself continued working degraded, smartctl showed a very high
> 199 UDMA_CRC_Error_Count value, which to my knowledge may indicate a
> broken cable, in this case indeed a cable replacement solved the
> problem, the pool resilvered and all is fine.
>
> Still smartctl -a displays a value of 199 UDMA_CRC_Error_Count I reckon
> to be way too high, though ( 3900 ) .
> So is this value now including errors from previous broken cable ?

I'm pretty sure it is. I don't think SMART attributes can vary in value
both up and down; they seem to me like counters that can only get
incremented.

> In other words, when, if at all, is the cache smartmontools read from
> flushed and values are to be taken as of the status after fixing a
> hardware problem but not swapping the disk ?

So, in my opinion, no.


Re: buildworld fails on recent stable

2012-11-09 Thread Jakub Lach
Thanks, looking forward to MFC!





Re: buildworld fails on recent stable

2012-11-09 Thread David Wolfskill
On Fri, Nov 09, 2012 at 12:28:11AM +0100, Dimitry Andric wrote:
> ...
> I have also looked at merging the snapshot of 3.2 we now have in head to
> stable/9, but it is also quite some work, so I found a better solution:
> I managed to shrink boot2 by enough bytes to make it fit again.
>
> I committed the change to head in r242804, and I will MFC it in 3 days,
> if there are no regressions reported.  Meanwhile, please apply the
> attached patch.
 

Works for me -- I'm now running:

FreeBSD g1-227.catwhisker.org 9.1-PRERELEASE FreeBSD 9.1-PRERELEASE #292 
242822M: Fri Nov  9 04:26:20 PST 2012 
r...@g1-227.catwhisker.org:/usr/obj/usr/src/sys/CANARY  i386

built with clang.

Peace,
david
-- 
David H. Wolfskill  da...@catwhisker.org
Taliban: Evil men with guns afraid of truth from a 14-year old girl.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.




Re: mfi panic on recursed on non-recursive mutex MFI I/O lock

2012-11-09 Thread Steven Hartland


- Original Message - 
From: Steven Hartland

...

I've just had another panic, trace below, but it doesn't seem to be related
to my changes so I'd appreciate your feedback on them as they are for now.

While the lock patch fixes the problems I've seen, it's not clear to me
why mfi_tbolt_reset is acquiring the lock and hence requiring
mfi_process_fw_state_chg_isr to jump through hoops to ensure locking
around queue manipulation is done correctly. Given what it's doing
(resetting the entire adapter) I wouldn't be surprised if it should
really be acquiring the config lock.

Other things I've noticed / questions:
* Should mfi_abort sleep even if its call to mfi_mapcmd fails?
* Should mfi_get_controller_info really ignore the error from mfi_mapcmd?
* Do these controllers not support non-512-byte requests? Currently
all syspd requests are done assuming 512-byte sectors, which the disk may
not have. This will either reduce performance or potentially break totally
if the firmware isn't translating it under the surface correctly.

Anyway, the new panic, manually transcribed, is:
panic: Bad link elm 0xff0069b0fc0 next->prev != elm
...
mfi_tbolt_get_cmd()
mfi_build_mpt_pass_thru()
mfi_tbolt_build_mpt_cmd()
mfi_tbolt_send_frame()
bus_dmamap_load()
mfi_mapcmd()
mfi_startio()
mfi_syspd_strategy()
g_disk_start()
g_io_schedule_down()
g_down_proc_body()
fork_exit()
fork_trampoline()

Looks like mfi_cmd_tbolt_tqh has become corrupted somehow, but as far as I
can tell all manipulation is done using the TAILQ macros and under
mfi_io_lock, so it's not obvious to me at this time why this is. Any ideas?
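
For readers unfamiliar with the check behind that panic message, here is a minimal sketch in Python of the invariant it reports: an element's successor must point back at that element. This is a simplified model and not the driver's code; the real TAILQ macros store a pointer to the previous element's next field, and the names Node, link_chain, remove and next_prev_ok are hypothetical.

```python
class Node:
    """Element of a simplified doubly linked queue (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.next = None
        self.prev = None


def link_chain(nodes):
    """Link nodes in order, setting both next and prev references."""
    for left, right in zip(nodes, nodes[1:]):
        left.next = right
        right.prev = left


def next_prev_ok(elm):
    """True if elm is last, or if its successor points back at elm."""
    return elm.next is None or elm.next.prev is elm


def remove(elm):
    """Unlink elm from its neighbours, leaving elm's own pointers stale."""
    if elm.prev is not None:
        elm.prev.next = elm.next
    if elm.next is not None:
        elm.next.prev = elm.prev


def demo():
    """Return the names of elements that violate the invariant."""
    a, b, c = Node("a"), Node("b"), Node("c")
    link_chain([a, b, c])
    assert all(next_prev_ok(n) for n in (a, b, c))
    remove(b)  # b is off the queue, but b.next still references c
    # Any path that still holds b and walks from it now fails the check.
    return [n.name for n in (a, b, c) if not next_prev_ok(n)]
```

In this model the queue itself stays consistent after the removal; only the stale element fails the check. That is one way such a panic can appear even when every individual queue operation is done under the lock, though whether it applies to this driver is only a guess.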


I've gone through looking for the possible cause of this, and while there's
nothing directly connected to the manipulation of this queue, I've found and
fixed quite a large number of additional problems which may have been
indirectly causing this problem.

The biggest change is to use mfi_max_cmds to limit the value stored in
sc->mfi_max_fw_cmds, as this is used extensively throughout the driver
for allocation and range checks, so having it inconsistently set opened up
a large number of possible overrun errors.
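
As a rough illustration of why one clamped limit matters, here is a small Python model. It is not the driver code: CommandPool and effective_max_cmds are hypothetical names, and it assumes mfi_max_cmds acts as a configured upper bound on what the firmware reports.

```python
def effective_max_cmds(fw_reported, configured_limit):
    """Clamp the firmware-reported command count to the configured limit."""
    return min(fw_reported, configured_limit)


class CommandPool:
    """Toy model: allocation and range checks share one stored limit."""

    def __init__(self, fw_reported, configured_limit):
        # Store the clamped value once and use it everywhere.
        self.max_fw_cmds = effective_max_cmds(fw_reported, configured_limit)
        self.slots = [None] * self.max_fw_cmds

    def index_valid(self, index):
        """Range check against the same limit used for allocation."""
        return 0 <= index < self.max_fw_cmds


def unclamped_check_overruns(fw_reported, configured_limit, index):
    """True if a check against the raw firmware value would accept an
    index that lies outside an array sized by the configured limit."""
    allocated = min(fw_reported, configured_limit)
    return 0 <= index < fw_reported and index >= allocated
```

With made-up numbers, a firmware value of 1008 and a limit of 128 give 128 slots; a check against the raw 1008 would accept index 500, while the clamped check rejects it.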

The new patch attached documents all the changes in detail.

I've managed to do one test run so far which failed to reproduce any panics,
so definitely moving in the right direction :)

The machine has now been collected for repair by the supplier but I'm going
to try and get them to put it online for more testing over the weekend.

Given the failure rate so far if I can do another 4 runs with no panics I'd
be happy that the majority of error conditions are working as expected.

   Regards
   Steve



zz-mfi-queue.patch
Description: Binary data

Re: mfi panic on recursed on non-recursive mutex MFI I/O lock

2012-11-09 Thread Doug Ambrisko
On Fri, Nov 09, 2012 at 05:06:03PM -, Steven Hartland wrote:
| 
| - Original Message - 
| From: Steven Hartland
| ...
| I've just had another panic, trace below, but it doesn't seem to be related
| to my changes so I'd appreciate your feedback on them as they are for now.
| 
| While the lock patch fixes the problems I've seen, it's not clear to me
| why mfi_tbolt_reset is acquiring the lock and hence requiring
| mfi_process_fw_state_chg_isr to jump through hoops to ensure locking
| around queue manipulation is done correctly. Given what it's doing
| (resetting the entire adapter) I wouldn't be surprised if it should
| really be acquiring the config lock.
| 
| Other things I've noticed / questions
| * Should mfi_abort sleep even if its call to mfi_mapcmd fails?
| * Should mfi_get_controller_info really ignore the error from mfi_mapcmd?
| * Do these controllers not support non-512-byte requests? Currently
| all syspd requests are done assuming 512-byte sectors, which the disk may
| not have. This will either reduce performance or potentially break totally
| if the firmware isn't translating it under the surface correctly.
| 
| Anyway, the new panic, manually transcribed, is:
| panic: Bad link elm 0xff0069b0fc0 next->prev != elm
| ...
| mfi_tbolt_get_cmd()
| mfi_build_mpt_pass_thru()
| mfi_tbolt_build_mpt_cmd()
| mfi_tbolt_send_frame()
| bus_dmamap_load()
| mfi_mapcmd()
| mfi_startio()
| mfi_syspd_strategy()
| g_disk_start()
| g_io_schedule_down()
| g_down_proc_body()
| fork_exit()
| fork_trampoline()
| 
| Looks like mfi_cmd_tbolt_tqh has become corrupted somehow, but as far as I
| can tell all manipulation is done using the TAILQ macros and under
| mfi_io_lock, so it's not obvious to me at this time why this is. Any ideas?
| 
| I've gone through looking for the possible cause of this, and while there's
| nothing directly connected to the manipulation of this queue, I've found and
| fixed quite a large number of additional problems which may have been
| indirectly causing this problem.
| 
| The biggest change is to use mfi_max_cmds to limit the value stored in
| sc->mfi_max_fw_cmds, as this is used extensively throughout the driver
| for allocation and range checks, so having it inconsistently set opened up
| a large number of possible overrun errors.
| 
| The new patch attached documents all the changes in detail.
| 
| I've managed to do one test run so far which failed to reproduce any panics,
| so definitely moving in the right direction :)
| 
| The machine has now been collected for repair by the supplier but I'm going
| to try and get them to put it online for more testing over the weekend.
| 
| Given the failure rate so far if I can do another 4 runs with no panics I'd
| be happy that the majority of error conditions are working as expected.

Sounds like you have made some good progress.  I looked at your prior locking
changes and they look good.  I haven't had time to go through the queue
changes yet.

Thanks,

Doug A.


Re: buildworld fails on recent stable

2012-11-09 Thread James
Thanks Dimitry. Confirmed successful build here, too.

--
James.


Re: smartctl question

2012-11-09 Thread Kevin Oberman
On Fri, Nov 9, 2012 at 3:47 AM, Lucas B. Cohen l...@bnrlabs.com wrote:
> Hi,
>
> On 2012.11.09 12:18, H. Ingow wrote:
>
> > Hi all,
> >
> > one single disk in a zfs mirror failed permanently throwing errors like
> > kernel: (ada5:ata10:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 84
> > (ICRC ABRT ) and alike.
> >
> > The pool itself continued working degraded, smartctl showed a very high
> > 199 UDMA_CRC_Error_Count value, which to my knowledge may indicate a
> > broken cable, in this case indeed a cable replacement solved the
> > problem, the pool resilvered and all is fine.
> >
> > Still smartctl -a displays a value of 199 UDMA_CRC_Error_Count I reckon
> > to be way too high, though ( 3900 ) .
> > So is this value now including errors from previous broken cable ?
>
> I'm pretty sure it is. I don't think SMART attributes can vary in value
> both up and down ; they seem to me like they're counters that can only
> get incremented.
>
> > In other words, when, if at all, is the cache smartmontools read from
> > flushed and values are to be taken as of the status after fixing a
> > hardware problem but not swapping the disk ?
> So, in my opinion no.

This is a problem with S.M.A.R.T. All stats are stored by the drive, in
the drive, and the assumption is that all errors are caused by
problems in the drive (and they usually are). But when they come from a
cable problem, the drive never sees the problem as gone, so the
counters never reset. As long as you remember that you had a cable
problem with that drive and that the attribute 199 count stood at about
3900, you can discount it, or recognize a problem down the road if it
starts increasing. I'd put it on a label stuck to the drive as a lasting
reminder that the count is off by about 3900.

By the way, I believe that some stats do go up and down, but not
counters. As in SNMP, counters are never supposed to be reset or
resettable.
-- 
R. Kevin Oberman, Network Engineer
E-mail: kob6...@gmail.com


Re: smartctl question

2012-11-09 Thread Kurt Jaeger
Hi!

> Still smartctl -a displays a value of 199 UDMA_CRC_Error_Count I reckon
> to be way too high, though ( 3900 ) .
> So is this value now including errors from previous broken cable ?
>
> In other words, when, if at all, is the cache smartmontools read from
> flushed and values are to be taken as of the status after fixing a
> hardware problem but not swapping the disk ?

SMART values are stored in the drive, not in some cache on the system.

The bad cable caused the drive to see errors. There is no way
to reset the counters in the drive.

So the error counter will stay at that value, but as long as it
no longer increases, you're fine.
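
One way to watch for further increases is to note the raw value after the cable swap and compare later readings against it. The following is only a sketch in Python, assuming the usual ten-column attribute table printed by `smartctl -A`; the sample row is illustrative rather than real output from the drive in question, and the function names are made up for this example.

```python
import subprocess
from typing import Optional

UDMA_CRC_ERROR_COUNT = 199  # SMART attribute ID discussed in this thread


def raw_value(attributes_text: str, attr_id: int) -> Optional[int]:
    """Return the RAW_VALUE column for attr_id, or None if it is absent."""
    for line in attributes_text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID and have ten columns.
        if len(fields) >= 10 and fields[0] == str(attr_id):
            return int(fields[9])
    return None


def new_errors(attributes_text: str, baseline: int) -> int:
    """Errors counted since the baseline noted after the cable swap."""
    current = raw_value(attributes_text, UDMA_CRC_ERROR_COUNT)
    if current is None:
        raise ValueError("attribute 199 not found in smartctl output")
    return current - baseline


def read_attributes(device: str) -> str:
    """Run `smartctl -A` on device; needs smartmontools and root access."""
    # smartctl's exit status is a bit mask, so it is not treated as fatal.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return result.stdout


# Illustrative sample row, not real output from the poster's drive.
SAMPLE = (
    "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE"
    "      UPDATED  WHEN_FAILED RAW_VALUE\n"
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age"
    "   Always       -       3900\n"
)

if __name__ == "__main__":
    print(new_errors(SAMPLE, baseline=3900))
```

On the affected machine, something like `new_errors(read_attributes("/dev/ada5"), baseline=3900)` should stay at 0 for as long as the new cable is healthy.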

-- 
p...@opsec.eu  +49 171 3101372  8 years to go !


Re: smartctl question

2012-11-09 Thread Bruce A. Mah
If memory serves me right, Kevin Oberman wrote:

> By the way, I believe that some stats do go up and down, but not
> counters. Like in snmp, counters are never supposed to be reset or
> resettable.

Examples of values that go up and down (actually the only examples I can
think of) are the drive temperature and airflow temperature.  But AFAIK
you're right about the counter values.

Bruce.



