On 28.07.2013 03:08, Dieter BSD wrote:
Bob writes:
After a few hours of a database-like workload

A faster way to trigger the problem would be useful.

We're actually more interested in archive type workloads than this
database workload and we have not observed the problem with an archive
workload.

So perhaps something about the timing triggers the bug?

Sam writes:
if you have a script or a way to build a kernel to help debug this I will
run it if you post it here... I have the same issue on a 3 port multiplier
using -HEAD

Can you share the make and model of this 3 port multiplier?
If it is happening with more than one model of pm, it is more likely
some generic problem, rather than triggering some model-specific quirk/bug.
Has anyone seen this problem with an older OS release? (say 7.x or 8.x?)
If the problem was introduced recently, we might be able to find it
by looking at what changed in the source code. I haven't seen the
problem with 8.2 or earlier.

Looks like a verbose boot will give a little more info.
But I suspect that adding more log(9) statements will be needed.
Unless mav has a better idea?

There are two sides to this problem: the original issue and imperfect error recovery. The first is a big open question: I can't say what is actually going on there that causes the problem. Just recently I made one more attempt to get documentation on SATA controllers from Marvell. But even after I signed an NDA, the process stalled again, since I am not a vendor buying thousands of their chips and they do not provide support for end users. The situation is similar with other vendors.

As for the recovery, the problem is that neither CAM nor the mvs driver currently tracks the faulty status of devices. So if some disk's firmware gets stuck and starts causing endless timeouts, that will substantially disrupt the operation of the other devices sharing that SATA port. The mechanism for dropping a faulty device could probably be improved somehow.
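
To make the idea of tracking faulty status concrete, here is a minimal sketch of per-device timeout accounting. It is plain userland C, not CAM or mvs code: the structure, the function names and the threshold of three consecutive timeouts are all hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold; not a value taken from CAM or mvs(4). */
#define PM_MAX_CONSECUTIVE_TIMEOUTS 3

/* Hypothetical per-device health record, one per disk behind the PM. */
struct pm_dev_health {
	unsigned consecutive_timeouts;	/* timeouts since last success */
	bool	 dropped;		/* true once the device is given up on */
};

void
pm_dev_health_init(struct pm_dev_health *h)
{
	assert(h != NULL);
	h->consecutive_timeouts = 0;
	h->dropped = false;
}

/*
 * Record the outcome of one command.
 * Returns true if the device may still be used, false if it has been
 * dropped and should no longer be allowed to stall its neighbours.
 */
bool
pm_dev_report(struct pm_dev_health *h, bool timed_out)
{
	assert(h != NULL);
	if (h->dropped)
		return (false);		/* stays dropped until re-initialised */
	if (!timed_out) {
		h->consecutive_timeouts = 0;	/* any success clears the streak */
		return (true);
	}
	if (++h->consecutive_timeouts >= PM_MAX_CONSECUTIVE_TIMEOUTS)
		h->dropped = true;
	return (!h->dropped);
}
```

A caller would invoke pm_dev_report() on every command completion and stop issuing commands to a device once it returns false. Counting only consecutive timeouts is a deliberate simplification, so that a disk with occasional isolated timeouts is not dropped.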

As for SAS, mentioned here: that is a quite different, more expensive market. And even though the protocols there are much more sophisticated, and the hardware, firmware and software are much better tested, situations still sometimes happen where a single misbehaving device can bring down the whole fabric.

--
Alexander Motin
_______________________________________________
freebsd-hardware@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hardware
To unsubscribe, send any mail to "freebsd-hardware-unsubscr...@freebsd.org"
