2018-12-26 22:26, Terry Kennedy wrote:
The earlier LSI P20 releases were pretty flakey in some cases - try
flashing 20.00.07.00.


Indeed.

I upgraded the LSI SAS2308 firmware from 20.00.02.00 to 20.00.07.00
a week ago, left it running for a while with 11.2, then upgraded again
to 12.0, and the controller is stable now, even with the new mps driver
that came with 12.0.

To recap:

- mps driver from FreeBSD 11.2 and earlier is stable with SAS2308 firmware
  20.00.02.00 _and_ 20.00.07.00

- mps driver from FreeBSD 12.0 causes frequent controller resets
  with SAS2308 firmware 20.00.02.00 (and ZFS can't cope with that),
  but is stable with 20.00.07.00.
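Since stability splits cleanly on the firmware version, a quick pre-upgrade
check can flag controllers still on the older firmware. This is only a
sketch: fw_at_least is a hypothetical helper, and it assumes versions of
exactly four dotted numeric components, as in 20.00.07.00.

```shell
#!/bin/sh
# Hypothetical helper: compare two dotted firmware versions such as
# 20.00.02.00; succeeds (exit 0) when $1 >= $2.
fw_at_least() {
    # Zero-pad each component so a plain string comparison is correct.
    # (Components like "08"/"09" would trip printf's octal parsing;
    # fine for the versions discussed here.)
    a=$(printf '%03d%03d%03d%03d' $(echo "$1" | tr '.' ' '))
    b=$(printf '%03d%03d%03d%03d' $(echo "$2" | tr '.' ' '))
    [ "$a" = "$b" ] || [ "$a" \> "$b" ]
}

# 20.00.07.00 is the first release reported stable with the 12.0 mps driver.
if fw_at_least "20.00.02.00" "20.00.07.00"; then
    echo "firmware OK for 12.0"
else
    echo "upgrade SAS2308 firmware before moving to 12.0"
fi
```

On the machine itself, the running version can be read off the
"Firmware Revision" line of mpsutil show adapter (mpsutil is in the
FreeBSD base system, as shown further down in this thread).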

Mark




2018-12-17 16:52, Mark Martinec wrote:
One of our servers that was upgraded from 11.2 to 12.0 (to RC2 initially,
then to RC3, and lastly to 12.0-RELEASE) is suffering severe instability
of a disk controller, which resets itself a couple of times a day, usually
under high disk load (such as poudriere builds, zfs scrub, or nightly file
system scans). The same setup was rock-solid under 11.2 (and still/again is).

The disk controller is LSI SAS2308. It has four disks attached as JBODs, one pair of SSDs and one pair of hard disks, each pair forming its own zpool.
A controller reset can occur regardless of which pair is in heavy use.

The following can be found in the logs just before the machine becomes
unusable (although it is not always logged, as the disks may be dropped
before syslog has a chance to write anything):

  xxx kernel: [2382] mps0: IOC Fault 0x40000d04, Resetting
  xxx kernel: [2382] mps0: Reinitializing controller
  xxx kernel: [2383] mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd
  xxx kernel: [2383] mps0: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
  xxx kernel: [2383] (da0:mps0:0:0:0): Invalidating pack

The IOC Fault location is always the same. Apparently the disk controller
resets, all disk devices are dropped, and ZFS finds itself with no disks.
The machine still responds to ping, and if one is logged in during the
event and running 'zpool status -v 1', zfs reports loss of all devices
for each pool:

  pool: data0
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub repaired 0 in 0 days 03:53:41 with 0 errors on Sat Nov 17 00:22:38 2018
config:

        NAME                      STATE     READ WRITE CKSUM
        data0                     UNAVAIL      0     0     0
          mirror-0                UNAVAIL      0    24     0
            2396428274137360341   REMOVED      0     0     0  was /dev/gpt/da2-PN1334PCKAKD4S
            16738407333921736610  REMOVED      0     0     0  was /dev/gpt/da3-PN2338P4GJ1XYC

(and similar for the other pool)

At this point the machine is unusable and needs to be hard-reset.

My guess is that after the controller resets, the disk devices come back up
(the console report shows 'periph destroyed' first, then full info on each
disk), but zfs ignores them.

I don't see any mention of changes to the mps driver in the 12.0 release
notes, although diffing its sources between 11.2 and 12.0 shows plenty of
nontrivial changes.

After suffering this instability for some time, I finally downgraded the OS
to 11.2, and things are back to normal again!

This downgrade path was nontrivial, as I had foolishly upgraded the pool
features to what comes with 12.0, so downgrading involved dismantling both
zfs mirror pools, recreating the pools without the two new features, and
copying the data over with zfs send/receive, all while the machine would
hang during some of these operations. Not something for the faint of heart.
I know, it was foolish of me to upgrade the pools after just one day of
uptime with 12.0.
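In outline, the downgrade amounted to something like the following. This is
a sketch only, not the exact commands I ran: the pool and gpt label names
match the zpool status output above, but the feature names are merely
example placeholders; the real set to enable is whatever subset the 11.2
zpool supports.

```shell
# Detach one side of the mirror to free a disk for the new pool.
zpool detach data0 gpt/da3-PN2338P4GJ1XYC

# Create the replacement pool with all feature flags disabled (-d),
# then enable only an 11.2-compatible subset (example names only).
zpool create -d data0new gpt/da3-PN2338P4GJ1XYC
zpool set feature@async_destroy=enabled data0new
zpool set feature@lz4_compress=enabled data0new

# Copy all datasets over via a recursive snapshot.
zfs snapshot -r data0@migrate
zfs send -R data0@migrate | zfs recv -F data0new

# Retire the old pool and re-attach its disk to restore the mirror.
zpool destroy data0
zpool attach data0new gpt/da3-PN2338P4GJ1XYC gpt/da2-PN1334PCKAKD4S
```

The same has to be repeated for the second pool, and of course any step can
be interrupted by the controller resets described above.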

Some info on the controller:

kernel: mps0: <Avago Technologies (LSI) SAS2308> port 0xf000-0xf0ff mem 0xfbe40000-0xfbe4ffff,0xfbe00000-0xfbe3ffff irq 64 at device 0.0 numa-domain 1 on pci11
kernel: mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd

mpsutil shows:

  mps0 Adapter:
    Board Name: LSI2308-IT
    Board Assembly:
    Chip Name: LSISAS2308
    Chip Revision: ALL
    BIOS Revision: 7.39.00.00
    Firmware Revision: 20.00.02.00
    Integrated RAID: no


So, what has changed in the mps driver for this to be happening?
Would it be possible to take the mps driver sources from 11.2, transplant
them into 12.0, recompile, and use that? Could the new mps driver be
using some new feature of the controller and hitting a firmware bug?
I have resisted upgrading the SAS2308 firmware and its BIOS, as it is
working very well under 11.2.
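The transplant experiment could look roughly like this. All paths and the
kernel config name are assumptions, and whether the 11.2 driver even
compiles against the 12.0 kernel interfaces would have to be discovered by
trying:

```shell
# Assumed layout: a 12.0 source tree in /usr/src and an 11.2 tree
# checked out in /usr/src-11.2; kernel config GENERIC.
cp -a /usr/src/sys/dev/mps /usr/src/sys/dev/mps.12-orig   # keep a copy
rm -rf /usr/src/sys/dev/mps
cp -a /usr/src-11.2/sys/dev/mps /usr/src/sys/dev/mps

cd /usr/src
make -j8 buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
# Reboot and check the "mps0: Firmware: ..., Driver: ..." line in dmesg.
```

Even if it builds, a driver taken across a major release may rely on kernel
KPIs that changed, so this would be a diagnostic experiment rather than a
fix.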

Anyone else seen problems with mps driver and LSI SAS2308 controller?

(btw, on another machine the mps driver with LSI SAS2004 is working
just fine under 12.0)

  Mark
_______________________________________________
[email protected] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[email protected]"
