FreeBSD CI Weekly Report 2019-02-03

2019-02-05 Thread Li-Wen Hsu
(bcc -current and -stable for more audience)

FreeBSD CI Weekly Report 2019-02-03
===

Here is a summary of the FreeBSD Continuous Integration results for
the period from 2019-01-28 to 2019-02-03.

During this period, we have:

* 2220 builds (93.7% passed, 6.2% failed, 0.1% exception) were
executed on aarch64, amd64, armv6, armv7, i386, mips, mips64, powerpc,
powerpc64, powerpcspe, riscv64, sparc64 architectures for building
world, GENERIC and LINT kernel of head, stable/12, stable/11 branches.
* 565 test runs (25.7% passed, 69.2% unstable, 5.1% exception) were
executed on amd64, i386, riscv64 architectures for head, stable/12,
stable/11 branches.
* 7 doc builds (85.8% passed, 14.2% failed)

If any of the issues found by CI are in your area of interest or
expertise please investigate the PRs listed below.

The web version of this report is available at
https://hackmd.io/s/B1CMKdJNE and the archive is available at
http://hackfoldr.org/freebsd-ci-report/. Any help is welcome.

## Failing Jobs

* https://ci.freebsd.org/job/FreeBSD-head-amd64-dtrace_test/
  after r343713 (Enable COVERAGE and KCOV by default on arm64 and
amd64.), the test VM exits while executing dtrace test cases.

## Fixed Jobs
* https://ci.freebsd.org/job/FreeBSD-head-amd64-gcc/8765/
* Fixed/worked around by
* https://svnweb.freebsd.org/changeset/base/343670
* https://svnweb.freebsd.org/changeset/base/343671
* https://svnweb.freebsd.org/changeset/base/343672
* See also: https://bugs.freebsd.org/130067

## Failing Tests

* https://ci.freebsd.org/job/FreeBSD-head-amd64-test/
* lib.libc.sys.sendfile_test.hdtr_positive_v4
* lib.libc.sys.sendfile_test.hdtr_positive_v6
  see https://bugs.freebsd.org/235200 and
https://bugs.freebsd.org/234809 for details

* https://ci.freebsd.org/job/FreeBSD-head-amd64-test_zfs/
* There are 61 (-3 since last report) failing cases, see
https://ci.freebsd.org/job/FreeBSD-head-amd64-test_zfs/lastCompletedBuild/testReport/
for more details

* https://ci.freebsd.org/job/FreeBSD-head-i386-test/
* sys.netmap.ctrl-api-test.main
* sys.opencrypto.runtests.main
* lib.libc.regex.exhaust_test.regcomp_too_big
* lib.libregex.exhaust_test.regcomp_too_big
* sys.kern.coredump_phnum_test.coredump_phnum
  WIP: https://reviews.freebsd.org/D18495
* lib.libc.sys.sendfile_test.hdtr_positive_v4
* lib.libc.sys.sendfile_test.hdtr_positive_v6
  see https://bugs.freebsd.org/235200 and
https://bugs.freebsd.org/234809 for details

* https://ci.freebsd.org/job/FreeBSD-stable-12-i386-test/
* sbin.bectl.bectl_test.bectl_mount
* sys.netmap.ctrl-api-test.main
* sys.opencrypto.runtests.main
* lib.libc.regex.exhaust_test.regcomp_too_big
* lib.libregex.exhaust_test.regcomp_too_big
* sys.kern.coredump_phnum_test.coredump_phnum
  WIP: https://reviews.freebsd.org/D18495

* https://ci.freebsd.org/job/FreeBSD-stable-11-amd64-test/
* usr.bin.procstat.procstat_test.kernel_stacks

* https://ci.freebsd.org/job/FreeBSD-stable-11-i386-test/
* sys.netmap.ctrl-api-test.main
* sys.opencrypto.runtests.main
* usr.bin.procstat.procstat_test.environment
* usr.bin.procstat.procstat_test.kernel_stacks
* local.kyua.* (31 cases)
* local.lutok.* (3 cases)

## Disabled Tests

* lib.libc.sys.mmap_test.mmap_truncate_signal
  https://bugs.freebsd.org/211924
* sys.fs.tmpfs.mount_test.large
  https://bugs.freebsd.org/212862
* sys.fs.tmpfs.link_test.kqueue
  https://bugs.freebsd.org/213662
* sys.kqueue.libkqueue.kqueue_test.main
  https://bugs.freebsd.org/233586
* usr.bin.procstat.procstat_test.command_line_arguments
  https://bugs.freebsd.org/233587
* usr.bin.procstat.procstat_test.environment
  https://bugs.freebsd.org/233588
* lib.msun.{cbrt_test.cbrtl_powl,trig_test.reduction}
  https://bugs.freebsd.org/234040

## Open Issues

### Causing build failures

* [29: genassym.o build race](https://bugs.freebsd.org/29)
* [233735: Possible build race: genoffset.o /usr/src/sys/sys/types.h:
error: machine/endian.h: No such file or
directory](https://bugs.freebsd.org/233735)
* [233769: Possible build race: ld: error: unable to find library
-lgcc_s](https://bugs.freebsd.org/233769)

### Others
[Tickets related to testing@](https://preview.tinyurl.com/y9maauwg)

## Closed Issues

* [235097: ci runs panic with use-after-free when running
sys/netpfil/pf/nat tests](https://bugs.freebsd.org/235097)
* patch committed (https://svnweb.freebsd.org/changeset/base/343418) and
  MFC to 12 (https://svnweb.freebsd.org/changeset/base/343652) and
  11 (https://svnweb.freebsd.org/changeset/base/343653)
* [235411: sys.netpfil.pf.fragmentation.v6 panics after
r343631](https://bugs.freebsd.org/235411)
* Fixed in https://svnweb.freebsd.org/changeset/base/343678

## Other News

* Facebook's zstd has FreeBSD CI integrated with Cirrus CI:
https://github.com/facebook/zstd/pull/1501

* We have a job that runs a lint check on the doc with

Re: 9211 (LSI/SAS) issues on 11.2-STABLE

2019-02-05 Thread Karl Denninger
BTW under 12.0-STABLE (built this afternoon after the advisories came
out, with the patches) it's MUCH worse.  I get the same device resets
BUT it's followed by an immediate panic which I cannot dump as it
generates a page-fault (supervisor read data, page not present) in the
mps *driver* at mpssas_send_abort+0x21.

This precludes a dump of course since attempting to do so gives you a
double-panic (I was wondering why I didn't get a crash dump!); I'll
re-jigger the box to stick a dump device on an internal SATA device so I
can successfully get the dump when it happens and see if I can obtain a
proper crash dump on this.

I think it's fair to assume that 12.0-STABLE should not panic on a disk
problem (unless of course the problem is trying to page something back
in -- it's not, the drive that aborts and resets is on a data pack doing
a scrub)

On 2/5/2019 10:26, Karl Denninger wrote:
> On 2/5/2019 09:22, Karl Denninger wrote:
>> On 2/2/2019 12:02, Karl Denninger wrote:
>>> I recently started having some really oddball things  happening under
>>> stress.  This coincided with the machine being updated to 11.2-STABLE
>>> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1.
>>>
>>> Specifically, I get "errors" like this:
>>>
>>>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>>> length 131072 SMID 269 Aborting command 0xfe0001179110
>>> mps0: Sending reset from mpssas_send_abort for target ID 37
>>>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>>> length 131072 SMID 924 terminated ioc 804b loginfo 3114 scsi 0 state
>>> c xfer 0
>>>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>>> length 131072 SMID 161 terminated ioc 804b loginfo 3114 scsi 0 state
>>> c xfer 0
>>> mps0: Unfreezing devq for target ID 37
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>>> (da12:mps0:0:37:0): Retrying command
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: Command timeout
>>> (da12:mps0:0:37:0): Retrying command
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>>> (da12:mps0:0:37:0): Retrying command
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: SCSI Status Error
>>> (da12:mps0:0:37:0): SCSI status: Check Condition
>>> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
>>> reset, or bus device reset occurred)
>>> (da12:mps0:0:37:0): Retrying command (per sense data)
>>>
>>> The "Unit Attention" implies the drive reset.  It only occurs on certain
>>> drives under very heavy load (e.g. a scrub.)  I've managed to provoke it
>>> on two different brands of disk across multiple firmware and capacities,
>>> however, which tends to point away from a drive firmware problem.
>>>
>>> A look at the pool data shows /no /errors (e.g. no checksum problems,
>>> etc) and a look at the disk itself (using smartctl) shows no problems
>>> either -- whatever is going on here the adapter is recovering from it
>>> without any data corruption or loss registered on *either end*!
>>>
>>> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:
>>>
>>> mps0:  port 0xc000-0xc0ff mem
>>> 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
>>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities:
>>> 1285c
>> After considerable additional work this looks increasingly like either a
>> missed interrupt or a command is getting lost between the host adapter
>> and the expander.
>>
>> I'm going to turn the driver debug level up and see if I can capture
>> more information. whatever is behind this, however, it is
>> almost-certainly related to something that changed between 11.1 and
>> 11.2, as I never saw these on the 11.1-STABLE build.
>>
>> --
>> Karl Denninger
>> k...@denninger.net 
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/
> Pretty decent trace here -- any ideas?
>
> mps0: timedout cm 0xfe00011b5020 allocated tm 0xfe00011812a0
>     (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
> length 131072 SMID 634 Aborting command 0xfe00011b5020
> mps0: Sending reset from mpssas_send_abort for target ID 37
> mps0: queued timedout cm 0xfe00011c2760 for processing by tm
> 0xfe00011812a0
> mps0: queued timedout cm 0xfe00011a74f0 for processing by tm
> 0xfe00011812a0
> mps0: queued timedout cm 0xfe00011cfd50 for processing by tm
> 0xfe00011812a0
> mps0: EventReply    :
>     EventDataLength: 2
>     AckRequired: 0
>     Event: SasDiscovery (0x16)
>     EventContext: 0x0
>     Flags: 1
>     ReasonCode: Discovery Started
>     PhysicalPort: 0
>     DiscoveryStatus: 0
> mps0: 

Re: [FreeBSD-Announce] FreeBSD Errata Notice FreeBSD-EN-19:05.kqueue

2019-02-05 Thread Eugene Grosbein
06.02.2019 3:55, Ian Lepore wrote:

> So your problem was most likely the gps receiver making a bad choice
> before it had enough info to make a good choice. It's one of many
> reasons why an ntp server should have at least 3 (really, at least 5)
> peers, so it can reject obviously-insane data from a single source.
> Even when you use a gps to get really accurate local time, you should
> have a handful of network peers that can serve as sanity-checkers.

And I have that, and had it at that moment:

driftfile /var/db/ntpd.drift
server Time2.Stupi.SE   iburst maxpoll 9
server ntp1.sp.se   iburst maxpoll 9
server ntp1.mmo.netnod.se   iburst maxpoll 9
server ntp1.ptb.de  iburst maxpoll 9
server ntp1.ien.it  iburst maxpoll 9
server ntp1.sth.netnod.se   iburst maxpoll 9
server 127.127.1.0
fudge 127.127.1.0 stratum 10
# Had to comment out following 3 lines after the incident
#tos mindist 0.015
#server 127.127.20.1 mode 1 iburst maxpoll 9 prefer
#fudge 127.127.20.1 stratum 10 time1 0.000 time2 0.000 flag1 1 flag3 1 refid PPS
pool 0.freebsd.pool.ntp.org iburst
restrict -4 default kod notrap nomodify nopeer noquery
restrict -6 default kod notrap nomodify nopeer noquery
restrict 127.0.0.1
restrict ::1
leapfile "/var/db/ntpd.leap-seconds.list"

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: [FreeBSD-Announce] FreeBSD Errata Notice FreeBSD-EN-19:05.kqueue

2019-02-05 Thread Ian Lepore
On Wed, 2019-02-06 at 03:46 +0700, Eugene Grosbein wrote:
> 06.02.2019 3:18, Ian Lepore wrote:
> 
> > > 2019, of course.  re@ does NOT make mistakes.  What you fail to
> > > realize is that NIST was using kqueue to check their atomic
> > > clock, and
> > > they lost the race.  Enjoy the rest of 2020.
> > > -Alan
> > > 
> > 
> > I think you meant that as a joke, but the reality is that NIST
> > measures
> > their atomic clocks using gear that runs FreeBSD (made by the
> > company I
> > work for). :)
> 
> I do not know if it is related or not: some months ago my FreeBSD
> 11.2-STABLE box
> having GPS received attached at /dev/cuau0 for my local ntpd stratum
> 1 server
> went to late of 2020 insanely. I was forced to comment GPS out of
> ntpd config to revive it
> but I lost all data in hundreds of local RRD databases and
> I found a race in libarchive being a reason why my backups had not
> most part of databases.
> 
> I still do not know exact reason and use Internet time source instead
> of local GPS.

The GPS week number rolls over in April 2019. At $work we have already
been seeing glitches in gps receivers as early as last June that were
caused by errors in the receivers' firmware that didn't handle the
upcoming rollover properly.

When a receiver first powers on, it has no real idea what gps era it's
in (right now we're in the 2nd, about to roll over to the 3rd). It has
to guess, which it mostly does the same way as software does that has
to deal with 2-digit years: make a decision based on the current date
being before/after some cutoff (like > 70 means 2070), and assume
everyone will be running newer firmware before that date arrives.
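
The era guess described above can be sketched in a few lines. This is only an illustration of the pivot-date idea, not any receiver's actual firmware: the function name `resolve_gps_week` and the choice of pivot dates are assumptions made for the example. It relies on two known facts: GPS week 0 began on 1980-01-06, and the broadcast week number is 10 bits wide, so it wraps every 1024 weeks.

```python
from datetime import date, timedelta

GPS_EPOCH = date(1980, 1, 6)   # first day of GPS week 0
WEEKS_PER_ERA = 1024           # broadcast week number is 10 bits wide


def resolve_gps_week(truncated_week, pivot_date):
    """Map a 10-bit week number to a calendar date.

    Hypothetical helper: picks the earliest week that matches
    truncated_week and does not fall before pivot_date (for example,
    the firmware build date).
    """
    pivot_week = (pivot_date - GPS_EPOCH).days // 7
    era = pivot_week // WEEKS_PER_ERA
    full_week = era * WEEKS_PER_ERA + truncated_week
    if full_week < pivot_week:
        # The candidate lies before the pivot, so assume the next era.
        full_week += WEEKS_PER_ERA
    return GPS_EPOCH + timedelta(weeks=full_week)


# A recent pivot resolves week 0 to the April 2019 rollover.
print(resolve_gps_week(0, date(2015, 1, 1)))    # 2019-04-07
# A stale pivot resolves the same week number 1024 weeks too early.
print(resolve_gps_week(0, date(1999, 8, 22)))   # 1999-08-22
```

The second call shows why old firmware misbehaves around a rollover: the same broadcast value is valid in every era, and only the pivot decides which one is chosen.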

So your problem was most likely the gps receiver making a bad choice
before it had enough info to make a good choice. It's one of many
reasons why an ntp server should have at least 3 (really, at least 5)
peers, so it can reject obviously-insane data from a single source.
Even when you use a gps to get really accurate local time, you should
have a handful of network peers that can serve as sanity-checkers.
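
To illustrate why several peers matter, here is a deliberately simplified sketch. It is not ntpd's real selection algorithm; it just keeps the sources whose offset is close to the median. The function name `reject_falsetickers`, the source names, and the 0.128 second tolerance are all assumptions for the example.

```python
from statistics import median


def reject_falsetickers(offsets, tolerance=0.128):
    """Keep sources whose offset (seconds) is within tolerance of the median.

    Simplified stand-in for real clock selection, for illustration only.
    """
    mid = median(offsets.values())
    return {name: off for name, off in offsets.items()
            if abs(off - mid) <= tolerance}


# One insane GPS reading among four sane network peers is outvoted.
many = {"gps": 5.0e7, "peer1": 0.002, "peer2": -0.001,
        "peer3": 0.004, "peer4": 0.000}
print(sorted(reject_falsetickers(many)))   # ['peer1', 'peer2', 'peer3', 'peer4']

# With only two sources there is no majority, so neither can be trusted.
print(reject_falsetickers({"gps": 5.0e7, "peer1": 0.002}))   # {}
```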

-- Ian



Re: [FreeBSD-Announce] FreeBSD Errata Notice FreeBSD-EN-19:05.kqueue

2019-02-05 Thread Eugene Grosbein
06.02.2019 3:18, Ian Lepore wrote:

>> 2019, of course.  re@ does NOT make mistakes.  What you fail to
>> realize is that NIST was using kqueue to check their atomic clock, and
>> they lost the race.  Enjoy the rest of 2020.
>> -Alan
>>
> 
> I think you meant that as a joke, but the reality is that NIST measures
> their atomic clocks using gear that runs FreeBSD (made by the company I
> work for). :)

I do not know if it is related or not: some months ago my FreeBSD 11.2-STABLE
box, which has a GPS receiver attached at /dev/cuau0 for my local ntpd stratum 1
server, jumped to late 2020. I was forced to comment the GPS out of the ntpd
config to revive it, but I lost all data in hundreds of local RRD databases, and
I found that a race in libarchive was the reason my backups were missing most of
the databases.

I still do not know the exact reason and use an Internet time source instead of
the local GPS.


Re: [FreeBSD-Announce] FreeBSD Errata Notice FreeBSD-EN-19:05.kqueue

2019-02-05 Thread Ian Lepore
On Tue, 2019-02-05 at 13:00 -0700, Alan Somers wrote:
> On Tue, Feb 5, 2019 at 12:55 PM Shawn Webb  wrote:
> > 
> > On Wed, Jan 09, 2019 at 07:40:30PM +, FreeBSD Errata Notices wrote:
> > > -BEGIN PGP SIGNED MESSAGE-
> > > Hash: SHA512
> > > 
> > > =
> > > FreeBSD-EN-19:05.kqueue Errata 
> > > Notice
> > >   The FreeBSD 
> > > Project
> > > 
> > > Topic:  kqueue race condition and kernel panic
> > > 
> > > Category:   core
> > > Module: kqueue
> > > Announced:  2019-01-09
> > > Credits:Mark Johnston
> > > Affects:FreeBSD 11.2
> > > Corrected:  2019-11-24 17:11:47 UTC (stable/11, 11.2-STABLE)
> > 
> > Corrected November of 2018 or 2019? ;)
> 
> 2019, of course.  re@ does NOT make mistakes.  What you fail to
> realize is that NIST was using kqueue to check their atomic clock, and
> they lost the race.  Enjoy the rest of 2020.
> -Alan
> 

I think you meant that as a joke, but the reality is that NIST measures
their atomic clocks using gear that runs FreeBSD (made by the company I
work for). :)

-- Ian




Re: [FreeBSD-Announce] FreeBSD Errata Notice FreeBSD-EN-19:05.kqueue

2019-02-05 Thread Alan Somers
On Tue, Feb 5, 2019 at 12:55 PM Shawn Webb  wrote:
>
> On Wed, Jan 09, 2019 at 07:40:30PM +, FreeBSD Errata Notices wrote:
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA512
> >
> > =
> > FreeBSD-EN-19:05.kqueue Errata 
> > Notice
> >   The FreeBSD 
> > Project
> >
> > Topic:  kqueue race condition and kernel panic
> >
> > Category:   core
> > Module: kqueue
> > Announced:  2019-01-09
> > Credits:Mark Johnston
> > Affects:FreeBSD 11.2
> > Corrected:  2019-11-24 17:11:47 UTC (stable/11, 11.2-STABLE)
>
> Corrected November of 2018 or 2019? ;)

2019, of course.  re@ does NOT make mistakes.  What you fail to
realize is that NIST was using kqueue to check their atomic clock, and
they lost the race.  Enjoy the rest of 2020.
-Alan


Re: [FreeBSD-Announce] FreeBSD Errata Notice FreeBSD-EN-19:05.kqueue

2019-02-05 Thread Shawn Webb
On Wed, Jan 09, 2019 at 07:40:30PM +, FreeBSD Errata Notices wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA512
> 
> =
> FreeBSD-EN-19:05.kqueue Errata Notice
>   The FreeBSD Project
> 
> Topic:  kqueue race condition and kernel panic
> 
> Category:   core
> Module: kqueue
> Announced:  2019-01-09
> Credits:Mark Johnston
> Affects:FreeBSD 11.2
> Corrected:  2019-11-24 17:11:47 UTC (stable/11, 11.2-STABLE)

Corrected November of 2018 or 2019? ;)

-- 
Shawn Webb
Cofounder and Security Engineer
HardenedBSD

Tor-ified Signal:+1 443-546-8752
Tor+XMPP+OTR:latt...@is.a.hacker.sx
GPG Key ID:  0x6A84658F52456EEE
GPG Key Fingerprint: 2ABA B6BD EF6A F486 BE89  3D9E 6A84 658F 5245 6EEE




Re: 9211 (LSI/SAS) issues on 11.2-STABLE

2019-02-05 Thread Karl Denninger
On 2/5/2019 09:22, Karl Denninger wrote:
> On 2/2/2019 12:02, Karl Denninger wrote:
>> I recently started having some really oddball things  happening under
>> stress.  This coincided with the machine being updated to 11.2-STABLE
>> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1.
>>
>> Specifically, I get "errors" like this:
>>
>>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>> length 131072 SMID 269 Aborting command 0xfe0001179110
>> mps0: Sending reset from mpssas_send_abort for target ID 37
>>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>> length 131072 SMID 924 terminated ioc 804b loginfo 3114 scsi 0 state
>> c xfer 0
>>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>> length 131072 SMID 161 terminated ioc 804b loginfo 3114 scsi 0 state
>> c xfer 0
>> mps0: Unfreezing devq for target ID 37
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>> (da12:mps0:0:37:0): Retrying command
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: Command timeout
>> (da12:mps0:0:37:0): Retrying command
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>> (da12:mps0:0:37:0): Retrying command
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: SCSI Status Error
>> (da12:mps0:0:37:0): SCSI status: Check Condition
>> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
>> reset, or bus device reset occurred)
>> (da12:mps0:0:37:0): Retrying command (per sense data)
>>
>> The "Unit Attention" implies the drive reset.  It only occurs on certain
>> drives under very heavy load (e.g. a scrub.)  I've managed to provoke it
>> on two different brands of disk across multiple firmware and capacities,
>> however, which tends to point away from a drive firmware problem.
>>
>> A look at the pool data shows /no /errors (e.g. no checksum problems,
>> etc) and a look at the disk itself (using smartctl) shows no problems
>> either -- whatever is going on here the adapter is recovering from it
>> without any data corruption or loss registered on *either end*!
>>
>> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:
>>
>> mps0:  port 0xc000-0xc0ff mem
>> 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>> mps0: IOCCapabilities:
>> 1285c
> After considerable additional work this looks increasingly like either a
> missed interrupt or a command is getting lost between the host adapter
> and the expander.
>
> I'm going to turn the driver debug level up and see if I can capture
> more information. whatever is behind this, however, it is
> almost-certainly related to something that changed between 11.1 and
> 11.2, as I never saw these on the 11.1-STABLE build.
>
> --
> Karl Denninger
> k...@denninger.net 
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/
Pretty decent trace here -- any ideas?

mps0: timedout cm 0xfe00011b5020 allocated tm 0xfe00011812a0
    (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
length 131072 SMID 634 Aborting command 0xfe00011b5020
mps0: Sending reset from mpssas_send_abort for target ID 37
mps0: queued timedout cm 0xfe00011c2760 for processing by tm
0xfe00011812a0
mps0: queued timedout cm 0xfe00011a74f0 for processing by tm
0xfe00011812a0
mps0: queued timedout cm 0xfe00011cfd50 for processing by tm
0xfe00011812a0
mps0: EventReply    :
    EventDataLength: 2
    AckRequired: 0
    Event: SasDiscovery (0x16)
    EventContext: 0x0
    Flags: 1
    ReasonCode: Discovery Started
    PhysicalPort: 0
    DiscoveryStatus: 0
mps0: (0)->(mpssas_fw_work) Working on  Event: [16]
mps0: (1)->(mpssas_fw_work) Event Free: [16]
    (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00
length 131072 SMID 961 completed timedout cm 0xfe00011cfd50 ccb
0xf8019458e000 during recovery ioc 804b scsi 0 state c
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00 length
131072 SMID 961 terminated ioc 804b loginfo 3114 scsi 0 state c xfer 0
    (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
length 131072 SMID 634 completed timedout cm
0xfe00011b5(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00
01 00 00
020 ccb 0xf80056fb5000 during recovery ioc 8048 scsi 0 state
c(da11:mps0:0:37:0): CAM status: CCB request completed with an error
(da11:mps0:0:37:0): Retrying command
    (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00
length 131072 SMID 798 completed timedout cm 0xfe00011c2760 ccb
0xf80054e86000 during recovery ioc 804b scsi 0 state 
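
For reading the CDB bytes in these log lines, a small decoder can help. This is a minimal sketch assuming the standard READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2 to 5, a 2-byte transfer length in bytes 7 to 8); the function name `decode_read10` is made up for the example, and the 512-byte block size is inferred from the logged length rather than stated anywhere in the trace.

```python
def decode_read10(cdb_hex):
    """Return (lba, blocks) for a READ(10) CDB given as a hex string."""
    cdb = bytes.fromhex(cdb_hex)          # fromhex skips the spaces
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # transfer length in blocks
    return lba, blocks


# CDB copied from the first timed-out command in the trace above.
lba, blocks = decode_read10("28 00 82 b5 3b 80 00 01 00 00")
print(hex(lba), blocks, blocks * 512)   # 0x82b53b80 256 131072
```

The 256-block transfer times 512 bytes gives 131072, which matches the "length 131072" field logged next to the CDB.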

Re: 9211 (LSI/SAS) issues on 11.2-STABLE

2019-02-05 Thread Karl Denninger

On 2/2/2019 12:02, Karl Denninger wrote:
> I recently started having some really oddball things  happening under
> stress.  This coincided with the machine being updated to 11.2-STABLE
> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1.
>
> Specifically, I get "errors" like this:
>
>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
> length 131072 SMID 269 Aborting command 0xfe0001179110
> mps0: Sending reset from mpssas_send_abort for target ID 37
>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
> length 131072 SMID 924 terminated ioc 804b loginfo 3114 scsi 0 state
> c xfer 0
>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> length 131072 SMID 161 terminated ioc 804b loginfo 3114 scsi 0 state
> c xfer 0
> mps0: Unfreezing devq for target ID 37
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: Command timeout
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: SCSI Status Error
> (da12:mps0:0:37:0): SCSI status: Check Condition
> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
> reset, or bus device reset occurred)
> (da12:mps0:0:37:0): Retrying command (per sense data)
>
> The "Unit Attention" implies the drive reset.  It only occurs on certain
> drives under very heavy load (e.g. a scrub.)  I've managed to provoke it
> on two different brands of disk across multiple firmware and capacities,
> however, which tends to point away from a drive firmware problem.
>
> A look at the pool data shows /no /errors (e.g. no checksum problems,
> etc) and a look at the disk itself (using smartctl) shows no problems
> either -- whatever is going on here the adapter is recovering from it
> without any data corruption or loss registered on *either end*!
>
> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:
>
> mps0:  port 0xc000-0xc0ff mem
> 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
> 1285c

After considerable additional work this looks increasingly like either a
missed interrupt or a command is getting lost between the host adapter
and the expander.

I'm going to turn the driver debug level up and see if I can capture
more information. Whatever is behind this, however, it is
almost certainly related to something that changed between 11.1 and
11.2, as I never saw these on the 11.1-STABLE build.

--
Karl Denninger
k...@denninger.net 
/The Market Ticker/
/[S/MIME encrypted email preferred]/




Re: More CARP issues under 12

2019-02-05 Thread Pete French
> Hi,
>
> What branch and revision do you use? Can you install gdb and then obtain
> this information:

The branch and revision is 12.0-STABLE r343538 GENERIC

> # kgdb
>
> (kgdb) list *ether_output+0x6b6

Trying to do this on the actual box is hard, as it panics, but on another
machine running the same build I get this, which should suffice if you
are just interested in seeing the line in the source code?

(kgdb)  list *ether_output+0x6b6
0x80ca1526 is in ether_output (/usr/src/sys/net/if_ethersubr.c:435).
430 if (m == NULL)
431 return (0);
432 }
433
434 /* Continue with link-layer output */
435 return ether_output_frame(ifp, m);
436 }
437
438 static bool
439 ether_set_pcp(struct mbuf **mp, struct ifnet *ifp, uint8_t pcp)
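
One way to cross-check the listing against the panic is to subtract the offset from each reported address, which gives the start of ether_output in each kernel. A small sketch (the helper name `symbol_base` is made up for illustration; the addresses are copied from this thread in the shortened form they appear in here):

```python
def symbol_base(address, offset):
    """Return a function's start address given 'address at symbol+offset'."""
    return address - offset


panic_base = symbol_base(0x80ca0ce6, 0x6b6)  # frame #7 of the panic backtrace
kgdb_base = symbol_base(0x80ca1526, 0x6b6)   # address kgdb printed on the other box

print(hex(panic_base), hex(kgdb_base), hex(kgdb_base - panic_base))
# 0x80ca0630 0x80ca0e70 0x840
```

The two bases differ by 0x840, which suggests the two kernels are not byte-identical, so line 435 may only approximate the location in the kernel that panicked.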




Re: More CARP issues under 12

2019-02-05 Thread Andrey V. Elsukov
On 17.01.2019 15:19, Pete French wrote:
> so, having got a workaround for yesterdays problems, I now went to upgrade my
> other pair of boxes using CARP. No 'pf' on these, just one shared address.
> This is the setup I have tested in development and it works fine.
> 
> I install the new kenel and do the first reboot - and I get the panic
> below. Maybe its not carp related, but seems suspicious as the last
> thing it spits out is a carp message.
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0x28
> fault code  = supervisor read data, page not present
> instruction pointer = 0x20:0x80ca0de1
> stack pointer   = 0x28:0xfe4da740
> frame pointer   = 0x28:0xfe4da760
> code segment= base 0x0, limit 0xf, type 0x1b
> = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags= interrupt enabled, resume, IOPL = 0
> current process = 12 (swi4: clock (0))
> trap number = 12
> panic: page fault
> cpuid = 0
> time = 1547727391
> KDB: stack backtrace:
> #0 0x80be8597 at kdb_backtrace+0x67
> #1 0x80b9ccf3 at vpanic+0x1a3
> #2 0x80b9cb43 at panic+0x43
> #3 0x8107382f at trap_fatal+0x35f
> #4 0x81073889 at trap_pfault+0x49
> #5 0x81072eae at trap+0x29e
> #6 0x8104e1a5 at calltrap+0x8
> #7 0x80ca0ce6 at ether_output+0x6b6
> #8 0x80d0bda4 at arprequest+0x4c4
> #9 0x80d0d9fc at garp_rexmit+0xbc
> #10 0x80bb6ba9 at softclock_call_cc+0x129
> #11 0x80bb7089 at softclock+0x79
> #12 0x80b60e79 at ithread_loop+0x169
> #13 0x80b5e012 at fork_exit+0x82
> #14 0x8104f18e at fork_trampoline+0xe
> Uptime: 19s

Hi,

What branch and revision do you use? Can you install gdb and then obtain
this information:

# kgdb

(kgdb) list *ether_output+0x6b6

-- 
WBR, Andrey V. Elsukov





Kernel panic going multiuser under 12 ( was Re: More CARP issues under 12 (maybe not CARP after all))

2019-02-05 Thread Pete French




Just to get the subject correct: I tested this with CARP disabled and I
still see the panic when going multi-user. It is networking related, as the
panic is in the ARP code, and it seems to happen when the network
interfaces are configured. The machine was using a mix of em and igb
interfaces, but is now igb only.


-pete.