On 4/11/2019 13:57, Karl Denninger wrote:
> On 4/11/2019 13:52, Zaphod Beeblebrox wrote:
>> On Wed, Apr 10, 2019 at 10:41 AM Karl Denninger wrote:
>>
>>
>>> In this specific case the adapter in question is...
>>>
>>> mps0: port 0xc000-0xc0ff mem
>>> 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
>>> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities:
>>> 1285c
>>>
>>> Which is indeed a "dumb" HBA (in IT mode), and Zaphod says he connects
>>> his drives via dumb on-MoBo direct SATA connections.
>>>
>> Maybe I'm in good company. My current setup has 8 of the disks connected
>> to:
>>
>> mps0: port 0xb000-0xb0ff mem
>> 0xfe24-0xfe24,0xfe20-0xfe23 irq 32 at device 0.0 on pci6
>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>> mps0: IOCCapabilities:
>> 5a85c
>>
>> ... just with a cable that breaks out each of the 2 connectors into 4
>> SATA-style connectors, and the other 8 disks (plus boot disks and SSD
>> cache/log) connected to ports on...
>>
>> - ahci0: port
>> 0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem
>> 0xfe90-0xfe9001ff irq 44 at device 0.0 on pci2
>> - ahci2: port
>> 0xa050-0xa057,0xa040-0xa043,0xa030-0xa037,0xa020-0xa023,0xa000-0xa01f mem
>> 0xfe61-0xfe6107ff irq 40 at device 0.0 on pci7
>> - ahci3: port
>> 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem
>> 0xfea07000-0xfea073ff irq 19 at device 17.0 on pci0
>>
>> ... each drive connected to a single port.
>>
>> I can actually reproduce this at will. Because I have 16 drives, when one
>> fails, I need to find it. I pull the sata cable for a drive, determine if
>> it's the drive in question, if not, reconnect, "ONLINE" it and wait for
>> resilver to stop... usually only a minute or two.
>>
>> ... if I do this 4 to 6 odd times to find a drive (I can tell, in general,
>> that a drive is part of the SAS controller or the SATA controllers... so
>> I'm only looking among 8, ever) ... then I "REPLACE" the problem drive.
>> More often than not, a scrub will find a few problems. In fact, it
>> appears that the most recent scrub is an example:
>>
>> [1:7:306]dgilbert@vr:~> zpool status
>> pool: vr1
>> state: ONLINE
>> scan: scrub repaired 32K in 47h16m with 0 errors on Mon Apr 1 23:12:03
>> 2019
>> config:
>>
>> NAME STATE READ WRITE CKSUM
>> vr1 ONLINE 0 0 0
>> raidz2-0 ONLINE 0 0 0
>> gpt/v1-d0 ONLINE 0 0 0
>> gpt/v1-d1 ONLINE 0 0 0
>> gpt/v1-d2 ONLINE 0 0 0
>> gpt/v1-d3 ONLINE 0 0 0
>> gpt/v1-d4 ONLINE 0 0 0
>> gpt/v1-d5 ONLINE 0 0 0
>> gpt/v1-d6 ONLINE 0 0 0
>> gpt/v1-d7 ONLINE 0 0 0
>> raidz2-2 ONLINE 0 0 0
>> gpt/v1-e0c ONLINE 0 0 0
>> gpt/v1-e1b ONLINE 0 0 0
>> gpt/v1-e2b ONLINE 0 0 0
>> gpt/v1-e3b ONLINE 0 0 0
>> gpt/v1-e4b ONLINE 0 0 0
>> gpt/v1-e5a ONLINE 0 0 0
>> gpt/v1-e6a ONLINE 0 0 0
>> gpt/v1-e7c ONLINE 0 0 0
>> logs
>> gpt/vr1log ONLINE 0 0 0
>> cache
>> gpt/vr1cache ONLINE 0 0 0
>>
>> errors: No known data errors
>>
>> ... it doesn't say it now, but there were 5 CKSUM errors on one of the
>> drives that I had trial-removed (and not on the one replaced).
> That is EXACTLY what I'm seeing; the "OFFLINE'd" drive is the one that,
> after a scrub, comes up with the checksum errors. It does *not* flag
> any errors during the resilver and the drives *not* taken offline do not
> (ever) show checksum errors either.
>
> Interestingly enough you have 19.00.00.00 firmware on your card as well
> -- which is what was on mine.
>
> I have flashed my card forward to 20.00.07.00 -- we'll see if it still
> does it when I do the next swap of the backup set.
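
Since the symptom in both reports is a non-zero CKSUM count on the drive that was taken offline and brought back, a small script can pick those rows out of `zpool status` after each scrub. The following is only a rough sketch in Python, not something either poster ran: the function name `devices_with_errors` and the sample text are made up for illustration, and it assumes plain integer counters (rows where zpool abbreviates a count, e.g. "1.2K", are skipped).

```python
import re

# One config row: name, STATE, then READ / WRITE / CKSUM as plain integers.
# Assumption: counters are not abbreviated (a value like "1.2K" will not match).
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\b")

def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # headers, "pool:", "scan:" and similar lines fall through here
        name, _state, rd, wr, ck = m.groups()
        counts = (int(rd), int(wr), int(ck))
        if any(counts):
            flagged[name] = counts
    return flagged

if __name__ == "__main__":
    # Hypothetical sample; the device names and counts are illustrative only.
    sample = """\
  pool: example
 state: ONLINE
config:

    NAME           STATE     READ WRITE CKSUM
    example        ONLINE       0     0     0
      raidz2-0     ONLINE       0     0     0
        gpt/d0     ONLINE       0     0     0
        gpt/d1     ONLINE       0     0     5
"""
    for dev, (rd, wr, ck) in devices_with_errors(sample).items():
        print(f"{dev}: read={rd} write={wr} cksum={ck}")
```

The text could be captured from a real pool with `subprocess.run(["zpool", "status", pool], capture_output=True, text=True).stdout`. Saving the result after each offline/online cycle and each scrub would show whether the errors always land on the drive that was pulled.
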
Very interesting.
This drive was last written/read under 19.00.00.00. Yesterday I swapped
it back in. Note that right now I am running:
mps0: port 0xc000-0xc0ff mem
0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities:
1285c
And, after the scrub completed overnight:
[karl@NewFS ~]$ zpool status backup
pool: backup
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using