Hi Stefan

Thank you very much for your feedback, suggestions and hints.

Indeed, yesterday I saw one read and one write error related to the Samsung PRO SSDs before another OS crash (I ran several different tests writing big files to the RAID5 using the "dd" or "cat" commands).

Today I installed three new 1TB Samsung PRO 860 SSD drives in a third box (also an ASUS mainboard with an AMD FX CPU and 16GB ECC RAM) and set up RAID5 as described in the attached file.

And again a similar error after dd (slightly different values):
# ---
dd if=/dev/urandom of=/arc-3xssd/1GB-urandom.bin bs=1M count=1024

# Error messages

uvm_fault(0xffffffff821ede50, 0x40, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at      sr_validate_io+0x44:    cmpl     $0,0x40(%r9)
ddb{4}>

The error happens on the RAID5 level (there is no encryption).

In the test case above I used 30cm long SATA 3G cables (the Samsung PRO 860 and the SATA controller are 6G) as I did not have 6G SATA cables available.
I ran the original tests with 6G SATA cables.

For some reason the "ddb{4}>" prompt is frozen, so I am not able to type anything at the ddb prompt on the console (and I don't see any output when blindly typing "trace" or "ps").

I have some older Samsung PRO 850 SSDs somewhere, so I will try to test the RAID5 configuration with them.

Kind regards
Mark


On 28.02.21 20:17, Stefan Sperling wrote:
> On Sun, Feb 28, 2021 at 03:05:49AM +0100, Mark Schneider wrote:
> > Hi again,
> >
> > I have repeated the softraid tests using six 1TB Samsung HDD 3G SATA
> > drives as RAID5, and I do not see the OS crash that occurs when using SSDs
> > in the RAID5.
> > Details of the RAID5 setting are in the attached file.
> >
> > It looks like using SSD drives as RAID5 leads for some reason to the OpenBSD
> > 6.8 crash. The Samsung 512GB PRO 860 SSDs have a 6G SATA interface (which is
> > different from the tested HDDs).
> >
> > NB: Using those SSDs as RAID6 on Debian Linux (buster - mdadm / cryptoLUKS)
> > does not cause any issues.
> > There are also no issues using those SSDs as RAID on FreeBSD (TrueNAS).
>
> I've seen some Samsung Pro SSDs cause I/O errors on ahci(4) due to unhandled
> NCQ error conditions. Not sure if this relates to your problem; I assume that
> these errors were specific to my machine, which is over 10 years old. Its AHCI
> controller has likely not been designed with modern SSDs in mind. I switched
> to different SSDs and the problem disappeared. This was on RAID1 where the
> kernel didn't crash. Instead, the volume ended up in degraded state.
>
> Maybe some I/O error is happening in your case as well?
> Perhaps the raid5 code doesn't handle i/o errors gracefully?
>
> In any case, your bug report is missing important information:
>
> > # Error messages
> >
> > uvm_fault(0xffffffff821f5490, 0x40, 0, 1) -> e
> > kernel: page fault trap, code=0
> > Stopped at      sr_validate_io+0x44:    cmpl     $0,0x40(%r9)
> > ddb{2}>
>
> This tells us where it crashed but not how the code flow ended up here.
> Please show the stack trace printed by the 'trace' command, and the output
> of the 'ps' command (both commands at the ddb> prompt).

# OpenBSD 6.8 RAID5 configuration with three 1TB "Samsung SSD PRO 860" drives 


sysctl hw.disknames

disklabel sd1
disklabel -E sd1
disklabel -E sd2
disklabel -E sd3

bioctl -c 5 -l sd1a,sd2a,sd3a softraid0
disklabel -E sd4

newfs sd4a

obsdarc# mkdir /arc-3xssd
obsdarc# mount /dev/sd4a /arc-3xssd/
obsdarc# df -h | grep 3xssd 
/dev/sd4a      1.8T    8.0K    1.8T     0%    /arc-3xssd





# ------------------------------------------------------------------------------
dd if=/dev/urandom of=/arc-3xssd/1GB-urandom.bin bs=1M count=1024

# Error messages

uvm_fault(0xffffffff821ede50, 0x40, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at      sr_validate_io+0x44:    cmpl     $0,0x40(%r9)
ddb{4}>
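
A smaller write followed by a read-back comparison could help check whether data that does reach the volume is intact. The following is only a minimal sketch: the helper name write_and_verify and the file name verify-test.bin are hypothetical, and it is not known from this report whether a small write also triggers the fault. Note that the read-back may be served from the buffer cache rather than from the disks.

```shell
#!/bin/sh
# write_and_verify DIR SIZE_MB
# Hypothetical helper (not part of the original report): writes SIZE_MB of
# random data to a temporary file, copies it into DIR, and compares checksums.
write_and_verify() {
    dir=$1
    size_mb=$2
    src=$(mktemp) || return 1
    dst="$dir/verify-test.bin"

    # bs=1048576 is the same block size as bs=1M used in the report.
    if ! dd if=/dev/urandom of="$src" bs=1048576 count="$size_mb" 2>/dev/null; then
        rm -f "$src"
        return 1
    fi

    # Copy onto the volume under test and flush pending writes.
    if ! cp "$src" "$dst"; then
        rm -f "$src" "$dst"
        return 1
    fi
    sync

    # cksum reading from stdin prints only the CRC and the byte count.
    want=$(cksum < "$src")
    got=$(cksum < "$dst")
    rm -f "$src" "$dst"

    # Succeeds only when both files have the same CRC and size.
    [ -n "$want" ] && [ "$want" = "$got" ]
}
```

For example, "write_and_verify /arc-3xssd 64 && echo OK" would test a 64 MB file on the mount point used above.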


# ------------------------------------------------------------------------------
obsdarc# disklabel sd1

# /dev/rsd1c:
type: SCSI
disk: SCSI disk
label: Samsung SSD 860 
duid: cb0d589d6d25894e
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 255
sectors/cylinder: 16065
cylinders: 124519
total sectors: 2000409264
boundstart: 0
boundend: 2000409264
drivedata: 0 

16 partitions:
#                size           offset  fstype [fsize bsize   cpg]
  a:       2000409264                0    RAID                    
  c:       2000409264                0  unused  

...


# ------------------------------------------------------------------------------
obsdarc# disklabel sd4 

# /dev/rsd4c:
type: SCSI
disk: SCSI disk
label: SR RAID 5
duid: 2f9692cd2e3a048f
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 255
sectors/cylinder: 16065
cylinders: 249039
total sectors: 4000817408
boundstart: 0
boundend: 4000817408
drivedata: 0 

16 partitions:
#                size           offset  fstype [fsize bsize   cpg]
  a:       4000817408                0  4.2BSD   8192 65536 52270 
  c:       4000817408                0  unused                    
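
As a sanity check, the size of sd4 can be related to the size of the three members. The sketch below assumes that softraid reserves a 528-sector data offset on each chunk and rounds the rest down to a 128-sector (64 KiB) strip; both values are assumptions, not taken from this report, although they happen to reproduce the reported total. The function name raid5_volume_sectors is hypothetical, and the shell needs 64-bit arithmetic.

```shell
#!/bin/sh
# raid5_volume_sectors MEMBERS CHUNK_SECTORS [DATA_OFFSET] [STRIP_SECTORS]
# Hypothetical helper. The defaults (528-sector data offset, 128-sector
# strip) are assumptions about softraid, not values from the report.
raid5_volume_sectors() {
    members=$1
    chunk_sectors=$2
    data_offset=${3:-528}
    strip_sectors=${4:-128}

    # Space left on each chunk after the assumed metadata area.
    usable=$(( chunk_sectors - data_offset ))
    # Round down to a whole number of strips.
    usable=$(( usable / strip_sectors * strip_sectors ))
    # RAID5 gives up one chunk's worth of space to parity.
    echo $(( usable * (members - 1) ))
}

# Three members of 2000409264 sectors each, as shown by "disklabel sd1".
raid5_volume_sectors 3 2000409264
# prints 4000817408, the "total sectors" value disklabel reports for sd4
```

At 512 bytes per sector this is 2048418512896 bytes, about 1.86 TiB, which is in line with the 1.8T shown by df.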


# ------------------------------------------------------------------------------
obsdarc# dd if=/dev/urandom of=/arc-3xssd/1GB-urandom.bin bs=1M count=1024

