On 1/14/24 11:48, gene heskett wrote:
On 1/14/24 07:42, David Christensen wrote:
Re-ordered for clarity -- David.
And snipped by Gene as I updated

On 1/12/24 18:42, gene heskett wrote:
I just found an mbox file in my home directory, containing about 90 days' worth of undelivered messages from smartctl running as root.
Do you know how the mbox file got there?
No, it just appeared.

smartctl says my raid10 is dying, ...


Please post a console session with a command that displays the message.
This is a copy/paste of the second message in that file, the first from smartctl, followed by the last message in that file:

 From r...@coyote.coyote.den Wed Nov 02 00:29:05 2022
Return-path: <r...@coyote.coyote.den>
Envelope-to: r...@coyote.coyote.den
Delivery-date: Wed, 02 Nov 2022 00:29:05 -0400
Received: from root by coyote.coyote.den with local (Exim 4.94.2)


It looks like you configured Exim to put root's mailbox in your home directory, to make it easier to read (?).


         (envelope-from <r...@coyote.coyote.den>)
         id 1oq5NB-000DBx-15
         for r...@coyote.coyote.den; Wed, 02 Nov 2022 00:29:05 -0400
To: r...@coyote.coyote.den
Subject: SMART error (SelfTest) detected on host: coyote
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Message-Id: <e1oq5nb-000dbx...@coyote.coyote.den>
From: root <r...@coyote.coyote.den>
Date: Wed, 02 Nov 2022 00:29:05 -0400
Content-Length: 513
Lines: 16
Status: RO
X-Status:
X-Keywords:
X-UID: 2

This message was generated by the smartd daemon running on:

    host name:  coyote
    DNS domain: coyote.den

The following warning/error was logged by the smartd daemon:

Device: /dev/sde [SAT], Self-Test Log error count increased from 0 to 1

Device info:
Samsung SSD 870 EVO 1TB, S/N:S626NF0R302507V, WWN:5-002538-f413394ae, FW:SVT01B6Q, 1.00 TB

For details see host's SYSLOG.
...
I also note these are now very old messages, but the file itself is dated Jan 7th. And syslog has been rotated several times since.

I'm not an expert at interpreting smartctl reports, but I do not see such errors in the smartctl output now. Going backwards through the list, the 4th drive in the raid has had 3334 errors, as has the third drive with 3332 errors; the 1st and 2nd are clean.

One stanza of the error report:
Error 3328 occurred at disk power-on lifetime: 21027 hours (876 days + 3 hours)


I believe "3328" is an error number, not the quantity of errors -- the smartd mail said the count increased from 0 to 1.


SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       50%       10917        1847474376
# 2  Extended offline    Completed: read failure       50%       10586        1847474376

So half the Samsung 870's are on their way out. But nothing recent... So I am now trying to get a good rsync copy onto another drive.


Before you conclude that two of the Samsung 870 1 TB's are dying, please run a SMART short test on all four:

# smartctl -t short /dev/disk/by-id/...


Wait a few minutes for the test to complete (10 minutes should be more than enough).
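
If you want to confirm a given test has finished before pulling the full reports, the self-test log will show the new entry and its completion status (the by-id path is a placeholder, as above):

# smartctl -l selftest /dev/disk/by-id/...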


Then get full SMART reports and save them to files:

# smartctl -x /dev/disk/by-id/... > YYYYMMDD-HHMM-smartctl-x-MANF-MODEL-SERIAL.out
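
If it helps, a small shell loop can capture all four reports in one pass. This is just a sketch -- the glob assumes the Samsung by-id links begin with ata-Samsung_SSD_870_EVO_1TB_, so adjust it to whatever ls shows on your system:

# for d in /dev/disk/by-id/ata-Samsung_SSD_870_EVO_1TB_*; do
      case "$d" in *-part*) continue ;; esac    # skip the partition symlinks
      smartctl -x "$d" > "$(date +%Y%m%d-%H%M)-smartctl-x-${d##*/}.out"
  done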


Then upload the SMART reports someplace we can see them and post the URL's.


* /home is on a RAID 10 made of 2 mirrors, each of 2 @ 1 TB Samsung 870 SSD's?

I think that's what you call a raid10.


Okay.


* 4 @ 2 TB Gigastone SSD for a new RAID 10?

just installed, not mounted or made into a raid yet. WIP?


Okay.


What drives are connected to which ports?

4 Samsung 870 1T's are on the 1st added controller.
ATM 5 2T Gigastones are on the 2nd, 16 port added controller.
smartctl says all 5 of those are fine.


Okay.


What is on the other 20 ports?
On the mobo? A big DVD writer and 2 other half-T or 1T Samsung drives from earlier 860 runs, not currently mounted.
No spinning rust anyplace
now. ...
A current lsblk:
gene@coyote:~$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINTS
sda           8:0    0 931.5G  0 disk
├─sda1        8:1    0 838.2G  0 part   /
├─sda2        8:2    0  46.8G  0 part   [SWAP]
└─sda3        8:3    0  46.6G  0 part   /tmp

sdb           8:16   1     0B  0 disk   is probably my camera, currently plugged in

sdc           8:32   1     0B  0 disk   is probably my Brother MFP-J6920DW printer, always plugged in


So, your OS disk is a Samsung 1 TB SSD on port /dev/sda.


I do not see the second Samsung SSD (?). I would use dmesg(1) and grep(1) to figure out what /dev/sdb and /dev/sdc are:

# dmesg | grep -E '\[sd[bc]\]'
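
Another way to identify them, assuming they are still attached, is to ask lsblk for the transport, vendor, and model strings:

# lsblk -o NAME,TRAN,VENDOR,MODEL,SERIAL /dev/sdb /dev/sdc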


first controller, 6 port
sdd           8:48   0 931.5G  0 disk
├─sdd1        8:49   0   900G  0 part
│ └─md0       9:0    0   1.7T  0 raid10
│   └─md0p1 259:0    0   1.7T  0 part   /home
├─sdd2        8:50   0    30G  0 part
│ └─md1       9:1    0    60G  0 raid10 [SWAP]
└─sdd3        8:51   0   1.5G  0 part
   └─md2       9:2    0     3G  0 raid10
sde           8:64   0 931.5G  0 disk
├─sde1        8:65   0   900G  0 part
│ └─md0       9:0    0   1.7T  0 raid10
│   └─md0p1 259:0    0   1.7T  0 part   /home
├─sde2        8:66   0    30G  0 part
│ └─md1       9:1    0    60G  0 raid10 [SWAP]
└─sde3        8:67   0   1.5G  0 part
   └─md2       9:2    0     3G  0 raid10
sdf           8:80   0 931.5G  0 disk
├─sdf1        8:81   0   900G  0 part
│ └─md0       9:0    0   1.7T  0 raid10
│   └─md0p1 259:0    0   1.7T  0 part   /home
├─sdf2        8:82   0    30G  0 part
│ └─md1       9:1    0    60G  0 raid10 [SWAP]
└─sdf3        8:83   0   1.5G  0 part
   └─md2       9:2    0     3G  0 raid10
sdg           8:96   0 931.5G  0 disk
├─sdg1        8:97   0   900G  0 part
│ └─md0       9:0    0   1.7T  0 raid10
│   └─md0p1 259:0    0   1.7T  0 part   /home
├─sdg2        8:98   0    30G  0 part
│ └─md1       9:1    0    60G  0 raid10 [SWAP]
└─sdg3        8:99   0   1.5G  0 part
   └─md2       9:2    0     3G  0 raid10


Okay.  Those are the 4 @ Samsung 870 1 TB SSD's.


It looks like you partitioned them for three RAID10's:

1.  900 GB first partitions for /home RAID10.

2.  30 GB second partitions for swap RAID10.

3.  What are the 1.5 GB third partitions for?  (One way to check is sketched below.)
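
One quick way to check is to ask what signature the 3 GB array built from them (md2, per your lsblk above) carries:

# blkid /dev/md2
# lsblk -f /dev/md2

If either reports a filesystem type or a swap signature, that tells you what the third partitions were set up for.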


2nd controller, 16 ports, all 5 2T Gigastones
sdh           8:112  0   1.9T  0 disk
└─sdh1        8:113  0   1.9T  0 part
sdi           8:128  0   1.9T  0 disk
└─sdi1        8:129  0   1.9T  0 part
sdj           8:144  0   1.9T  0 disk
└─sdj1        8:145  0   1.9T  0 part
sdk           8:160  0   1.9T  0 disk
└─sdk1        8:161  0   1.9T  0 part
sdl           8:176  0   1.9T  0 disk
└─sdl1        8:177  0   1.9T  0 part
sr0          11:0    1  1024M  0 rom  The internal dvd writer
gene@coyote:~$


Those are the 5 @ Gigastone 2 TB SSD's, with one big partition on each.


 > blkid does not sort them in order either. And of course does not list
 > what's unmounted, forcing me to identify the drive with gparted in order to
 > get its device name. From that I might be able to construct another raid
 > from the 8T of 4 2T drives, but it's confusing as hell when the first of
 > those 2T drives is assigned /dev/sde and the next 4 on the new
 > controller are /dev/sdi, j, k, & l.


Use /dev/disk/by-id/* paths when referring to drives.
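
For example (the exact names depend on the drives, so treat this as illustrative):

# ls -l /dev/disk/by-id/ | grep -v part

That lists one stable symlink per whole drive, with the vendor, model, and serial number embedded in the link name, pointing at whatever /dev/sdX the kernel happened to assign on this boot.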


 > So it appears I have 5 of those Gigastones, and sde is the odd one
Which, when it was /dev/sde1, was plugged into the 1st extra controller.
When the data cable was plugged into a motherboard port, it became /dev/sdb1.  So I've relabeled it, and I'm about to test it on the second 16 port controller.

I am confused -- do you have 4 or 5 Gigastone 2 TB SSD's?

5,  ordered in 2 separate orders.

 > So that one could be formatted ext4 and serve as a backup of the raid10.
That is what I am trying to do now, but I cannot if it is plugged into a motherboard port, hence the repeat of this exercise on the 2nd SATA card.

 > how do I make an image of that
 > raid10  to /dev/sde and get every byte?  That seems like the first step
 > to me.
This I am still trying to do; the first pass copied all 350G of /home but went to the wrong drive, even though I had mounted the drive by its label.
It is now /dev/sdh and all labels above it are now wrong. Crazy.
These SSD's all have an OTP serial number. I am tempted to use that serial number as a label _I_ can control.


When I built and ran a Debian 2 @ HDD RAID1 using mdadm(8), I did not partition the HDD's -- I gave mdadm(8) the whole drives.


And according to gparted, labels do not survive being incorporated into a raid, as the raid is all labeled with hostname : partition number. So there really is no way in Linux to define a drive that is that drive forever. Unreal...


Do what I did -- forget partitions and give the whole SSD's to mdadm(8). Make sure you zero or secure erase them first.
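
Something along these lines -- a sketch only, with placeholder by-id names and an example array name of /dev/md3, so double-check every device before running it:

# wipefs -a /dev/disk/by-id/...
# mdadm --create /dev/md3 --level=10 --raid-devices=4 \
      /dev/disk/by-id/GIGASTONE1 /dev/disk/by-id/GIGASTONE2 \
      /dev/disk/by-id/GIGASTONE3 /dev/disk/by-id/GIGASTONE4
# mkfs.ext4 /dev/md3

Run the wipefs(8) line once per drive first, to clear any old filesystem or RAID signatures.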


Please get a USB 3.x HDD, do a full backup of your entire computer, put it off-site, get another USB 3.x HDD, do another full backup, and keep it nearby.

That, using amanda, is the end target of this. But I have bought 3 such spinning rust drives over the years and not had any survive being hot plugged into a USB port more than twice.

With that track record, I'll not waste any more money down that rabbit hole.


Okay. I would not mind two big USB 3.x SSD's for backups, but I cannot justify the expense.


 >   But since I can't copy a locked file,

What file is locked?  Please post a console session that demonstrates.

A file that is opened but not closed is exclusive to that app and its lock, and cannot be copied except by rsync, or so I have been told.


AIUI that depends upon how locks are implemented -- advisory or enforced.


That said, you want to backup files when they are closed. Coordinating applications, services, and backups such that you obtain correct and consistent backup files every time is non-trivial. My SOHO network is easy -- I close all apps, do not use any services, and run my backup script.
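
If you want to see what is open on /home at any moment, lsof(8) will show you -- since /home is its own mounted filesystem here, this lists every process holding a file open on it:

# lsof /home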


And there are quite a few such open locks on this system right now.


If you installed Debian onto a USB drive (flash or SSD), you could boot that, mount your disks/ RAID's read-only, and run your backups without any open or locked files.
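
A minimal sketch of what that could look like, assuming the rescue system assembles the array as /dev/md0 and /mnt/backup stands in for wherever the backup target is mounted:

# mkdir -p /mnt/home
# mount -o ro /dev/md0p1 /mnt/home
# rsync -aHAX /mnt/home/ /mnt/backup/home/

With the source mounted read-only, nothing can change underneath the copy while it runs.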


This killed my full-housed Amiga when the boot drive with all its custom scripts died, and I found the backups I had were totally devoid of any of those scripts.


That is a good reason to validate your backup/ restore processes.


I still have about 20 QIC tapes from that machine, but now no drives to read them. I need to cull the midden heap.


That is a good reason to backup/ archive onto multiple media types.


David
