Re: Data and hardware protection measures

2024-01-28 Thread Felix Miata
Michael Kjörling composed on 2024-01-28 19:23 (UTC):

> On 28 Jan 2024 19:19 +0100, from h...@adminart.net (hw):

>> On Fri, 2024-01-26 at 15:56 +, Michael Kjörling wrote:

>>> It's also worth talking to your local electrician about installing an
>>> incoming-mains overvoltage protection for lightning protection.

>> Hm I thought it's expensive.

> So did I until I actually asked someone who could give me a quote for
> actually installing it.

Old construction used meter "boxes" that don't support accessories like those.
My utility won't touch mine unless I pay an electrician for an expensive full
service upgrade.
-- 
Evolution as taught in public schools is, like religion,
based on faith, not based on science.

 Team OS/2 ** Reg. Linux User #211409 ** a11y rocks!

Felix Miata



Re: Data and hardware protection measures; was: rsync --delete vs rsync --delete-after

2024-01-28 Thread Brad Rogers
On Sun, 28 Jan 2024 19:19:55 +0100
hw  wrote:

Hello hw,

>How do you know in advance when the battery will have failed?

Even my very basic UPS (APC Backup 1400) has a light on the front
labelled "Replace Battery".  That, combined with a very annoying
high-pitched scream, is a pretty good motivator to do the job.

I know the Backup 1400 was mentioned in this thread as "probably avoid"
(or something similar), but it's served me well thus far.  Had to replace
the battery pack only once.  That was after ten years, not the three to
five that people have been talking about.  APC no longer sell that
model, but battery packs are still available.  Just as an FYI, the
battery packs are sealed Lead-Acid.

Where I live (UK), it's possible to sell lead-acid batteries to scrap
merchants.  Amount paid is variable and subject to massive market forces
that are best described as 'volatile'.

Like others have mentioned with some of the more basic APC devices, this
particular model isn't designed with user replaceable batteries in mind,
but it's not an overly difficult task.  It can't easily (if at all) be
done leaving connected devices powered up, though.

-- 
 Regards  _   "Valid sig separator is {dash}{dash}{space}"
 / )  "The blindingly obvious is never immediately apparent"
/ _)rad   "Is it only me that has a working delete key?"
They take away our freedom in the name of liberty
Suspect Device - Stiff Little Fingers




Re: Data and hardware protection measures

2024-01-28 Thread Michael Kjörling
On 28 Jan 2024 19:19 +0100, from h...@adminart.net (hw):
> On Fri, 2024-01-26 at 15:56 +, Michael Kjörling wrote:
>> On 26 Jan 2024 16:11 +0100, from h...@adminart.net (hw):
>>> I rather spend the money on new batteries (EUR 40 last time after 5
>>> years) every couple years [...]
> 
> To correct myself: I think it was 3 years, not 5, sorry.
> 
>>> The hardware is usually extremely difficult --- and may be impossible
>>> --- to replace.
>> 
>> And let's not forget that you can _plan_ to perform the battery
>> replacement for whenever that is convenient.
> 
> How do you know in advance when the battery will have failed?

You replace the battery before it fails completely.

Most batteries don't go from perfectly fine to completely dead within
one charge cycle.

If the battery drains completely during a power outage before the UPS
has a chance to respond to the battery's loss of capacity, that
becomes a (hopefully clean) power cut, which _still_ is _a lot_ better
than equipment which isn't designed to deal with a significant
overvoltage condition taking the brunt of a lightning strike.

I'm assuming, of course, that you replace the battery with one of the
same chemistry. The UPS will probably assume some discharge
characteristic depending on what battery type the OEM uses (lead acid,
NiCd, NiMH, LiIon, ...); of course if you give the UPS a battery using
some other chemistry, that'll immediately wreak havoc with lots of
things.
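
As a rough illustration of planning the swap before the battery fails
completely, here is a minimal Python sketch. Everything in it is an
assumption for illustration only: the function name, the 3-year age
limit (taken from the low end of the figures mentioned in this thread),
and the 60% runtime threshold. It also assumes you note down the runtime
your UPS reports when the battery is new and again from time to time.

```python
from datetime import date


def replacement_due(installed, today, runtime_new_min, runtime_now_min,
                    max_age_years=3.0, min_runtime_fraction=0.6):
    """Return a list of reasons to schedule a battery swap (empty if none).

    All thresholds are illustrative assumptions, not vendor guidance.
    """
    reasons = []
    # Age check: approximate years since the battery was installed.
    age_years = (today - installed).days / 365.25
    if age_years >= max_age_years:
        reasons.append(f"battery is {age_years:.1f} years old")
    # Capacity check: compare current runtime with the runtime when new.
    if runtime_now_min < min_runtime_fraction * runtime_new_min:
        reasons.append(
            f"runtime fell to {runtime_now_min:.0f} of {runtime_new_min:.0f} minutes"
        )
    return reasons
```

An empty list means neither criterion has been hit yet; anything else is
a hint to order a replacement while the old battery still works.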


>> Which is quite the contrast to a lightning strike blowing out even
>> _just_ the PSU and it needing replacement before you can even use
>> the computer again (and you _hope_ that nothing more took a hit,
>> which it probably did even if the computer _seems_ to be working
>> fine).
> 
> It would also hit the display(s), the switches and through that
> everything that's connected to the network, the server(s) ...  That
> adds up to a lot of money.

Which is why I said "even _just_ the PSU", emphasis original.


>> It's also worth talking to your local electrician about installing an
>> incoming-mains overvoltage protection for lightning protection.
> 
> Hm I thought it's expensive.

So did I until I actually asked someone who could give me a quote for
actually installing it.


> That doesn't exactly help when the failed disk has disappeared
> altogether, as if it had been removed ;)

If that happens, I'd get output along the lines of:

# zpool status
  pool: tank
 state: DEGRADED
  scan: scrub repaired B in  with  errors on 
config:

        NAME                     STATE     READ WRITE CKSUM
        tank                     DEGRADED     0     0     0
          raidz2-0               DEGRADED     0     0     0
            wwn-0x0001-crypt     ONLINE       0     0     0
            8446744073709551616  UNAVAIL      0     0     0  was /dev/mapper/wwn-0x1113-crypt
            wwn-0x2225-crypt     ONLINE       0     0     0
            wwn-0x3337-crypt     ONLINE       0     0     0
            wwn-0x4449-crypt     ONLINE       0     0     0
            wwn-0x555b-crypt     ONLINE       0     0     0

clearly identifying the problem. And also most likely a lot of event
notifications telling me that wwn-0x1113-crypt is having
issues within the "tank" pool, plus any applicable kernel logs for the
device disconnection and perhaps lower-level I/O errors. Similarly, if
a storage device suddenly starts returning garbage, that will likely
show up as CKSUM errors and the device will eventually get kicked out
of the pool, showing as state FAULTED with large error counter values.

(zpool status would also provide some more explanatory details, in the
example above including that "applications are unaffected" because
sufficient redundancy would still exist; but I'm eliding those here
because I don't have them handy and don't feel like creating such a
situation just to get example output. The important part is that the
disk that dropped off the bus will likely show as UNAVAIL with its
internal identifier and a reference to its WWN because of my naming
scheme, instead of as completely missing. The solution is to get a
replacement disk, plug it in, execute "sudo zpool replace tank
$numeric_id $new_device_path", and wait a while; in the meantime I can
still use the system normally.)
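
For monitoring, the device table in output like the above can be
scanned for anything that is not ONLINE. The following is a minimal
Python sketch under stated assumptions: the function name is made up
for illustration, it only looks at the NAME/STATE table, and it does
not treat spares or other special rows specially.

```python
def unhealthy_devices(status_text):
    """Return (name, state) pairs for config entries whose state is not ONLINE."""
    problems = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not in_config:
            # The device table starts right after the NAME/STATE header row.
            in_config = fields[:2] == ["NAME", "STATE"]
            continue
        if not fields:
            break  # a blank line ends the device table
        if len(fields) >= 2 and fields[1] != "ONLINE":
            problems.append((fields[0], fields[1]))
    return problems
```

One way to feed it would be the standard output of "zpool status tank",
captured for example with subprocess.run(["zpool", "status", "tank"],
capture_output=True, text=True).stdout. An empty result means every
listed entry reported ONLINE.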

No matter what kind of storage solution you're using - hardware RAID,
software RAID, no redundancy, whichever - or how you're doing backups
(assuming that you are, for some value of "you"), you can't just
ignore issues with it. That way lies data loss.

-- 
Michael Kjörling  https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”



Re: Data and hardware protection measures; was: rsync --delete vs rsync --delete-after

2024-01-28 Thread hw
On Fri, 2024-01-26 at 15:56 +, Michael Kjörling wrote:
> On 26 Jan 2024 16:11 +0100, from h...@adminart.net (hw):
> > I rather spend the money on new batteries (EUR 40 last time after 5
> > years) every couple years [...]

To correct myself: I think it was 3 years, not 5, sorry.

> > The hardware is usually extremely difficult --- and may be impossible
> > --- to replace.
> 
> And let's not forget that you can _plan_ to perform the battery
> replacement for whenever that is convenient.

How do you know in advance when the battery will have failed?

> Which is quite the contrast to a lightning strike blowing out even
> _just_ the PSU and it needing replacement before you can even use
> the computer again (and you _hope_ that nothing more took a hit,
> which it probably did even if the computer _seems_ to be working
> fine).

It would also hit the display(s), the switches and through that
everything that's connected to the network, the server(s) ...  That
adds up to a lot of money.

> [...]
> It's also worth talking to your local electrician about installing an
> incoming-mains overvoltage protection for lightning protection. I
> won't quote prices because I had mine installed a good while ago and
> also did it together with some other electrical work, but I was
> surprised at how low the cost for that was, and I _know_ that it has
> saved me on at least one occasion.

Hm I thought it's expensive.  I'll ask when I get a chance.

> [...]
> > You can always tell with a good hardware RAID because it
> > will indicate on the trays which disk has failed and the controller
> > tells you.
> 
> Or you can label the physical disks. Whenever I replace a disk, I
> print a label with the WWN of the new disk and place it so that it is
> readable without removing any disks or cabling;

That doesn't exactly help when the failed disk has disappeared
altogether, as if it had been removed ;)

But then, you can go by the numbers of the disks you can still see.

And beware of SSDs; when they fail, they're usually entirely
inaccessible whereas you may still be able to rescue (some) data from
a spinning disk after it failed.

It's probably really bad with mainboards that use M.2 storage since
apparently, they seem to support only one (of the same type at least)
rather than two.  So you can't use those at all.  What's the point of
that?  ZFS cache maybe?



Re: Data and hardware protection measures; was: rsync --delete vs rsync --delete-after

2024-01-26 Thread Michael Kjörling
On 26 Jan 2024 16:11 +0100, from h...@adminart.net (hw):
> I rather spend the money on new batteries (EUR 40 last time after 5
> years) every couple years [...]
> 
> The hardware is usually extremely difficult --- and may be impossible
> --- to replace.

And let's not forget that you can _plan_ to perform the battery
replacement for whenever that is convenient. Which is quite the
contrast to a lightning strike blowing out even _just_ the PSU and it
needing replacement before you can even use the computer again (and
you _hope_ that nothing more took a hit, which it probably did even if
the computer _seems_ to be working fine).


>> I've had no external power outage in the last 5 or 10 years, but a UPS
>> often needs at least one battery replacement during that time.
> 
> Outages are (still) rare here, but it suffices to trigger a fuse or
> the main switch when some device shorts out, or someone working on the
> solar power systems some of the neighbours have, causing crazy voltage
> fluctuations, or a lightning strike somewhere in the vicinity or
> whatever reason for a UPS to be required.

It's also worth talking to your local electrician about installing an
incoming-mains overvoltage protection for lightning protection. I
won't quote prices because I had mine installed a good while ago and
also did it together with some other electrical work, but I was
surprised at how low the cost for that was, and I _know_ that it has
saved me on at least one occasion. It won't do power conditioning or
power loss protection of course, but it _does_ greatly increase the
odds that your home wiring survives a lightning-related voltage surge.
(Nothing will realistically protect you against a _direct_ lightning
strike; in that case the very best you can hope for is damage
containment.)


> More importantly, the hassle involved in trying to recover from a
> failed disk is ridiculously enormous without RAID and can get
> expensive when hours of work were lost.  With RAID, you don't even
> notice unless you keep an eye on it, and when a disk has failed, you
> simply order a replacement and plug it in.

Indeed; the point of RAID is uptime.


> You can always tell with a good hardware RAID because it
> will indicate on the trays which disk has failed and the controller
> tells you.

Or you can label the physical disks. Whenever I replace a disk, I
print a label with the WWN of the new disk and place it so that it is
readable without removing any disks or cabling; then I use the WWN to
identify the disk in software; in both cases because the WWN is a
stable identifier that I can fully expect will never change throughout
the disk's lifetime. So when the system tells me that
wwn-0x123456789abcdef0 is having issues, I can quickly and accurately
identify the exact physical device that needs replacement once I have
a replacement on hand. And if the kernel logs are telling me that,
say, sdg is having issues, I can map that back to whatever WWN happens
to map to that identifier at that particular time. (In practice, I'm
more likely to get useful error details through ZFS status monitoring
tools, where I already use the WWN, so I likely won't need to go that
somewhat circuitous route.)
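
The mapping from a kernel name such as sdg back to a WWN can be read
from the wwn-* symlinks that udev typically maintains under
/dev/disk/by-id on Linux. The following is a minimal Python sketch, not
my actual tooling: the function names are hypothetical, and it assumes
that directory exists and contains wwn-* links on your system.

```python
import os
from pathlib import Path


def wwn_map(by_id_dir="/dev/disk/by-id"):
    """Return {kernel_name: [wwn link names]} for disks and partitions."""
    mapping = {}
    for link in sorted(Path(by_id_dir).glob("wwn-*")):
        if not link.is_symlink():
            continue
        # Each entry is a symlink such as wwn-0x... -> ../../sdg
        target = os.path.realpath(link)
        mapping.setdefault(os.path.basename(target), []).append(link.name)
    return mapping


def wwn_for(kernel_name, by_id_dir="/dev/disk/by-id"):
    """Look up the WWN link names currently pointing at e.g. 'sdg'."""
    return wwn_map(by_id_dir).get(kernel_name, [])
```

Calling wwn_for("sdg") returns the WWN-based names that point at sdg at
that moment, or an empty list if there are none. Since kernel names can
change between boots, the lookup is only meaningful at the time it is
run, which is exactly why the printed label carries the WWN.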


> Yes, my setup is far from ideal when it comes to backups in that I
> should make backups more frequently.  That doesn't mean I shouldn't
> have good backups and that UPSs and RAID were not required.

Or that, again, they solve different problems.

-- 
Michael Kjörling  https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”