On Fri, 2024-01-26 at 16:11 +0100, hw wrote:
> I've never had issues with any UPS due to self tests. The batteries
> need to be replaced when they are worn out. How often that is
> required depends on the UPS and the conditions it is working in,
> usually every 3--5 years.
It was with some small to mid APC model, I think. We had about 1 to
2 kW worth of servers on it, so it was not that small, definitely no
consumer type. When I took over maintenance, somebody had configured
some sort of weekly or biweekly self-test that switched over to
battery, was supposed to run the battery down to 25% or similar, and
then return to mains power/charging. Except once, what the UPS
considered 25% charge apparently was not, and everything shut down
instantly.

> I rather spend the money on new batteries (EUR 40 last time after 5
> years) every couple years rather than spending thousands on replacing
> the hardware when a power surge damages it which could have been
> prevented by the UPS, and it's better to have the machines shut down
> properly rather taking risks with potential data loss, regardless of
> file systems and RAID setups in use.

I think having hardware worth "thousands" and a UPS with batteries
that cheap is not a common combination. In the above company we
certainly had hardware worth thousands, but changing batteries cost
hundreds of Euros, even with off-brand aftermarket parts. It also was
complicated to order the right parts etc.

> RAID isn't as complicated as you think. Hardware RAID is most
> simple, followed by btrfs, followed by mdadm.

I have to disagree with that too. Some hardware RAIDs might be simple,
but others are not. Tracking down the rebrandings of Adaptec, through
acquisitions and mergers, is a science by itself, as is finding and
installing their firmware and utilities. Are they still called Avago,
or something new again? Or all that BBU stuff: tracking the state of
battery backup units on the controller, and ordering and replacing the
correct battery, is not really easy either. This is clearly
enterprise-IT territory that keeps even knowledgeable people busy for
hours if you don't do it at scale and regularly. Also, Linux support
is often problematic.
Yes, it will work, but sometimes certain utilities are not available
or don't work as well as they do on Windows. On the other hand, mdadm
software RAID is well documented and painless.

> With hardware RAID I can instruct someone who has no idea what
> they're doing to replace a failed disk remotely. Same goes for btrfs
> and mdadm, though it better be someone who isn't entirely clueless

In fact this was my job for some time: administering
hardware-RAID-equipped servers and instructing "remote hands" or
customers to swap hard disks. It was not always easy; the correct
disks were not always pulled, even though they were correctly
labelled. Sometimes clueless people tried swapping by themselves and
mixed things up. We also had one server with wrong labelling, for
whatever reason. That was no fun ;)

Now I won't dispute that RAID has its place in data centers and many
other applications. I just doubt that it is the correct choice for
many home users.

> More importantly, the hassle involved in trying to recover from a
> failed disk is ridiculously enormous without RAID and can get
> expensive when hours of work were lost. With RAID, you don't even
> notice unless you keep an eye on it, and when a disk has failed, you
> simply order a replacement and plug it in.

Yes, that can happen. But more often than not the scenario is like it
is with most notebooks today: you send your notebook in for repair and
have to reinstall anyway. Happened to me. I backed up my Debian
system, sent the device in for hardware repair, and got it back with
Windows 10 ;) And no, it was not the disk that was broken, but the
touchpad.

> It's not like you could go to a hardware store around the corner and
> get a new disk same or next day. Even if you have a store around,
> they will need to order the disk, and that can, these days, take
> weeks or months or longer if it's a small store.

For consumer hard disks?
I just go to my favourite shop if I need a replacement, and they've
got maybe 20 or 30 types of hard disk in stock, to be bought right
away. Even more with SSDs. And I am in a smallish city, pop. 250,000.

> That is simply wrong. RAID doesn't protect you from malware, and
> nothing protects you from user error. If you have data losses from
> malware and/or user error more often than from failed disks, you're
> doing something majorly wrong.

In my experience user error is the main source of data loss. By far.

> This shows that you have no experience with RAID and is not an
> argument.

I've got years of experience with RAID, both in my personal use and
with employers doing stuff on RAID for customers and internal
services. In my experience RAID is a nice solution for
data-center-type setups, but it is often problematic for home users or
even small offices.

> Making backups is way more complicated than RAID. You can way more
> easily overwrite the wrong backup or misinterpret error messages of
> your backup solution than you can pull the wrong disk from a RAID or
> misinterpret error messages from your RAID.

Yes, making backups is hard. But my main point is: you need good
backups anyway, so RAID does not help you here. If you neglect backups
because you've got RAID, you are living dangerously. And some people
actually do this. I think it is wrong and asking for problems.

> How exactly would you pull the wrong disk from a RAID and thus cause
> data loss? Before you pull one, you make a backup.

Often people with RAID setups do not do this (make a full backup
before a disk swap), for various reasons.

> When the disk has been pulled, its contents remain unchanged and
> when you put it back in, your data is still there --- plus you have
> the backup. Sure it can sometimes be difficult to tell which disk
> you need to replace, and it's not an issue because you can always
> figure out which one you need to replace.
> You can always tell with a good hardware RAID because it will
> indicate on the trays which disk has failed and the controller tells
> you.

I've seen (well, not seen; they were on the other end of the phone)
people delay disk swaps in a RAID until not one but two (of 5 or 6)
physical drives were broken. Then they were instructed to pull e.g.
drives 2 and 3, pulled the wrong ones (e.g. 1 and 2), tried to correct
their error by pulling another one, and boom. Yes, in theory the data
probably was still there and could have been restored with forensic
techniques. But by that time the array was offline, and it was major
repair/data rescue/restore-from-backup time.

> No, I generally don't have spares, and I don't leave my backup server
> running all the time to make backups every few hours or every day
> because electricity is way too expensive, plus it's somewhat loud and
> gives off quite a bit of heat.

This is the personal compromise that I make.

> How often do you verify that you can actually restore everything from
> your backups, and how do you do that?

Well, I don't regularly try to restore "everything", but as some stuff
breaks regularly, most often my laptop, I get to try out replacing
systems once in a while. The last disk I had fail on me was an SSD in
my Proxmox server at home; I did a clean install of Proxmox, restored
the guest backups, and it worked like a charm. Finding a replacement
disk was the hardest part.

Also, migrating to new hardware is a good time to test your backups.
When I moved my Home Assistant from one piece of hardware to a new
type of setup on different hardware, I implicitly tested the backups.
Worked flawlessly.

/ralph
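
P.S. To illustrate the "mdadm is well documented and painless" point:
a disk swap is essentially a handful of commands. This is only a
sketch; /dev/md0, /dev/sda1 and /dev/sdb1 are made-up example devices,
not from any of the setups mentioned above, and you'd obviously adapt
them to your own array.

```shell
# Check array health; a failed member shows up marked (F) in /proc/mdstat
cat /proc/mdstat
mdadm --detail /dev/md0

# Mark the dying member as failed and remove it from the array
# (/dev/md0 and /dev/sdb1 are hypothetical example devices)
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1

# After physically swapping the disk, replicate the partition layout
# from the surviving disk (sgdisk for GPT; sfdisk works for MBR)
sgdisk -R /dev/sdb /dev/sda   # copy partition table from sda to sdb
sgdisk -G /dev/sdb            # give the new disk its own GUIDs

# Add the new partition and watch the array rebuild
mdadm --manage /dev/md0 --add /dev/sdb1
watch cat /proc/mdstat
```

No vendor firmware, no BBU, no Windows-only utility involved.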

