Hi Packmans,

I know it is lame to self-reply, ... but anyhow ... a tl;dr is at the end of the mail :)
On Monday, 20 April 2020, 15:21:42 CEST, Stefan Botter wrote:
> I hope, I can give more insight in the next few days.

Well, it is now "the next few days", so what happened?

The initial problem arose during the evening hours of Apr 1st, when a rather unusual blackout hit the part of town where my servers are hosted. I have a UPS, but it bridges only 8-10 minutes, and the blackout lasted 30 minutes. There should have been emergency power from a diesel generator (which, by the way, was scheduled to be replaced the following weekend, though that has been postponed due to COVID-19), but for unknown reasons the generator did not kick in. I was able to restart everything on Thursday morning.

A secondary problem then surfaced; it affected the whole system badly, and I was rather clueless about it until today. PMBS runs alongside my personal VMs as a VMware guest on my lab system (two ESXi hosts). The lab is set up according to best practices, with two network-facing switches and two separate switches for storage. The storage device is a Synology DS620 with four 1 TB SSDs, connected via iSCSI. Backups go over the storage network to a separate DS216+II and, until Apr 10th, were done by Synology's Advanced Backup for Business, which basically takes snapshots of the VMs and copies the changed blocks to the backup storage space. Since the blackout, every time a backup ran, at least one of the ESXi hosts froze or lost network connectivity.

Since Apr 15th, PMBS is backed up by simple means of rsync, with one backup copy created daily. This does not seem to put as heavy a strain on the network. I am still contemplating a versioned backup with rdiff-backup, which I use regularly on my other machines, but I am not sure whether my available backup space will be sufficient, or how long backup runs would take on PMBS. So this is on the "maybe-ToDo" list.

Still, I did not know the cause of the lock-ups. By chance I discovered very similar behavior with network interruptions early last week, when network connectivity was lost during a download of a VM image to my home system. It recovered automagically after 10-30 minutes and was reproducible. Over the course of the weekend and today I managed to investigate further and found that one of the network add-in cards in one of the servers acted strangely under load. I reconfigured the ESXi servers to use the LAN-on-mainboard (LOM) adapters only, and am now fairly confident that the system runs stably again. I have some spare quad-port cards lying around and will replace the presumably defective adapters some time in the future, to bring the lab back in line with best practices, but for now everything should work without frequent interruptions.

As the worldwide COVID-19 calamity and the now emergency-emerging ;) changes to the schooling environment are putting heavy demands for immediate action on the school's IT, I have had rather little time to work on "personal fun", so it took a while longer to resolve the branching issue that caused this thread. The reported errors were caused by the frequent unwanted shutdowns, which left some state-recording files for the source server and the schedulers with binary garbage at the end.

I thought it was a good idea to document the events and the sort-of solution, for you to enjoy and for me to remember, as I will probably have forgotten what happened and what I did in a few weeks :)

tl;dr: everything should work again without frequent interruptions.

Greetings,
Stefan
--
Stefan Botter, zu Hause, Bremen
