Protesilaos Stavrou wrote:
> "Basil L. Contovounesios" wrote:
> > Bob Proulx writes:
> >> Are all of those messages yours?  They all have the same unique string
> >> pattern.
> >
> > This pattern is generated by an Emacs MUA.  The @tcd.ie ones are mine,
> > and the @protesilaos.com ones are Prot's (CCed).  I think I received the
> > messages locally, but they're clearly missing from
> > https://bugs.gnu.org/45068 and possibly other places too.  Should I just
> > resend the missing messages?
>
> Hello!  I noticed that they were missing, but assumed that the sync
> takes some time.
>
> Please re-send them or tell me how I can do it from here.
When Lars provided me with a message-id for one of his missing messages I was able to grep around and find that message and the others in the logs.  The logs said those message-ids had been discarded.  That's all I know.  Sorry.  The group of them all together just stood out as looking unusual to my eye, and therefore I mentioned it.  I don't know if there is a systematic failure that needs to be fixed or if it was simply human error due to the systems problems and the large spam backlog.

One of the contributing factors may have been the storage array problems yesterday.  When a system can't read or write files, the process trying to do so gets blocked waiting for I/O and pauses in an uninterruptible wait state.  (In the Linux kernel a ps listing shows this uninterruptible state as the "D" state.)  Since most OS functions get cached in the file system buffer cache in RAM, most of the systems were still able to function at some level.  As far as I know none of the systems outright crashed.  But the processes blocked waiting for I/O from the networked storage server did pile up.  I saw that fencepost had a system load of more than 1100!

The FSF admins worked almost all day Sunday, from morning through late afternoon, to restore the storage array.  As you can imagine it was a high-stress situation for them.  Meanwhile, after the initial couple of hours, the rest of the systems were mostly restored to normal operation and were able to drain down their high CPU load averages.  The uninterruptible processes completed the I/O reads and writes upon which they were blocked and were able to exit.  However, after being blocked for a long time, processes that have timeouts will time out and be killed for taking too long to complete.
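For anyone curious how this looks from a shell, here is a quick sketch (assuming a Linux system with procps) of how one might spot those uninterruptible processes and see why they inflate the load average:

```shell
# List processes stuck in uninterruptible I/O wait: the STAT
# column begins with "D" for such processes, and WCHAN hints at
# the kernel function they are blocked in.
ps -eo pid,stat,wchan:32,cmd | awk 'NR == 1 || $2 ~ /^D/'

# The Linux load average counts runnable AND uninterruptible
# processes, which is how blocked I/O can push the load into the
# hundreds (or past 1100) even while the CPUs sit mostly idle.
uptime
```

On a healthy machine the ps pipeline prints only its header line; during an outage like this one it would show the pile-up directly.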
The large mail backlog yesterday meant that humans looking at the mailman web page hold queue were looking at dozens and dozens of messages, most of which were spam, because the anti-spam "cancel bot" was also backlogged.  That's almost the worst case for a human looking at mail messages and trying to pick out the non-spam from the sea of spam.  But I really have no idea about any particular message and am just guessing.

I also don't know the deep details of the storage array problems.  Perhaps the FSF admins will write up a blog note about it.  That would be interesting to me.  From what I could tell there was a coupled failure of multiple controller nodes causing the array to lose redundancy.  At least one of the arrays went offline completely.  They had to carefully reset and restore the redundancy quorum of the disk storage and the controller nodes.  Other than the initial hour when things were completely offline, the subsequent restoration was all done online while the system was functioning in a degraded RAID mode.  Which is pretty cool when you think of it!

> [ I am using Emacs+Gnus and this setup has been stable for a fairly long
> time ]

Emacs+Gnus worked great.  No problems there at all.  The only reason Emacs+Gnus got mentioned was that it created a message-id format that I did not recognize, and therefore I asked if those were all from Lars.  Basil told me those were from Emacs.  Which is great.  No problems there at all.

Bob