In early 1971, I started working as an on-site software tech on a rather large (for the time), dual-processor Burroughs B6500. It was early days for that system, and we had a lot of problems with it in the field. We got those ironed out pretty well and eventually got the system up to an acceptable level of operation.
About a year later, a major upgrade to the MCP (the OS) was released. A major component of that release was a completely rewritten I/O subsystem with much higher reliability and much, much better performance, especially in "Logical I/O," the interface between user programs and the memory buffers and physical I/O mechanism. Soon after installing this release, we started to get fatal system crashes in Logical I/O.
Describing what was happening requires a little background on the system architecture. The B6500, like its predecessor the B5500, is known as an ALGOL-oriented stack machine, but it is less well-known as a type of capability system. To support that, it used a segmented memory model and tagged memory. Each word had an extra three bits, not accessible to user programs, that identified the type of data in the word. For example, tag 0 was ordinary data, tag 2 indicated a double-precision word, tag 5 was a data descriptor through which data segments were addressed, and tag 7 was a Program Control Word (PCW) that effectively addressed a location in an object-code segment. A PCW was used primarily as a procedure (subroutine) entry-point address.
In reading the dumps from the crashes and the MCP source code, we started to learn how the new Logical I/O mechanism worked. It cleverly used the stack addressing of the system to implement a very object-oriented interface. The "methods" of this interface were small procedures that were customized to handle very specific cases of record handling -- random vs. sequential I/O, blocked vs. unblocked I/O, translation or no translation, etc. There were scores of these methods. The idea was to do as little as possible for each user request, to avoid making as many decisions as possible, and to optimize the buffer handling in each case. There were about a half-dozen different types of user requests, and the methods for those were accessed through a branch table in the FIB (control block) for each open file.
That branch table contained PCWs for the appropriate methods needed by that file. The table was set up during file open, but could be changed as the nature of the user program's requests changed, e.g., from sequential to random access.
We discovered that the crash was caused by some PCWs in the file-level branch tables having tags of 5 instead of tags of 7. Attempting to call a procedure using a tag-5 word was a no-no that was trapped by the hardware, hence the fatal dump. Then we discovered that the branch tables were loaded from a master array of PCWs for all of the possible methods, and when we looked at that array in the dump, ALL of the PCWs in the array had tags of 5! We knew that array initially had to have had words with tags of 7, because the system had run for quite a while before crashing, so how could all of the words in the array suddenly have changed to tags of 5? There wasn't any straightforward way to do that in software.
That master array was loaded from the OS image on disk at the initial boot, but then we found out the array was overlayable -- it could be paged out and back in by the virtual memory mechanism. So we began to suspect there was an issue with I/O. Normally, a disk read stored words in memory with a constant tag of 0, but there was a special I/O mode, termed "tag transfer," that would read and write the tag bits along with the regular data bits.
Fortunately, the other tech on the site had worked in the MCP group for a while and knew the I/O hardware pretty well, so he started writing some standalone programs to exercise the hardware in specific ways. This system had two I/O Multiplexors (multi-channel DMA units) numbered 0 and 1, and the disk drives were dual-ported so that either Mux could address any of them. My colleague's programs tried doing I/Os with various combinations of Mux and channel assignments.
And as you might expect by now, he discovered that doing a tag transfer read through Mux 1 always dropped the middle tag bit. We turned that finding over to the on-site Field Engineers, who pulled out their schematics and started tracing signals. It took several hours, but eventually they discovered not a loose wire, but that Mux 1 was completely missing a wire. Of course, it was the wire that carried the middle bit of the tag during tag transfer.
We finally deduced that the problem had been present since the system left the factory floor. The original I/O software had been so poor that the system was seldom (if ever) able to initiate more than one I/O to disk at a time, but the new version we had recently installed was really good at initiating multiple I/Os. Mux 1 had a lower selection priority than Mux 0, so under the old software it was seldom selected, and perhaps never so for tag transfer I/Os, which are relatively rare. The new software allowed the system to get busy enough that Mux 1 started to be used a lot more often, and eventually it got busy enough that a paging I/O for that master PCW array got scheduled to Mux 1, and the system just didn't survive for very long after that.
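For anyone who wants to see the bit arithmetic, here is a minimal sketch in Python of what the missing wire did to the tags. It is only an illustration, not anything that ran on the B6500: the function names are made up for this example, only the four tag values mentioned above are listed, and it assumes the missing line simply read as 0.

```python
# Tag values named in the story; other tag values are left out.
TAG_NAMES = {
    0: "ordinary data",
    2: "double-precision word",
    5: "data descriptor",
    7: "Program Control Word (PCW)",
}

MIDDLE_TAG_BIT = 0b010  # the bit carried by the wire missing from Mux 1


def tag_transfer_read(tag: int, mux: int) -> int:
    """Return the tag as it lands in memory after a tag-transfer read.

    Assumption: the missing wire makes the middle bit always read as 0.
    """
    if mux == 1:
        return tag & ~MIDDLE_TAG_BIT & 0b111
    return tag


def page_in(tags: list[int], mux: int) -> list[int]:
    """Model paging a whole array of tagged words back into memory."""
    return [tag_transfer_read(t, mux) for t in tags]


if __name__ == "__main__":
    for tag, name in TAG_NAMES.items():
        seen = tag_transfer_read(tag, mux=1)
        print(f"tag {tag} ({name}) -> tag {seen} ({TAG_NAMES[seen]})")

    master_array = [7, 7, 7, 7]          # hypothetical master array of PCWs
    print(page_in(master_array, mux=0))  # healthy path: tags survive
    print(page_in(master_array, mux=1))  # faulty path: every PCW becomes tag 5
```

Tag 7 is binary 111, so losing the middle bit leaves 101, which is tag 5: every PCW paged in through Mux 1 came back looking like a data descriptor, which is exactly what the dump showed. Tags 0 and 5 have no middle bit set and pass through unchanged.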