In early 1971, I started working as an on-site software tech on a rather large 
(for the time), dual-processor Burroughs B6500. It was early days for that 
system, and we had a lot of problems with it in the field. We got those ironed 
out pretty well and eventually got the system up to an acceptable level of 
operation.

About a year later, a major upgrade to the MCP (the OS) was released. A major 
component of that release was a completely rewritten I/O subsystem with much 
higher reliability and much, much better performance, especially in "Logical 
I/O," the interface between user programs and the memory buffers and physical 
I/O mechanism. Soon after installing this release, we started to get fatal 
system crashes in Logical I/O.

Describing what was happening requires a little background on the system 
architecture. The B6500, like its predecessor the B5500, is known as an 
ALGOL-oriented stack machine, but it is less well-known as a type of capability 
system. To support that, it used a segmented memory model and tagged memory. 
Each word had an extra three bits, not accessible to user programs, that 
identified the type of data in a word. For example, tag 0 was ordinary data, 
tag 2 indicated a double-precision word, tag 5 was a data descriptor through 
which data segments were addressed, and tag 7 was a Program Control Word (PCW) 
that effectively addressed a location in an object-code segment. It was used 
primarily as a procedure (subroutine) entry-point address.
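
As a rough illustration of the tag scheme, here is a minimal Python sketch. 
The tag values are the ones listed above; the names Word, call_procedure, and 
InvalidTagTrap are made up for this example, and the real check was done by 
the hardware, not by software.

```python
from dataclasses import dataclass

# Tag values as described above; only the ones mentioned are listed.
TAG_DATA = 0        # ordinary data
TAG_DOUBLE = 2      # double-precision word
TAG_DESCRIPTOR = 5  # data descriptor, used to address data segments
TAG_PCW = 7         # Program Control Word, a procedure entry point


class InvalidTagTrap(Exception):
    """Hypothetical stand-in for the hardware trap on a bad tag."""


@dataclass(frozen=True)
class Word:
    tag: int   # three bits, not accessible to user programs
    data: int  # the ordinary data bits of the word


def call_procedure(word: Word) -> int:
    """Model a procedure entry: only a tag-7 word is a valid target."""
    if word.tag != TAG_PCW:
        raise InvalidTagTrap(
            f"cannot enter a procedure through a tag-{word.tag} word"
        )
    return word.data  # pretend this is the entry-point address
```

In this toy model, calling through Word(TAG_PCW, ...) succeeds, while calling 
through Word(TAG_DESCRIPTOR, ...) raises the trap.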

In reading the dumps from the crashes and the MCP source code, we started to 
learn how the new Logical I/O mechanism worked. It cleverly used the stack 
addressing of the system to implement a very object-oriented interface. The 
"methods" of this interface were small procedures that were customized to 
handle very specific cases of record handling -- random vs. sequential I/O, 
blocked vs. unblocked I/O, translation or no translation, etc.

There were scores of these methods. The idea was to do as little as possible 
for each user request, to make as few decisions as possible, and to optimize 
the buffer handling in each case. There were about a half-dozen 
different types of user requests, and the methods for those were accessed 
through a branch table in the FIB (control block) for each open file. That 
branch table contained PCWs for the appropriate methods needed by that file. 
The table was set up during file open, but could be changed as the nature of 
the user program's requests changed, e.g., from sequential to random access.
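
The arrangement can be sketched loosely in Python. This is only an analogy 
under my own assumptions: the class name FIB is borrowed from the text, but 
MASTER_METHODS, select_methods, and the four tiny procedures are hypothetical 
stand-ins, and ordinary function references play the role of the PCWs.

```python
# Hypothetical stand-ins for the small, specialized Logical I/O procedures.
def read_sequential():
    return "sequential read"


def read_random():
    return "random read"


def write_sequential():
    return "sequential write"


def write_random():
    return "random write"


# Stand-in for the master array of PCWs covering every possible method.
MASTER_METHODS = {
    ("read", "sequential"): read_sequential,
    ("read", "random"): read_random,
    ("write", "sequential"): write_sequential,
    ("write", "random"): write_random,
}


class FIB:
    """Per-file control block holding a branch table of method references."""

    def __init__(self, access="sequential"):
        self.branch_table = {}
        self.select_methods(access)  # set up at file open

    def select_methods(self, access):
        # Copy the appropriate entries from the master table; this can be
        # redone later if the program's access pattern changes.
        for request in ("read", "write"):
            self.branch_table[request] = MASTER_METHODS[(request, access)]

    def request(self, kind):
        # No decisions at request time: just branch through the table.
        return self.branch_table[kind]()
```

The point of the design is that all the deciding happens when the table is 
filled in, so each individual request is a single indirect call.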

We discovered that the crash was caused by some PCWs in the file-level branch 
tables having tags of 5 instead of tags of 7. Attempting to call a procedure 
using a tag-5 word was a no-no that was trapped by the hardware, hence the 
fatal dump. Then we discovered that the branch tables were loaded from a master 
array of PCWs for all of the possible methods, and when we looked at that array 
in the dump, ALL of the PCWs in the array had tags of 5! We knew that array 
initially had to have had words with tags of 7, because the system had run for 
quite a while before crashing, so how could all of the words in the array 
suddenly have changed to tags of 5? There wasn't any straightforward way to do 
that in software.

That master array was loaded from the OS image on disk at the initial 
boot, but then we found out the array was overlayable -- it could be paged out 
and back in by the virtual memory mechanism. So we began to suspect there was 
an issue with I/O. Normally, a disk read stored words in memory with a constant 
tag of 0, but there was a special I/O mode, termed "tag transfer," that would 
read and write the tag bits along with the regular data bits.

Fortunately, the other tech on the site had worked in the MCP group for a while 
and knew the I/O hardware pretty well, so he started writing some standalone 
programs to exercise the hardware in specific ways. This system had two I/O 
Multiplexors (multi-channel DMA units) numbered 0 and 1, and the disk drives 
were dual-ported so that either Mux could address any of them. My colleague's 
programs tried doing I/Os with various combinations of Mux and channel 
assignments. And as you might expect by now, he discovered that doing a tag 
transfer read through Mux 1 always dropped the middle tag bit.
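
The arithmetic is easy to check: a tag of 7 is binary 111, and with the 
middle bit dropped it becomes 101, which is 5. A small Python sketch of that 
behavior follows; the function name tag_transfer_read and its parameters are 
invented for illustration and only model the symptom, not the actual hardware 
path.

```python
MIDDLE_TAG_BIT = 0b010  # the middle bit of the three-bit tag


def tag_transfer_read(tag_on_disk: int, mux: int) -> int:
    """Return the tag that lands in memory after a tag-transfer read."""
    if mux == 1:
        # Faulty path: the middle tag bit never arrives.
        return tag_on_disk & ~MIDDLE_TAG_BIT & 0b111
    # Mux 0 delivers all three tag bits intact.
    return tag_on_disk


if __name__ == "__main__":
    for tag in (0, 2, 5, 7):
        print(tag, "->", tag_transfer_read(tag, mux=1))
```

Under this model, a PCW (tag 7) paged back in through Mux 1 comes back as a 
tag-5 word, while the same read through Mux 0 leaves it untouched.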

We turned that finding over to the on-site Field Engineers, who pulled out 
their schematics and started tracing signals. It took several hours, but 
eventually they discovered not a loose wire, but that Mux 1 was completely 
missing a wire. Of course, it was the wire that carried the middle bit of the 
tag during tag transfer.

We finally deduced that the problem had been present since the system left the 
factory floor. The original I/O software had been so poor that the system was 
seldom (if ever) able to initiate more than one I/O to disk at a time, but the 
new version we had recently installed was really good at initiating multiple 
I/Os. Mux 1 had a lower selection priority than Mux 0, so under the old 
software it was seldom selected, and perhaps never so for tag transfer I/Os, 
which are relatively rare. The new software allowed the system to get busy 
enough that Mux 1 started to be used a lot more often. Eventually a paging I/O 
for that master PCW array got scheduled to Mux 1, and the system just didn't 
survive for very long after that.
