Hello developers,

there are plenty of reports on the internet that the Windows 98 installer crashes or hangs in qemu. I took the effort to track down what causes these problems, and I think I found the core reason, which seems to be a bug in the Microsoft DOS extender DOSX.
The Windows 95/Windows 98 installers are Windows 3.1 applications, and the setup media contain the Windows 3.1 kernel for "standard mode", i.e. the 286 mode of Windows 3.1. The lowest layer of Windows 3.1 running in standard mode is the Microsoft DOS extender (DOSX), which amongst other things provides a DPMI host implementation and does interrupt management. The crashes of the Windows 98 installer I could observe were caused by overflowing the set of interrupt stacks inside DOSX, which can happen if interrupts are generated faster than they are handled.

The code path is like this: While DOSX is active and executing real-mode code with interrupts enabled, an interrupt occurs (e.g. the timer interrupt). All real-mode interrupt vectors are hooked by DOSX, so control is transferred to the corresponding interrupt handler in DOSX. The handler for interrupts occurring in real mode reflects the interrupt to protected mode. The reflection to protected mode happens on one of the internal interrupt stacks inside DOSX. After setting up the interrupt stack and looking up the protected-mode handler, an interrupt return frame for the protected-mode handler is set up, containing the flag register value that was active when the real-mode handler in DOSX was entered (i.e. the return flags from the DOSX handler are copied to the interrupt stack).

The protected-mode interrupt handler in SYSTEM.DRV then at some point decides to chain to the original protected-mode interrupt handler inside DOSX, either by jumping to the handler and re-using the return frame (so the return flags the DOSX handler sees are the same ones the code that reflected the interrupt to protected mode had seen), or on another code path that has the same net effect [skipped as it does not matter for the issue here]. So now DOSX is entered again. The default protected-mode interrupt handler then decides to reflect the interrupt to real mode, i.e. to all the code that hooked the interrupt before DOSX was called.
Just like the reflect-to-protected-mode code, the reflect-to-real-mode code allocates an interrupt stack from the stacks inside DOSX, switches to that stack, and finally calls the original handler (this time in real mode), with the return frame having the same flags as the return frame of the reflection handler.

Long story short: The flags from when the hardware interrupt handler was entered are passed along into the return frame that the first reflecting handler builds for the protected-mode handler. The flags from this return frame are then passed into the return frame that the second reflecting handler builds for the real-mode handler. As interrupts were enabled at the start of that chain (otherwise, it would not have started), we know that the interrupt flag is set in the return frame of the real-mode handler. Also, note that two interrupt stacks got allocated during this process. (The total number of interrupt stacks is 12 by default, and this default is not overridden in the system.ini provided with the Windows 98 installer.)

Now let's assume that for some reason the real-mode handler of the timer interrupt takes more than 55 ms to execute (or qemu is scheduled out in favor of another process, so that less than 55 ms of real CPU time is available between two timer ticks). Then the next timer tick is already pending when the real-mode handler of the timer interrupt returns into the reflect-to-real-mode handler (which is going to switch back to protected mode, return to either SYSTEM.DRV or the reflect-to-protected-mode handler, and free the interrupt stack used for the reflection to real mode). BUT as we know, the interrupt flag is set in the interrupt return frame for the real-mode handler, which causes qemu to accept the next timer interrupt directly after the real-mode handler has returned, with two interrupt stacks still allocated. If the nesting level gets to six, all interrupt stacks are in use.
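The stack arithmetic above can be replayed with a small toy model. This is only a sketch of the bookkeeping described in this mail, not DOSX code: the names NUM_STACKS, STACKS_PER_TICK, stacks_in_use and first_overflowing_level are made up for illustration, and the model assumes that every nested timer tick arrives before the previous one has released either of its two stacks.

```python
# Toy model of the DOSX interrupt stack bookkeeping described above.
# All names are hypothetical; the numbers come from the text of this mail.

NUM_STACKS = 12       # default number of interrupt stacks in DOSX
STACKS_PER_TICK = 2   # one for real->protected, one for protected->real


def stacks_in_use(nesting_level):
    """Stacks held while `nesting_level` timer ticks are nested."""
    return nesting_level * STACKS_PER_TICK


def first_overflowing_level(nested_ticks):
    """Return the first nesting level that needs more stacks than exist.

    Assumes each tick is accepted before the previous tick has freed
    its stacks. Returns None if `nested_ticks` levels still fit.
    """
    for level in range(1, nested_ticks + 1):
        if stacks_in_use(level) > NUM_STACKS:
            return level
    return None


if __name__ == "__main__":
    for ticks in (5, 6, 7):
        print(ticks, stacks_in_use(ticks), first_overflowing_level(ticks))
```

With these assumptions, six nested ticks use exactly all 12 stacks, and the seventh nested tick is the first one for which no stack is left.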
DOSX still allocates further stack frames, resulting in the stack pointer pointing into the data segment of DOSX and damaging important data structures, which will crash the system some time later.

If you know the 8086 architecture by heart, and also know the qemu code, you could get the idea that an emulation bug causes the premature acceptance of the second interrupt (if it were accepted after the stack frames had been cleaned up, there would be no problem): after an IRET or STI instruction, interrupts are only accepted after one further instruction, and only if they are still enabled at that point. So *if* the real-mode handler returned to a CLI instruction, a real 8086-compatible CPU would not accept an interrupt between the IRET and the CLI. Indeed, the DOSX code contains a CLI instruction in the code that tears down the allocated interrupt stack after the real-mode handler has returned, but it is not the first but the third instruction, which is too late even on real hardware. To be exact, the code at the return point of the real-mode handler inside DOSX is "pop ds / pushf / cli".

I don't have any solution for that problem at hand, and I can't say for sure that this nesting of timer interrupts really is a problem if you don't trace qemu with "-d in_asm,cpu" (not tracing should make it faster), but the kinds of crashes I saw with and without tracing were similar, so I expect interrupt stack overflow to cause the crashes observed in the Windows 98 installer.

The main reason I am writing this mail is to archive the knowledge about what happens inside the installer, so that this tedious tracing process doesn't have to be repeated by somebody else interested in fixing the problem, but I am happy to hear suggestions on how this problem can be fixed or worked around.

Regards,
  Michael Karcher