Hi, We've been observing in our application several crashes on Windows related to RAND_poll(). We've been working on this issue for 3 days now, and came up with a possible explanation and fix. Bare with me on this rather lengthy email, as I'll try to document as best I can everything we've done and come up with.
The crash (an access violation, usually -but not always- a null pointer) always happens in Heap32Next(). Because RAND_poll uses Heap32Next to traverse through the heaps, our first assessment was that this function is not thread safe and would leave RAND_poll traversing through garbage data, causing an access violation. After some googling, several old posts on openssl mailing lists appeared to confirm our initial fear that RAND_poll implementation is not thread-safe. However, this does not seem to be a valid explanation. RAND_poll uses CreateToolhelp32Snapshot to create a snapshot of the heap list; snapshot that is supposedly safe. <Quote from MSDN> Snapshots are at the core of the tool help functions. A snapshot is a read-only copy of the current state of one or more of the following lists that reside in system memory: processes, threads, modules, and heaps. Processes that use tool help functions access these lists from snapshots instead of directly from the operating system. The lists in system memory change when processes are started and ended, threads are created and destroyed, executable modules are loaded and unloaded from system memory, and heaps are created and destroyed. The use of information from a snapshot prevents inconsistencies. Otherwise, changes to a list could possibly cause a thread to incorrectly traverse the list or cause an access violation (a GP fault). For example, if an application traverses the thread list while other threads are created or terminated, information that the application is using to traverse the thread list might become outdated and could cause an error for the application traversing the list. </Quote from MSDN> To check this, we did a small test that traverses the heap lists (as done in RAND_poll) while a dozen of other threads randomly allocates/deallocates memory. This test never crashed, hinting that CreateToolhelp32Snapshot could be indeed thread-safe (although not proving it). We also suspected that the crashes could be caused by heap corruption done by our application. In order to disprove this, we enabled full CRT checks on memory allocation/deallocations; no error came up. On top of that, MSDN documentation states that Heap32Next detects heap corruptions. <Quote from MSDN> (Heap32Next) Returns TRUE if information about the next block in the heap has been copied to the buffer or FALSE otherwise. The GetLastError function returns ERROR_NO_MORE_FILES when no more objects in the heap exist and ERROR_INVALID_DATA if the heap appears to be corrupt or is modified during the walk in such a way that Heap32Next cannot continue. </Quote from MSDN> Two things caught our attention. First, the MSDN documentation of CreateToolhelp32Snapshot is inconsistent between WinCE and "regular" Windows. An extra remarks appear on the WinCE documentation. <Quote from MSDN> Because the data captured by CreateToolhelp32Snapshot is static and the system is dynamic, use try-excepts around the APIs that access this data. </Quote from MSDN> The second thing is that the crashes in our application always occurred when the NVIDIA drivers were handling a wglMakeCurrent() from another thread (our application opens a OpenGL context while starting an SSL connection). Following the WinCE MSDN doc, we've protected the inner loop of the heap traversing with try/except blocks. Trials shows this fixes the crashes (the debugger shows the access violation is caught, and the loop moves on to the next heap list). See the attached source file for more information on the modifications we've made (just diff the file against OpenSSL 0.9.8j). Our current assessment is the following. - Either the MSDN documentation is incomplete, and CreateToolhelp32Snapshot's snapshots are not as safe as advertised (WinCE doc tends to hint that way). In which case try/exception blocks are mandatory. - Or external components beyond MS control (such as NVIDIA drivers in this case) break the safety guarantees of CreateToolhelp32Snapshot. As far as we know, our fix does not reduce the randomness of RAND_poll. Would it be premature to think about integrating it into OpenSSL? Considering WinCE MSDN documentation explicitly states that try/except blocks should be used, it might be wise to do it in any case. Regards, Tanguy
rand_win.c
Description: Binary data
