Hi,

We've been observing in our application several crashes on Windows related to 
RAND_poll(). We've been working on this issue for 3 days now, and came up with 
a possible explanation and fix. Bare with me on this rather lengthy email, as 
I'll try to document as best I can everything we've done and come up with.


The crash (an access violation, usually -but not always- a null pointer) always 
happens in Heap32Next().

Because RAND_poll uses Heap32Next to traverse through the heaps, our first 
assessment was that this function is not thread safe and would leave RAND_poll 
traversing through garbage data, causing an access violation. After some 
googling, several old posts on openssl mailing lists appeared to confirm our 
initial fear that RAND_poll implementation is not thread-safe.

However, this does not seem to be a valid explanation. RAND_poll uses 
CreateToolhelp32Snapshot to create a snapshot of the heap list; snapshot that 
is supposedly safe.


<Quote from MSDN>
Snapshots are at the core of the tool help functions. A snapshot is a read-only 
copy of the current state of one or more of the following lists that reside in 
system memory: processes, threads, modules, and heaps.

Processes that use tool help functions access these lists from snapshots 
instead of directly from the operating system. The lists in system memory 
change when processes are started and ended, threads are created and destroyed, 
executable modules are loaded and unloaded from system memory, and heaps are 
created and destroyed. The use of information from a snapshot prevents 
inconsistencies. Otherwise, changes to a list could possibly cause a thread to 
incorrectly traverse the list or cause an access violation (a GP fault). For 
example, if an application traverses the thread list while other threads are 
created or terminated, information that the application is using to traverse 
the thread list might become outdated and could cause an error for the 
application traversing the list.
</Quote from MSDN>


To check this, we did a small test that traverses the heap lists (as done in 
RAND_poll) while a dozen of other threads randomly allocates/deallocates 
memory. This test never crashed, hinting that CreateToolhelp32Snapshot could be 
indeed thread-safe (although not proving it).

We also suspected that the crashes could be caused by heap corruption done by 
our application. In order to disprove this, we enabled full CRT checks on 
memory allocation/deallocations; no error came up. On top of that, MSDN 
documentation states that Heap32Next detects heap corruptions.


<Quote from MSDN>
(Heap32Next) Returns TRUE if information about the next block in the heap has 
been copied to the buffer or FALSE otherwise. The GetLastError function returns 
ERROR_NO_MORE_FILES when no more objects in the heap exist and 
ERROR_INVALID_DATA if the heap appears to be corrupt or is modified during the 
walk in such a way that Heap32Next cannot continue.
</Quote from MSDN>


Two things caught our attention. First, the MSDN documentation of 
CreateToolhelp32Snapshot is inconsistent between WinCE and "regular" Windows. 
An extra remarks appear on the WinCE documentation.

<Quote from MSDN>
Because the data captured by CreateToolhelp32Snapshot is static and the system 
is dynamic, use try-excepts around the APIs that access this data.
</Quote from MSDN>

The second thing is that the crashes in our application always occurred when 
the NVIDIA drivers were handling a wglMakeCurrent() from another thread (our 
application opens a OpenGL context while starting an SSL connection).


Following the WinCE MSDN doc, we've protected the inner loop of the heap 
traversing with try/except blocks. Trials shows this fixes the crashes (the 
debugger shows the access violation is caught, and the loop moves on to the 
next heap list).

See the attached source file for more information on the modifications we've 
made (just diff the file against OpenSSL 0.9.8j).


Our current assessment is the following.

- Either the MSDN documentation is incomplete, and CreateToolhelp32Snapshot's 
snapshots are not as safe as advertised (WinCE doc tends to hint that way). In 
which case try/exception blocks are mandatory.

- Or external components beyond MS control (such as NVIDIA drivers in this 
case) break the safety guarantees of CreateToolhelp32Snapshot.


As far as we know, our fix does not reduce the randomness of RAND_poll. Would 
it be premature to think about integrating it into OpenSSL?

Considering WinCE MSDN documentation explicitly states that try/except blocks 
should be used, it might be wise to do it in any case.


Regards,

Tanguy


Attachment: rand_win.c
Description: Binary data

Reply via email to