On 13.06.2014 02:38, Dmitry Olshansky wrote:
12-Jun-2014 10:34, Rainer Schuetze wrote:

I implemented the QueryWorkingSetEx version like this (you need a
converted psapi.lib for Win32):

Yes, exactly, but I forgot the recipe to convert COFF/OMF import libraries.

Grab coffimplib.exe.

This function
is not supported on XP, though.

I wouldn't worry about it, it's not like XP users are growing in
numbers. Also it looks like only the 64-bit version is good to go, as on
32-bit it would cut usable memory in half.

There could also be the fallback to VirtualQuery if QueryWorkingSetEx doesn't exist.

A short benchmark shows that VirtualQuery needs 55/42 ms for your test
on Win32/Win64 on my mobile i7, while QueryWorkingSetEx takes about 17
ms for both.

Seems in line with my measurements. Strictly speaking, 1/2 of the pages,
interleaved, should give an estimate of the worst case. Together with
remapping (freeing duplicated pages) it doesn't go beyond 250 ms on 640 MB
of heap.

If I add the actual copy into heap2 (i.e. every fourth page of 512 MB is
copied), I get 80-90 ms more.

Aye... this is a lot. Also for me it turns out that unmapping the CoW view
at the last step takes most of the time.

Maybe the memory needs to be actually flushed to the file if no mapping exists. If that is the case, we could avoid that if we create another temporary mapping.

It might help to split the full heap into multiple views.

The current GC uses pools of max 32 MB, so that already exists.

Also using VirtualProtect during the first step - turning a mapping into
a CoW one - is faster than unmap/map (by a factor of 2).

One thing that may help is saving a pointer to the end of the used heap at
the moment of the scan, then remapping only this portion as COW.


The pool architecture already does this if scanning just ignores new pools during collection.
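As a rough illustration of ignoring pools created during a collection, here is a minimal sketch in C. The `Pool` and `Heap` types and all function names are hypothetical and are not the runtime's actual data structures; the only idea taken from the discussion is that the pool count is recorded when the scan starts and later pools are skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_POOLS 64

/* Hypothetical pool descriptor: one contiguous block of heap memory. */
typedef struct {
    uintptr_t base;  /* first address of the pool */
    size_t    size;  /* size of the pool in bytes */
} Pool;

/* Hypothetical heap: a table of pools plus a scan snapshot marker. */
typedef struct {
    Pool   pools[MAX_POOLS];
    size_t npools;      /* pools currently allocated */
    size_t scan_limit;  /* pools [0, scan_limit) existed when the scan began */
} Heap;

/* Append a pool; returns its index, or -1 if the table is full. */
int heap_add_pool(Heap *h, uintptr_t base, size_t size)
{
    if (h->npools == MAX_POOLS)
        return -1;
    h->pools[h->npools].base = base;
    h->pools[h->npools].size = size;
    return (int)h->npools++;
}

/* Record how many pools exist at the start of a collection. */
void heap_begin_scan(Heap *h)
{
    h->scan_limit = h->npools;
}

/* Nonzero if addr lies in a pool that existed when the scan began.
   Pools added afterwards are ignored by this collection. */
int heap_in_scanned_pool(const Heap *h, uintptr_t addr)
{
    assert(h->scan_limit <= h->npools);
    for (size_t i = 0; i < h->scan_limit; i++) {
        const Pool *p = &h->pools[i];
        if (addr >= p->base && addr - p->base < p->size)
            return 1;
    }
    return 0;
}
```

With this layout only the pools below `scan_limit` would need a COW view; anything allocated into a newer pool is simply not part of the running collection.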

Another optimization is to segregate the heap into memory with references and memory with plain data (NO_SCAN) at page/pool granularity. NO_SCAN-pages won't need COW. My hope is that this reduces necessary duplicate memory addresses considerably.
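To make the expected saving concrete, the following C sketch counts how many pages would still need COW protection once pages are classified at page granularity. The `PageKind` enumeration and `count_cow_pages` are assumptions made for illustration only; the actual GC keeps its own page and attribute tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-page classification. */
enum PageKind {
    PK_FREE    = 0,  /* unused page */
    PK_SCAN    = 1,  /* may contain references, must be scanned */
    PK_NO_SCAN = 2   /* plain data only, never scanned */
};

/* Count the pages that need a COW snapshot during a concurrent scan.
   Only pages that may hold references qualify; free and NO_SCAN pages
   are never read by the scanner, so they need no duplicate. */
size_t count_cow_pages(const uint8_t *kinds, size_t npages)
{
    size_t n = 0;
    assert(kinds != NULL || npages == 0);
    for (size_t i = 0; i < npages; i++) {
        if (kinds[i] == PK_SCAN)
            n++;
    }
    return n;
}
```

For a heap dominated by strings or numeric arrays, most pages would fall into `PK_NO_SCAN`, so the count returned here could be a small fraction of the total.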

Last issue I see is adjustment of pointers - in a GC, the mapped view is
mapped at a new address, so pointers would need to be fixed up during scanning.


Agreed, that's a slight additional cost for scanning, but I don't think it will be too difficult to implement.
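The fixup itself is plain address arithmetic. A minimal sketch in C, assuming each pool records both its original base address and the address of its snapshot view (the `MappedPool` type and `fixup_pointer` are hypothetical names, not part of any existing implementation):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record tying a pool to its snapshot view. */
typedef struct {
    uintptr_t base;  /* address the mutator uses for the pool */
    uintptr_t view;  /* address where the snapshot view is mapped */
    size_t    size;  /* size of the pool in bytes */
} MappedPool;

/* Translate a pointer value found during scanning into the matching
   address inside the snapshot view. Returns 0 if p does not point
   into any known pool. */
uintptr_t fixup_pointer(const MappedPool *pools, size_t npools, uintptr_t p)
{
    assert(pools != NULL || npools == 0);
    for (size_t i = 0; i < npools; i++) {
        const MappedPool *mp = &pools[i];
        if (p >= mp->base && p - mp->base < mp->size)
            return mp->view + (p - mp->base);  /* same offset, other mapping */
    }
    return 0;
}
```

The scanner already has to find the pool a candidate pointer belongs to, so the extra work per pointer is one subtraction and one addition on top of that lookup.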



The numbers are not great, but I guess the usual memory usage and number
of modified pages will be much lower. I'll see if I can integrate this
into the concurrent implementation.

Wish you luck, I'm still not sure if it will help.
