Hello,
This week I worked on a heuristic that goes like this: always schedule a process on the CPU closest to the one it last ran on. That is, try the CPU it ran on before; if that CPU isn't free, try its siblings at the current level (say, the thread level); if no sibling is free, go up to the next level (the core level), and so on. In other words, the process is scheduled as close as possible to its warm cache: if it can't run on the same core it loses the L1 cache hotness, but it can still benefit from the L3 cache hotness.

Unfortunately, I couldn't find a case where this makes a difference on my Core i3. The likely reason is that the time quantum in DragonFly is long enough for a scheduled process to fill a large part of the L1/L2 cache, so by the next context switch the previous process's cached data has effectively been evicted. And since the L3 cache is shared among all cores, there is nothing to gain at that level either. I am waiting for access to a multi-socket machine so I can test it there.
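The lookup order of the heuristic can be sketched roughly as follows. This is only an illustration in Python, not the actual scheduler code: the function name closest_free_cpu, the TOPO table and the three-level socket/core/thread layout are all hypothetical, and ties are broken by lowest CPU id just to keep the example deterministic.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical topology description: cpu id -> (socket, core, thread).
Topology = Dict[int, Tuple[int, int, int]]


def closest_free_cpu(last_cpu: int, free: Set[int], topo: Topology) -> Optional[int]:
    """Return the free CPU that shares the deepest topology level with last_cpu."""
    if last_cpu in free:
        return last_cpu  # same CPU: L1/L2 may still be warm
    socket, core, _ = topo[last_cpu]
    # Widen the search one level at a time: thread siblings first,
    # then the rest of the socket, then the whole machine.
    levels = (
        lambda c: topo[c][:2] == (socket, core),  # same core
        lambda c: topo[c][0] == socket,           # same socket (shared L3)
        lambda c: True,                           # any CPU
    )
    for same_level in levels:
        candidates = sorted(c for c in free if same_level(c))
        if candidates:
            return candidates[0]  # lowest id, arbitrary tie-break
    return None  # nothing is free


# Assumed example machine: 2 sockets x 2 cores x 2 threads, CPUs 0..7.
TOPO: Topology = {
    cpu: (cpu // 4, (cpu // 2) % 2, cpu % 2) for cpu in range(8)
}
```

For example, with this table a process that last ran on CPU 0 would be placed on CPU 1 (its thread sibling) if CPU 0 is busy, on CPU 2 or 3 if the whole core is busy, and on the other socket only as a last resort.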