One other thing you might do, and I hesitate to recommend it because the scenario is still decidedly difficult to reproduce, is to take the min over 100 trials of the max of 10000 timings, or something like that.
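A minimal sketch of that min-of-max measurement, in Python purely for illustration. The names `worst_of`, `min_of_max`, and `workload` are hypothetical, and `workload` is only a stand-in for whatever allocation-heavy code you actually care about; the 100 and 10000 are the same "or something like that" numbers as above.

```python
import time

def worst_of(op, n_samples, clock=time.perf_counter):
    """Return the largest single-call latency of op() over n_samples calls."""
    worst = 0.0
    for _ in range(n_samples):
        t0 = clock()
        op()
        dt = clock() - t0
        if dt > worst:
            worst = dt
    return worst

def min_of_max(op, n_trials=100, n_samples=10_000, clock=time.perf_counter):
    """Return the smallest per-trial worst case across n_trials trials."""
    return min(worst_of(op, n_samples, clock) for _ in range(n_trials))

def workload():
    # Hypothetical stand-in for the allocation-heavy code being measured.
    return [0] * 256

if __name__ == "__main__":
    result = min_of_max(workload)
    print(f"min over 100 trials of max over 10000 calls: {result * 1e6:.2f} us")
```

The inner loop keeps only a running maximum rather than storing every sample, so the harness itself adds as little allocation as possible to what is being timed. The `clock` parameter is there only so the reduction can be checked with a fake clock.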
The trick there is making sure the "time of interest" for that max of 10000 is short enough that any one trial has a > 1% chance of sampling no extraneous, unwanted system activity at all, so that the min over 100 trials can filter it out. There is opposing pressure to make that time long, such as being able to probe L3/DIMM-resident behavior. You might not be able to get even a 1% chance of a quiescent system for the duration of one 10000-sample trial.
On Unix, you could shut down the window system and all the services and, say, run in single-user mode. Maybe even unplug the network cable, and definitely don't make any keystrokes after you press ENTER.
Presumably what you are after is the "worst-case-ness of just GC-related things". That might be a start toward actually measuring it instead of just "pondering". I'm not sure it's much better than pondering, though, and it definitely deserves fair warning to would-be reproducers.
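To put a rough number on the 1% remark: if you assume each trial is independently quiescent with probability p (a simplifying assumption of mine; real system noise is often bursty and correlated), then the chance that at least one of n trials is clean is 1 - (1 - p)^n. A small Python sketch, with hypothetical helper names:

```python
import math

def p_at_least_one_clean(p_clean, n_trials):
    """Chance that at least one of n_trials independent trials is quiescent."""
    return 1.0 - (1.0 - p_clean) ** n_trials

def trials_needed(p_clean, target):
    """Smallest trial count giving at least `target` chance of one clean trial.

    Assumes 0 < p_clean < 1 and 0 < target < 1.
    """
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_clean))

print(round(p_at_least_one_clean(0.01, 100), 3))  # 0.634
print(trials_needed(0.01, 0.95))                  # 299
```

So at exactly 1% per trial, 100 trials give only about a 63% chance of even one clean trial under this model, which is part of why the result is hard for someone else to reproduce.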
