On Wed, Feb 04, 2015 at 10:15:37AM -0800, Jason Evans wrote:
> On Feb 3, 2015, at 4:40 PM, Mike Hommey <m...@glandium.org> wrote:
> > On Tue, Feb 03, 2015 at 04:19:00PM -0800, Jason Evans wrote:
> >> On Feb 3, 2015, at 2:51 PM, Mike Hommey <m...@glandium.org> wrote:
> >>> I've been tracking a startup time regression in Firefox for
> >>> Android when we tried to switch from mozjemalloc (memory
> >>> refresher: it's derived from jemalloc 0.9) to mostly current
> >>> jemalloc dev.
> >>>
> >>> It turned out to be https://github.com/jemalloc/jemalloc/pull/192
> >>
> >> I intentionally removed the functionality #192 adds back (in
> >> e3d13060c8a04f08764b16b003169eb205fa09eb), but apparently forgot to
> >> update the documentation. Do you have an understanding of why it's
> >> hurting performance so much?
> >
> > My understanding is that the huge increase in page faults is making
> > the difference. On Firefox startup we go from 50k page faults to 35k
> > with that patch. I can surely double check whether it's really the
> > page faults, or if it's actually the madvising itself that causes
> > the regression. Or both.
> >
> >> Is there any chance you can make your test case available so I can
> >> dig in further?
> >
> > https://gist.githubusercontent.com/glandium/a42d0265e324688cafc4/raw/gistfile1.c
>
> I added some logging and determined that ~90% of the dirty page
> purging is happening in the first 2% of the allocation trace. This
> appears to be almost entirely due to repeated 32 KiB
> allocation/deallocation.
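(For reference, that pattern boils down to something like the toy loop
below; it's not the actual trace from the gist, just a minimal sketch,
and counting minor faults with getrusage() is only a rough proxy for
what the purging costs. Run it against the jemalloc build under test,
e.g. LD_PRELOADed, with and without #192.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    long before = minor_faults();

    /* Back-to-back 32 KiB allocations, the pattern that dominates the
     * start of the trace.  Early in startup the heap is small, so
     * without a minimum dirty page threshold each free() can push the
     * arena over its dirty:active ratio; the pages then get madvise()d
     * away and have to be faulted back in by the next malloc(). */
    for (int i = 0; i < 100000; i++) {
        char *p = malloc(32 * 1024);
        if (!p)
            return 1;
        memset(p, 0, 32 * 1024); /* dirty the pages */
        free(p);
    }

    printf("minor faults: %ld\n", minor_faults() - before);
    return 0;
}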
So, interestingly, this appears to be a bug that was intended to have
been fixed, but wasn't (the repeated allocation/deallocation of 32 KiB
buffers). Fixing it still leaves a big difference in the number of page
faults, although a smaller one than before, and with that fix in place
the dirty page purging threshold patch seems to have less impact than
it did. I haven't analyzed these builds further yet, so I can't tell
much more at the moment.

> I still have vague plans to add time-based hysteresis mechanisms so
> that #192 isn't necessary, but until then, #192 it is.

Sadly, #192 also makes the RSS footprint bigger when more than one
arena is in use. With 4 cores, and therefore 16 arenas, and the default
4MB chunks, that's 64MB of memory that won't be purged. It's not a
problem for us because we use 1MB chunks and 1 arena, but I can see it
being a problem with the default settings.

FWIW, I also tried removing all the bin mutexes and making the bins use
the arena mutex instead, and, counter-intuitively, it made things
faster. Not by a very significant margin, but it's interesting that the
combined synchronization overhead of n locks can be worse than that of
1 lock with more contention.

IOW, I'm still searching for what's wrong :(

Mike
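P.S. For anyone who wants to approximate our 1MB chunks / 1 arena setup
on a stock jemalloc rather than on our build, something like the
following should be close (a sketch based on the opt.narenas and
opt.lg_chunk options; I haven't double-checked it against what we
actually ship):

/* jemalloc also reads this application-provided string, in addition to
 * the MALLOC_CONF environment variable.  lg_chunk:20 means 2^20 = 1MiB
 * chunks; narenas:1 forces a single arena instead of the default
 * 4 * ncpus. */
const char *malloc_conf = "narenas:1,lg_chunk:20";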