Guys, just to summarize the discussion. There are several ways we can tweak tcmalloc:
1) decommit everything that is free; 2) keep spans in a mixed state (some pages committed, some not; coalescing neither commits nor decommits), which should address Jim's main argument; 3) commit on coalescing, but purge aggressively (as WebKit does: once every 5 seconds unless something else has been committed, or during idle pauses).

To my knowledge, performance-wise 1) is slower (how much slower we still need to learn) and 2) is slightly faster than 3) (though that might just be statistical noise). Of course, my benchmark is quite special. Memory-wise I think 2) and 3) with aggressive scavenging should be mostly the same: we could keep a higher number of committed pages than in 1), but only for short periods of time, and I'm not convinced that's a bad thing.

Overall I'm for 2) and 3), but I am definitely biased. What do you think?

And many thanks to Vitaly for the discussion.

yours,
anton.
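To make the three options concrete, here is a rough sketch of where they differ when a freed span is coalesced with a neighbor. The types and helpers are hypothetical stand-ins, not the actual tcmalloc code:

    #include <cstddef>

    // Hypothetical stand-ins for tcmalloc's span bookkeeping and for the
    // VirtualAlloc/VirtualFree-based page operations.
    struct Span {
      std::size_t pages;
      bool committed;
    };
    void CommitPages(Span&) {}
    void DecommitPages(Span&) {}
    void ScheduleScavenge(int /*delay_secs*/) {}

    enum class FreePolicy {
      kDecommitAllFree,  // 1) decommit everything that is free
      kMixedSpans,       // 2) coalesce only; pages keep their commit state
      kCommitAndPurge    // 3) commit on coalescing, purge aggressively later
    };

    Span Coalesce(const Span& freed, const Span& neighbor, FreePolicy policy) {
      Span merged{freed.pages + neighbor.pages, /*committed=*/false};
      switch (policy) {
        case FreePolicy::kDecommitAllFree:
          DecommitPages(merged);  // free memory goes straight back to the OS
          break;
        case FreePolicy::kMixedSpans:
          // Neither commit nor decommit: the merged span would track a mix
          // of committed and decommitted pages (bookkeeping not shown).
          break;
        case FreePolicy::kCommitAndPurge:
          CommitPages(merged);    // current behavior on coalescing...
          ScheduleScavenge(5);    // ...plus a WebKit-style periodic purge
          break;
      }
      return merged;
    }

Under option 2 the real work is in the bookkeeping that tracks which pages of a span are committed; a possible shape for that is sketched at the end of the thread.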
On Thu, Oct 1, 2009 at 3:56 AM, James Robinson <[email protected]> wrote:
> On Wed, Sep 30, 2009 at 2:28 PM, James Robinson <[email protected]> wrote:
>> On Wed, Sep 30, 2009 at 11:29 AM, Anton Muhin <[email protected]> wrote:
>>> On Wed, Sep 30, 2009 at 10:27 PM, Mike Belshe <[email protected]> wrote:
>>> > On Wed, Sep 30, 2009 at 11:24 AM, Anton Muhin <[email protected]> wrote:
>>> >> On Wed, Sep 30, 2009 at 10:17 PM, Mike Belshe <[email protected]> wrote:
>>> >> > On Wed, Sep 30, 2009 at 11:05 AM, Anton Muhin <[email protected]> wrote:
>>> >> >> On Wed, Sep 30, 2009 at 9:58 PM, Mike Belshe <[email protected]> wrote:
>>> >> >> > On Wed, Sep 30, 2009 at 10:48 AM, Anton Muhin <[email protected]> wrote:
>>> >> >> >> On Wed, Sep 30, 2009 at 9:39 PM, Jim Roskind <[email protected]> wrote:
>>> >> >> >> > If you're not interested in TCMalloc customization for Chromium, you should stop reading now.
>>> >> >> >> > This post is meant to gather some discussion on a topic before I code and land a change.
>>> >> >> >> >
>>> >> >> >> > MOTIVATION
>>> >> >> >> > We believe poor memory utilization is at the heart of a lot of jank problems. Such problems may be difficult to repro in short controlled benchmarks, but our users are telling us we have problems, so we know we have problems. As a result, we need to be more conservative in memory utilization and handling.
>>> >> >> >> >
>>> >> >> >> > SUMMARY OF CHANGE
>>> >> >> >> > I'm thinking of changing our TCMalloc so that when a span is freed into TCMalloc's free list, and it gets coalesced with an adjacent span that is already decommitted, the coalesced span should be entirely decommitted (as opposed to our current customized behavior of committing the entire span).
>>> >> >> >> > This proposed policy was put in place previously by Mike, but (reportedly) caused a 3-5% perf regression in V8. I believe AntonM changed that policy to what we have currently, where we always ensure full commitment of a coalesced span (regaining V8 performance on a benchmark).
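In code, the proposed rule amounts to roughly the following (hypothetical names and trivial stubs, not the actual patch):

    struct Span { bool committed; };
    bool IsDecommitted(const Span* s) { return !s->committed; }
    void CommitPages(Span* s)   { s->committed = true;  }  // e.g. VirtualAlloc(MEM_COMMIT)
    void DecommitPages(Span* s) { s->committed = false; }  // e.g. VirtualFree(MEM_DECOMMIT)
    Span* Merge(Span* a, Span* /*b*/) { return a; }        // bookkeeping only in this sketch

    Span* CoalesceOnFree(Span* freed, Span* neighbor) {
      Span* merged = Merge(freed, neighbor);
      if (IsDecommitted(neighbor)) {
        DecommitPages(merged);  // proposed: follow the decommitted neighbor
      } else {
        CommitPages(merged);    // current: always fully commit the merged span
      }
      return merged;
    }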
>>> >> >> >>
>>> >> >> >> The immediate question and plea. Question: how can we estimate the performance implications of the change? Yes, we have some internal benchmarks which could be used for that (they release memory heavily). Anything else?
>>> >> >> >>
>>> >> >> >> Plea: please, do not regress DOM performance unless there are really compelling reasons. And even in this case :)
>>> >> >> >
>>> >> >> > Anton -
>>> >> >> > All evidence from user complaints and bug reports is that Chrome uses too much memory. If you load Chrome on a 1GB system, you can feel it yourself. Unfortunately, we have yet to build a reliable swapping benchmark. By allowing tcmalloc to accumulate large chunks of unused pages, we increase the chance that paging will occur on the system. But because paging is a system-wide activity, it can hit our various processes in unpredictable ways - and this leads to jank. I think the jank is worse than the benchmark win. I wish we had a better way to quantify the damage caused by paging. Jim and others are working on that.
>>> >> >> > But it's clear to me that we're just being a memory pig for what is really a modest gain on a semi-obscure benchmark right now. Using the current algorithms, we have literally multi-hundred-megabyte memory usage swings in exchange for 3% on a benchmark. Don't you agree this is the wrong tradeoff? (The DOM benchmark grows to 500+MB right now; when you switch tabs it drops to <100MB.) Other pages have been witnessed which have similar behavior (loading the histograms page).
>>> >> >> > We may be able to put in some algorithms which are more aware of the currently available memory going forward, but I agree with Jim that there will be a lot of negative effects as long as we continue to have such large memory swings.
>>> >> >>
>>> >> >> Mike, I completely agree that we should reduce memory usage. On the other hand, speed has always been one of Chrome's trademarks. My feeling is that more committed pages in the free list make us faster (but yes, there is paging etc.). That's exactly why I asked for some way to quantify the quality of the different approaches, especially given the classic memory vs. speed dilemma; ideally (imho) both speed and memory usage should be considered.
>>> >> >
>>> >> > The team is working on benchmarks.
>>> >> > I think the evidence of paging is pretty overwhelming. Paging and jank are far worse than the small perf boost on DOM node creation. I don't believe the benchmark in question is a significant driver of primary performance. Do you?
>>> >>
>>> >> To some extent. Just to make it clear: I am not insisting; if the consensus is that we should trade DOM performance for reduced memory usage in this case, that's fine. I only want to have real numbers before we make any decision.
>>> >>
>>> >> @pkasting: it wasn't 3%, it was closer to 8% if memory serves.
>>> >
>>> > When I checked it in, my records show a 217 -> 210 benchmark drop, which is 3%.
>>>
>>> My numbers were substantially bigger, but anyway we need to remeasure it; there are too many factors.
>>
>> I did some measurements on my Windows machine comparing the current behavior (always commit spans when merging them together) with a very conservative alternative (always decommit spans on ::Delete, including the just-released one). The interesting bits are the benchmark scores and memory use at the end of the run.
>> For the DOM benchmark, the score regressed from an average over 4 runs of 188.25 to 185, which is <2%. The peak memory is about the same, but the memory committed by the tab at the end of the run decreased from an average of 642MB to 57MB, which is a 91% reduction. 4 runs probably isn't enough to make a definitive statement about the perf impact, but I think the memory impact is pretty clear. The memory characteristics of the V8 benchmark were unchanged, but the performance dropped from an average of 3009 to 2944, which is about 2%. SunSpider did not change at all in either memory or performance.
>
> Sorry, disregard those DOM numbers (I wasn't running the right test). I re-ran on Dromaeo's DOM Core test suite twice with and without the aggressive decommitting and the numbers are:
>
> r23768 unmodified:
> scores: 299.36 run/s, 302.47 run/s
> memory footprint of renderer at end of run: 333,648KB, 334,156KB
>
> r23768 with decommitting:
> scores: 296.06 run/s, 293.88 run/s
> memory footprint of renderer at end of run: 91,856KB, 68,208KB
>
> I think if the tradeoff is <2% perf against 3-5x memory use, it's better to get more conservative with our memory use first and then figure out how to earn back the perf impact without blowing the memory use sky-high again. I think it's pretty clear we don't need all 200MB of extra committed memory in order to do 3 more runs per second.
>
> - James
>>
>> - James
>>>
>>> yours,
>>> anton.
>>>
>>> >>
>>> >> And one thing I forgot: regarding the policy of decommitting spans in ::Delete. Please correct me if I'm wrong, but wouldn't that actually make all free spans decommitted? A span would only be committed while it is allocated, no? Decommitting only if one of the adjacent spans is decommitted may keep some spans committed, but it's difficult for me to say how often.
>>> >
>>> > Oh - more work is still needed, yes :-)
>>> >
>>> > Mike
>>> >>
>>> >> yours,
>>> >> anton.
>>> >> >
>>> >> > Mike
>>> >> >>
>>> >> >> yours,
>>> >> >> anton.
>>> >> >> >
>>> >> >> > Mike
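For reference, the conservative alternative James measured (and the one Anton's question above is about) comes down to something like this sketch, with hypothetical helper names:

    struct Span { bool committed = true; };
    Span* CoalesceWithNeighbors(Span* s) { return s; }     // bookkeeping only here
    void DecommitPages(Span* s) { s->committed = false; }  // e.g. VirtualFree(MEM_DECOMMIT)
    void InsertIntoFreeList(Span*) {}

    // Every delete decommits the whole merged span, including the pages just
    // released; as noted above, a span is then committed only while allocated.
    void DeleteSpan(Span* span) {
      Span* merged = CoalesceWithNeighbors(span);
      DecommitPages(merged);
      InsertIntoFreeList(merged);
    }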
>>> >> >> >> > WHY CHANGE?
>>> >> >> >> > The problematic scenario I'm anticipating (and which may currently be burning us) is:
>>> >> >> >> > a) A (renderer) process allocates a lot of memory, and reaches a significant high-water mark of memory used.
>>> >> >> >> > b) The process deallocates a lot of memory, and it flows into the TCMalloc free list. [We still have a lot of memory attributed to that process, and the app as a whole shows as using that memory.]
>>> >> >> >> > c) We eventually decide to decommit a lot of our free memory. Currently this happens when we switch away from a tab. [This saves us from further swapping out the unused memory.]
>>> >> >> >> > Now comes the evil problem.
>>> >> >> >> > d) We return to the tab, which has a giant free list of spans, most of which are decommitted. [The good news is that the memory is still decommitted.]
>>> >> >> >> > e) We allocate a block of memory, such as a 32k chunk. This memory is pulled from a decommitted span, and ONLY the allocated chunk is committed. [That sounds good.]
>>> >> >> >> > f) We free the block of memory from (e). Whatever span is adjacent to that block is committed <potential oops>. Hence, if we took (e) from a 200MB span, the act of freeing (e) will cause a 200MB commitment!?! This in turn would not only require touching (and having VirtualAlloc clear to zero) all allocated memory in the large span, it will also immediately put memory pressure on the OS, and force as much as 200MB of other apps to be swapped out to disk :-(.
>>> >> >> >>
>>> >> >> >> I'm not sure about swapping unless you touch those now-committed pages, but only experiment will tell.
>>> >> >> >>
>>> >> >> >> > I'm wary that our recent fix that allows spans to be (correctly) coalesced independent of their size will make it easier to coalesce spans. Worse yet, as we proceed to further optimize TCMalloc, one measure of success will be that the list of spans is fragmented less and less, and we'll have larger and larger coalesced singular spans. Any large "reserved" but not "committed" span will be a jank time-bomb waiting to blow up if the process ever allocates/frees from such a large span :-(.
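Concretely, with the current commit-on-coalesce policy, steps (e)-(f) above play out roughly like this (hypothetical sizes and stand-in helpers, not the real allocation path):

    #include <cstddef>

    struct Span { std::size_t bytes; bool committed; };
    void* AllocateFrom(Span&, std::size_t) { return nullptr; }  // commits only the chunk
    void CommitPages(Span& s) { s.committed = true; }           // commits the whole span

    void JankTimeBomb() {
      Span big{200u << 20, /*committed=*/false};  // (d) 200MB free span, decommitted
      void* p = AllocateFrom(big, 32u << 10);     // (e) commits only the 32k chunk: fine
      (void)p;
      // (f) freeing the chunk coalesces it with `big`; the current policy then
      // commits the entire merged span:
      CommitPages(big);  // ~200MB committed (and zeroed) at once: paging and jank
    }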
>>> >> >> >> >
>>> >> >> >> > WHAT IS THE PLAN GOING FORWARD (or how can we do better, regain performance, etc.)
>>> >> >> >> > We have at least the following plausible alternative ways to move forward with TCMalloc. The overall goal is to avoid wasteful decommits, and at the same time avoid heap-wide flailing between minimal and maximal span-commitment states.
>>> >> >> >> > Each free span is currently the maximal contiguous region of memory that TCMalloc is controlling but that has been deallocated. Currently spans have to be totally committed or totally decommitted. There is no mixture supported.
>>> >> >> >> > a) We could re-architect the span handling to allow spans to be combinations of committed and decommitted regions.
>>> >> >> >> > b) We could vary our policy on what to do with a coalesced span, based on span size and memory pressure. For example: we could consistently monitor the in-use vs. free-but-committed ratio, and try to stay in some "acceptable" region by varying our policy.
>>> >> >> >> > c) We could actually return to the OS some portions of spans that we have decommitted. We could then let the OS give us back these regions if we need memory. Until we get them back, we would not be at risk of doing unnecessary commits. Decisions about when to return memory to the OS can be made based on span size and memory pressure.
>>> >> >> >> > d) We can change the interval and forcing function for decommitting spans that are in our free list.
>>> >> >> >> > In each of the above cases, we need benchmark data on user-class machines to show the costs of these changes. Until we understand the memory impact, we need to move forward conservatively in our actions, and be vigilant for thrashing scenarios.
>>> >> >> >> >
>>> >> >> >> > Comments??
>>> >> >> >>
>>> >> >> >> As a close attempt, you may have a look at http://codereview.chromium.org/256013/show
>>> >> >> >>
>>> >> >> >> It allows spans with a mix of committed/decommitted pages (but only in the returned list), as committing seems to work fine even if some pages are already committed.
>>> >> >> >>
>>> >> >> >> That has some minor performance benefit, but I haven't investigated it in detail yet.
>>> >> >> >>
>>> >> >> >> just my 2 cents,
>>> >> >> >> anton.
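For illustration, the mixed-state idea behind that CL (and behind alternative (a) above) might look something like this sketch; the layout is hypothetical, and the actual CL keeps the mix only in the returned list:

    #include <cstddef>
    #include <vector>

    // A span that tracks per-page commit state, so coalescing is pure
    // bookkeeping and a later commit touches only the pages that need it.
    struct MixedSpan {
      std::vector<bool> committed;  // one flag per page

      void MergeRight(const MixedSpan& next) {
        // No commit or decommit here: each page keeps its own state.
        committed.insert(committed.end(), next.committed.begin(),
                         next.committed.end());
      }

      void EnsureCommitted(std::size_t first, std::size_t count) {
        for (std::size_t i = first; i < first + count; ++i) {
          if (!committed[i]) {
            // A real CommitPage(i) would go here, e.g. VirtualAlloc(MEM_COMMIT).
            committed[i] = true;
          }
        }
      }
    };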
