Re: chunkify profiling (was Re: Patch to couch_btree:chunkify)

Adam Kocoloski Thu, 14 May 2009 08:04:06 -0700

Hi Paul,

On May 13, 2009, at 4:01 PM, Paul Davis wrote:

Adam,

No worries about the delay. I'd agree that the first graph doesn't
really show much than *maybe* we can say the patch reduces the
variability a bit.

Agreed. The variance on the wallclock measurements is frustrating. Istarted a thread about CPU time profiling on erlang-questions, as itseems to be automatically disabled on Linux and not even implementedon OS X. I think wallclock time is generally the right way to go, butif we're testing two implementations of an algorithm that doesn'ttouch the disk then CPU time seems like the better metric to me.

On the second graph, I haven't the faintest why that'd be as such.
I'll have to try and setup fprof and see if I can figure out what
exactly is taking most of the time.

I should clean up and post the code I used to make thesemeasurements. I wanted to just profile the couch_db_updater process,so I disabled hovercraft's delete/create DB calls at the start oflightning() and did them in my test code (so the updater PID didn'tchange on me).

Here's the fprof analysis from a trunk run -- in this case I believeit was 10k docs @ 1k bulk:


http://friendpaste.com/16xwiZuqwWrqYXQeaBS0fx

There's a lot of detail there, but if you stare at it awhile a fewdetails start to emerge:


- 11218 ms total time spent in the emulator (line 13)

- 10028 ms in couch_db_updater:update_docs_int/3 and funs called byit (line 47)

    - 3682 ms in couch_btree:add_remove/3 (line 70)
    - 1941 ms in couch_btree:lookup/2 (line 104)
    - 1262 ms in couch_db_updater:flush_trees/3 (line 265)
    -  951 ms in couch_db_updater:merge_rev_trees/7 (line 322)
    -  910 ms in couch_db_updater:commit_data/2 (line 330)

and so on. Each of those numbers is an ACC that includes all the timespent in functions called by that function. The five functions Ilisted have basically zero overlap, so I'm not double-counting. Youcan certainly drill deeper and see which functions take the most timeinside add_remove/3, etc.

Perhaps we're looking at wrong
thing by reducing term_to_binary. You did say that most of the time
was spent in size/1 as opposed to term_to_binary the other day which
is hard to believe at best.

Agreed, that's pretty difficult to believe. Here's what I saw -- Idefined a function


chunkify_sizer(Elem, AccIn) ->
    Size = size(term_to_binary(Elem)),
    {{Elem, Size}, AccIn+Size}.

Here's the profile for that function

%                                               CNT       ACC       OWN

{[{{couch_btree,'-chunkify/1-fun-0-',2}, 22163, 268.312,154.927}],{ {couch_btree,chunkify_sizer,2}, 22163, 268.312,154.927}, %[{{erlang,term_to_binary,1}, 22163, 108.671,108.671},{garbage_collect, 9, 4.714,4.714}]}.

60% of the time is spent in the body of chunkify_sizer, and only 40%in term_to_binary. I've never seen size/1 show up explicitly in theprofile, but if I define a dummy wrapper


get_size(Bin) ->
    erlang:size(Bin).

then get_size/1 will show up with a large OWN time, so I conclude thatany time spent sizing a binary gets charged to OWN.

You know BERT better than I do -- you said the size of a binary isstored in its header, correct?

I'll put this on the weekend agenda. Until I can show that its
consistently faster I'll hold off.
For reference, when you say 2K docs in batches of 1K, did you mean200K?

No, I meant 2k (2 calls to _bulk_docs). 200k would have generated amulti-GB trace and I think fprof:profile() would have melted myMacBook processing it. YMMV ;-)

Also, note to self, we should check speeds for dealing with uuids too
to see if the non-ordered mode makes a difference.

Agreed. At the moment fprof seems much better suited to identifyinghot spots in code than comparing alternative implementations of afunction. Best thing I've come up with so far is comparing ratios oftime spent in the function (as in Figure 2 above).

Paul
On Wed, May 13, 2009 at 3:33 PM, Adam Kocoloski<[email protected]> wrote:
Sorry for the delay on this front. I ran hovercraft:lightning 20times eachwith and without Paul's patch. Each run inserted 2k docs inbatches of
1000.  Here are two plots showing the effect of the patch:

http://dl.getdropbox.com/u/237885/insert_rate.png
http://dl.getdropbox.com/u/237885/chunkify_fraction.png
The first plot histograms the insert rate for the two scenarios*.I don'treally see much of a difference. The second plot uses fprof toplot thefraction of time the couch_db_updater process spent in chunkify andanyfunctions called by chunkify. For those familiar with fprof, it'sthe ratioof ACC for couch_btree:chunkify/2 divided by OWN for the updaterpid. If
fprof is to be believed, the trunk code is almost 2x faster.

Adam
* the reason the insert rate is so low is because fprof isrunning. Turningoff the profiler speeds things up by an order of magnitude, inaccord with
the numbers Chris has posted.

Re: chunkify profiling (was Re: Patch to couch_btree:chunkify)

Reply via email to