The suggestion apparently is to use "Event Tracing for Windows" (aka
ptrace/dtrace). Yes, that is faster (on linux too...), but misses the point
entirely of profiling user code.

the other offerings typically wrap StalkWalk64 (as we do), and complain
about how absurdly slow it is

we used to use RtlCaptureStackBackTrace, but it often failed to give useful
backtraces. I think it depends too heavily of walking the EBP pointer chain
(which typically doesn't exist on x86_64). As it happens, the remaining
suggestions fall into the category of "well, obviously you should just
write your own EBP (32-bit base pointer register) pointer chain walk
algorithm. here, I'll even write part of it for you" ... which would be
very helpful, if RBP (64-bit base pointer register) was used to make stack
frame chains (hint: it isn't). and these days, the EBP isn't used to make
stack pointer chains on x86 either.

llvm3.5 contains the ability to interpret COFF files, so you could
presumably write your own stack-walk algorithm. i don't recommend it,
however. you might have to call out to StalkWalk anyways to access the
microsoft symbol server (yes, off their network servers) to complete the
stalk walk symbol lookup correctly

On Wed Dec 03 2014 at 10:04:19 AM Tim Holy <[email protected]> wrote:

> Some potentially-interesting links (of which I understand very little):
> http://stackoverflow.com/questions/860602/recommended-open-
> source-profilers#comment2363112_1137133
> http://stackoverflow.com/questions/8406175/optimizing-stack-
> walking-performance
>
> I can tell from this comment:
> https://github.com/JuliaLang/julia/issues/2597#issuecomment-15159868
> that you already know about this (and its negatives):
> http://www.lenholgate.com/blog/2008/09/alternative-call-stac
> k-capturing.html
>
> --Tim
>
>
> On Wednesday, December 03, 2014 02:25:22 PM Jameson Nash wrote:
> > this stack overflow question indicates that there are two options (
> > http://stackoverflow.com/questions/153559/what-are-some-
> good-profilers-for-n
> > ative-c-on-windows )
> >
> > https://software.intel.com/sites/default/files/managed/cd/
> 92/Intel-VTune-Amp
> > lifierXE-2015-Product-Brief-072914.pdf ($900)
> > http://www.glowcode.com/summary.htm ($500)
> >
> >
> > On Wed Dec 03 2014 at 9:11:28 AM Stefan Karpinski <[email protected]>
> >
> > wrote:
> > > This seems nuts. There have to be good profilers on Windows – how do
> those
> > > work?
> > >
> > > On Tue, Dec 2, 2014 at 11:55 PM, Jameson Nash <[email protected]>
> wrote:
> > >> (I forgot to mention, that, to be fair, the windows machine that was
> used
> > >> to run this test was an underpowered dual-core hyperthreaded atom
> > >> processor, whereas the linux and mac machines were pretty comparable
> Xeon
> > >> and sandybridge machines, respectively. I only gave windows a factor
> of 2
> > >> advantage in the above computation in my accounting for this gap)
> > >>
> > >> On Tue Dec 02 2014 at 10:50:20 PM Tim Holy <[email protected]>
> wrote:
> > >>> Wow, those are pathetically-slow backtraces. Since most of us don't
> have
> > >>> machines with 500 cores, I don't see anything we can do.
> > >>>
> > >>> --Tim
> > >>>
> > >>> On Wednesday, December 03, 2014 03:14:02 AM Jameson Nash wrote:
> > >>> > you could copy the whole stack (typically only a few 100kb, max of
> > >>>
> > >>> 8MB),
> > >>>
> > >>> > then do the stack walk offline. if you could change the stack
> pages to
> > >>> > copy-on-write, it may even not be too expensive.
> > >>> >
> > >>> > but this is the real problem:
> > >>> >
> > >>> > ```
> > >>> >
> > >>> > |__/                   |  x86_64-linux-gnu
> > >>> >
> > >>> > julia> @time for i=1:10^4 backtrace() end
> > >>> > elapsed time: 2.789268693 seconds (3200320016 bytes allocated,
> 89.29%
> > >>>
> > >>> gc
> > >>>
> > >>> > time)
> > >>> > ```
> > >>> >
> > >>> > ```
> > >>> >
> > >>> > |__/                   |  x86_64-apple-darwin14.0.0
> > >>> >
> > >>> > julia> @time for i=1:10^4 backtrace() end
> > >>> > elapsed time: 2.586410216 seconds (6400480000 bytes allocated,
> 89.96%
> > >>>
> > >>> gc
> > >>>
> > >>> > time)
> > >>> > ```
> > >>> >
> > >>> > ```
> > >>> > jameson@julia:~/julia-win32$ ./usr/bin/julia.exe -E " @time for
> > >>>
> > >>> i=1:10^3
> > >>>
> > >>> > backtrace() end "
> > >>> > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol
> <0x00000000>
> > >>> > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol
> <0x00000000>
> > >>> > err:dbghelp_stabs:stabs_parse Unknown stab type 0x0a
> > >>> > elapsed time: 22.6314386 seconds (320032016 bytes allocated, 1.51%
> gc
> > >>>
> > >>> time)
> > >>>
> > >>> > ```
> > >>> >
> > >>> > ```
> > >>> >
> > >>> > |__/                   |  i686-w64-mingw32
> > >>> >
> > >>> > julia> @time for i=1:10^4 backtrace() end
> > >>> > elapsed time: 69.243275608 seconds (3200320800 bytes allocated,
> 13.16%
> > >>>
> > >>> gc
> > >>>
> > >>> > time)
> > >>> > ```
> > >>> >
> > >>> > And yes, those gc fractions are verifiably correct. With
> gc_disable(),
> > >>>
> > >>> they
> > >>>
> > >>> > execute in 1/10 of the time. So, that pretty much means you must
> take
> > >>>
> > >>> 1/100
> > >>>
> > >>> > of the samples if you want to preserve roughly the same slow down.
> On
> > >>> > linux, I find the slowdown to be in the range of 2-5x, and consider
> > >>>
> > >>> that to
> > >>>
> > >>> > be pretty reasonable, especially for what you're getting. If you
> took
> > >>>
> > >>> the
> > >>>
> > >>> > same number of samples on windows, it would cause a 200-500x
> slowdown
> > >>>
> > >>> (give
> > >>>
> > >>> > or take a few percent). If you wanted to offload this work to other
> > >>>
> > >>> cores
> > >>>
> > >>> > to get the same level of accuracy and no more slowdown than linux,
> you
> > >>> > would need a machine with 200-500 processors (give or take 2-5)!
> > >>> >
> > >>> > (I think I did those conversions correctly. However, since I just
> did
> > >>>
> > >>> them
> > >>>
> > >>> > for the purposes of this email, sans calculator, and as I was
> typing,
> > >>>
> > >>> let
> > >>>
> > >>> > me know if I made more than a factor of 2 error somewhere, or just
> > >>>
> > >>> have fun
> > >>>
> > >>> > reading https://what-if.xkcd.com/84/ instead)
> > >>> >
> > >>> > On Tue Dec 02 2014 at 6:23:07 PM Tim Holy <[email protected]>
> wrote:
> > >>> > > On Tuesday, December 02, 2014 10:24:43 PM Jameson Nash wrote:
> > >>> > > > You can't profile a moving target. The thread must be frozen
> first
> > >>>
> > >>> to
> > >>>
> > >>> > > > ensure the stack trace doesn't change while attempting to
> record
> > >>> > > > it
> > >>> > >
> > >>> > > Got it. I assume there's no good way to "make a copy and then
> > >>>
> > >>> discard if
> > >>>
> > >>> > > the
> > >>> > > copy is bad"?
> > >>> > >
> > >>> > > --Tim
> > >>> > >
> > >>> > > > On Tue, Dec 2, 2014 at 5:12 PM Tim Holy <[email protected]>
> > >>>
> > >>> wrote:
> > >>> > > > > If the work of walking the stack is done in the thread, why
> does
> > >>>
> > >>> it
> > >>>
> > >>> > > cause
> > >>> > >
> > >>> > > > > any
> > >>> > > > > slowdown of the main process?
> > >>> > > > >
> > >>> > > > > But of course the time it takes to complete the backtrace
> sets
> > >>> > > > > an
> > >>> > > > > upper
> > >>> > > > > limit
> > >>> > > > > on how frequently you can take a snapshot. In that case,
> though,
> > >>> > >
> > >>> > > couldn't
> > >>> > >
> > >>> > > > > you
> > >>> > > > > just have the thread always collecting backtraces?
> > >>> > > > >
> > >>> > > > > --Tim
> > >>> > > > >
> > >>> > > > > On Tuesday, December 02, 2014 09:54:17 PM Jameson Nash wrote:
> > >>> > > > > > That's essentially what we do now. (Minus the busy wait
> part).
> > >>>
> > >>> The
> > >>>
> > >>> > > > > overhead
> > >>> > > > >
> > >>> > > > > > is too high to run it any more frequently -- it already
> causes
> > >>>
> > >>> a
> > >>>
> > >>> > > > > > significant performance penalty on the system, even at the
> > >>> > > > > > much
> > >>> > > > > > lower
> > >>> > > > > > sample rate than linux. However, I suspect the truncated
> > >>>
> > >>> backtraces
> > >>>
> > >>> > > on
> > >>> > >
> > >>> > > > > > win32 were exaggerating the effect somewhat -- that should
> not
> > >>>
> > >>> be as
> > >>>
> > >>> > > > > > much
> > >>> > > > > > of an issue now.
> > >>> > > > > >
> > >>> > > > > > Sure, windows lets you snoop on (and modify) the address
> space
> > >>>
> > >>> of
> > >>>
> > >>> > > > > > any
> > >>> > > > > > process, you just need to find the right handle.
> > >>> > > > > >
> > >>> > > > > > On Tue, Dec 2, 2014 at 2:18 PM Tim Holy <
> [email protected]>
> > >>>
> > >>> wrote:
> > >>> > > > > > > On Windows, is there any chance that one could set up a
> > >>>
> > >>> separate
> > >>>
> > >>> > > > > > > thread
> > >>> > > > > > > for
> > >>> > > > > > > profiling and use busy-wait to do the timing? (I don't
> even
> > >>>
> > >>> know
> > >>>
> > >>> > > > > whether
> > >>> > > > >
> > >>> > > > > > > one
> > >>> > > > > > > thread can snoop on the execution state of another
> thread.)
> > >>> > > > > > >
> > >>> > > > > > > --Tim
> > >>> > > > > > >
> > >>> > > > > > > On Tuesday, December 02, 2014 06:22:39 PM Jameson Nash
> wrote:
> > >>> > > > > > > > Although, over thanksgiving, I pushed a number of fixes
> > >>>
> > >>> which
> > >>>
> > >>> > > should
> > >>> > >
> > >>> > > > > > > > improve the quality of backtraces on win32 (and make
> > >>>
> > >>> sys.dll
> > >>>
> > >>> > > usable
> > >>> > >
> > >>> > > > > > > there)
> > >>> > > > > > >
> > >>> > > > > > > > On Tue, Dec 2, 2014 at 1:20 PM Jameson Nash <
> > >>>
> > >>> [email protected]>
> > >>>
> > >>> > > > > wrote:
> > >>> > > > > > > > > Correct. Windows imposes a much higher overhead on
> just
> > >>>
> > >>> about
> > >>>
> > >>> > > > > > > > > every
> > >>> > > > > > >
> > >>> > > > > > > aspect
> > >>> > > > > > >
> > >>> > > > > > > > > of doing profiling. Unfortunately, there isn't much
> we
> > >>>
> > >>> can do
> > >>>
> > >>> > > > > > > > > about
> > >>> > > > > > >
> > >>> > > > > > > this,
> > >>> > > > > > >
> > >>> > > > > > > > > other then to complain to Microsoft. (It doesn't have
> > >>>
> > >>> signals,
> > >>>
> > >>> > > so
> > >>> > >
> > >>> > > > > we
> > >>> > > > >
> > >>> > > > > > > must
> > >>> > > > > > >
> > >>> > > > > > > > > emulate them with a separate thread. The accuracy of
> > >>>
> > >>> windows
> > >>>
> > >>> > > > > timers is
> > >>> > > > >
> > >>> > > > > > > > > somewhat questionable. And the stack walk library
> (for
> > >>> > >
> > >>> > > recording
> > >>> > >
> > >>> > > > > the
> > >>> > > > >
> > >>> > > > > > > > > backtrace) is apparently just badly written and
> > >>> > > > > > > > > therefore
> > >>> > >
> > >>> > > insanely
> > >>> > >
> > >>> > > > > > > > > slow
> > >>> > > > > > > > > and
> > >>> > > > > > > > > memory hungry.)
> > >>> > > > > > > > >
> > >>> > > > > > > > > On Tue, Dec 2, 2014 at 12:59 PM Tim Holy <
> > >>>
> > >>> [email protected]>
> > >>>
> > >>> > > > > wrote:
> > >>> > > > > > > > >> I think it's just that Windows is bad at scheduling
> > >>>
> > >>> tasks
> > >>>
> > >>> > > > > > > > >> with
> > >>> > > > > > > > >> short-latency,
> > >>> > > > > > > > >> high-precision timing, but I am not the right
> person to
> > >>> > > > > > > > >> answer
> > >>> > > > >
> > >>> > > > > such
> > >>> > > > >
> > >>> > > > > > > > >> questions.
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> --Tim
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> On Tuesday, December 02, 2014 09:57:28 AM Peter
> Simon
> > >>>
> > >>> wrote:
> > >>> > > > > > > > >> > I have also experienced the inaccurate profile
> > >>>
> > >>> timings on
> > >>>
> > >>> > > > > Windows.
> > >>> > > > >
> > >>> > > > > > > Is
> > >>> > > > > > >
> > >>> > > > > > > > >> the
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> > reason for the bad profiler performance on Windows
> > >>> > >
> > >>> > > understood?
> > >>> > >
> > >>> > > > > Are
> > >>> > > > >
> > >>> > > > > > > > >> there
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> > plans for improvement?
> > >>> > > > > > > > >> >
> > >>> > > > > > > > >> > Thanks,
> > >>> > > > > > > > >> > --Peter
> > >>> > > > > > > > >> >
> > >>> > > > > > > > >> > On Tuesday, December 2, 2014 3:57:16 AM UTC-8, Tim
> > >>>
> > >>> Holy
> > >>>
> > >>> > > wrote:
> > >>> > > > > > > > >> > > By default, the profiler takes one sample per
> > >>> > >
> > >>> > > millisecond. In
> > >>> > >
> > >>> > > > > > > > >> practice,
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> > > the
> > >>> > > > > > > > >> > > timing is quite precise on Linux, seemingly
> within
> > >>> > > > > > > > >> > > a
> > >>> > >
> > >>> > > factor
> > >>> > >
> > >>> > > > > > > > >> > > of
> > >>> > > > > > >
> > >>> > > > > > > twoish
> > >>> > > > > > >
> > >>> > > > > > > > >> on
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> > > OSX,
> > >>> > > > > > > > >> > > and nowhere close on Windows. So at least on
> Linux
> > >>>
> > >>> you
> > >>>
> > >>> > > > > > > > >> > > can
> > >>> > > > >
> > >>> > > > > simply
> > >>> > > > >
> > >>> > > > > > > > >> > > read
> > >>> > > > > > > > >> > > samples
> > >>> > > > > > > > >> > > as milliseconds.
> > >>> > > > > > > > >> > >
> > >>> > > > > > > > >> > > If you want to visualize the relative
> contributions
> > >>>
> > >>> of
> > >>>
> > >>> > > each
> > >>> > >
> > >>> > > > > > > > >> statement, I
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> > > highly recommend ProfileView. If you use
> > >>>
> > >>> LightTable, it's
> > >>>
> > >>> > > > > already
> > >>> > > > >
> > >>> > > > > > > > >> built-in
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> > > via
> > >>> > > > > > > > >> > > the profile() command. The combination of
> > >>>
> > >>> ProfileView and
> > >>>
> > >>> > > > > > > > >> > > @profile
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> is, in
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> > > my
> > >>> > > > > > > > >> > > (extremely biased) opinion, quite powerful
> compared
> > >>>
> > >>> to
> > >>>
> > >>> > > tools
> > >>> > >
> > >>> > > > > > > > >> > > I
> > >>> > > > > > >
> > >>> > > > > > > used
> > >>> > > > > > >
> > >>> > > > > > > > >> > > previously
> > >>> > > > > > > > >> > > in other programming environments.
> > >>> > > > > > > > >> > >
> > >>> > > > > > > > >> > > Finally, there's IProfile.jl, which works via a
> > >>> > > > > > > > >> > > completely
> > >>> > > > > > >
> > >>> > > > > > > different
> > >>> > > > > > >
> > >>> > > > > > > > >> > > mechanism
> > >>> > > > > > > > >> > > but does report raw timings (with some pretty
> big
> > >>> > >
> > >>> > > caveats).
> > >>> > >
> > >>> > > > > > > > >> > > Best,
> > >>> > > > > > > > >> > > --Tim
> > >>> > > > > > > > >> > >
> > >>> > > > > > > > >> > > On Monday, December 01, 2014 10:13:16 PM
> Christoph
> > >>>
> > >>> Ortner
> > >>>
> > >>> > > > > wrote:
> > >>> > > > > > > > >> > > > How do you get timings from the Julia
> profiler,
> > >>>
> > >>> or even
> > >>>
> > >>> > > > > better,
> > >>> > > > >
> > >>> > > > > > > > >> %-es? I
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> > > > guess one can convert from the numbers one
> gets,
> > >>>
> > >>> but it
> > >>>
> > >>> > > is
> > >>> > >
> > >>> > > > > > > > >> > > > a
> > >>> > > > > > > > >> > > > bit
> > >>> > > > > > > > >> > >
> > >>> > > > > > > > >> > > painful?
> > >>> > > > > > > > >> > >
> > >>> > > > > > > > >> > > > Christoph
>
>

Reply via email to