Yes, probably (I thought we already had). Someone would need to do some
comparison work though first.
On Wed, Dec 3, 2014 at 10:57 AM Tim Holy <[email protected]> wrote:

> That's a pretty serious bummer. I can't believe anybody puts up with this.
>
> Should we change the default initialization
> https://github.com/JuliaLang/julia/blob/b1c99af9bdeef22e0999b283885977
> 57541e2cc7/base/profile.jl#L44
> so that, on Windows, it fires every 100ms or so? And add a note to the
> Profile
> docs?
>
> --Tim
>
> On Wednesday, December 03, 2014 03:37:59 PM Jameson Nash wrote:
> > The suggestion apparently is to use "Event Tracing for Windows" (aka
> > ptrace/dtrace). Yes, that is faster (on linux too...), but misses the
> point
> > entirely of profiling user code.
> >
> > the other offerings typically wrap StalkWalk64 (as we do), and complain
> > about how absurdly slow it is
> >
> > we used to use RtlCaptureStackBackTrace, but it often failed to give
> useful
> > backtraces. I think it depends too heavily of walking the EBP pointer
> chain
> > (which typically doesn't exist on x86_64). As it happens, the remaining
> > suggestions fall into the category of "well, obviously you should just
> > write your own EBP (32-bit base pointer register) pointer chain walk
> > algorithm. here, I'll even write part of it for you" ... which would be
> > very helpful, if RBP (64-bit base pointer register) was used to make
> stack
> > frame chains (hint: it isn't). and these days, the EBP isn't used to make
> > stack pointer chains on x86 either.
> >
> > llvm3.5 contains the ability to interpret COFF files, so you could
> > presumably write your own stack-walk algorithm. i don't recommend it,
> > however. you might have to call out to StalkWalk anyways to access the
> > microsoft symbol server (yes, off their network servers) to complete the
> > stalk walk symbol lookup correctly
> >
> > On Wed Dec 03 2014 at 10:04:19 AM Tim Holy <[email protected]> wrote:
> > > Some potentially-interesting links (of which I understand very little):
> > > http://stackoverflow.com/questions/860602/recommended-open-> >
> source-profilers#comment2363112_1137133
> > > http://stackoverflow.com/questions/8406175/optimizing-stack-> >
> walking-performance
> > >
> > > I can tell from this comment:
> > > https://github.com/JuliaLang/julia/issues/2597#issuecomment-15159868
> > > that you already know about this (and its negatives):
> > > http://www.lenholgate.com/blog/2008/09/alternative-call-stac
> > > k-capturing.html
> > >
> > > --Tim
> > >
> > > On Wednesday, December 03, 2014 02:25:22 PM Jameson Nash wrote:
> > > > this stack overflow question indicates that there are two options (
> > > > http://stackoverflow.com/questions/153559/what-are-some-> >
> > > good-profilers-for-n
> > >
> > > > ative-c-on-windows )
> > > >
> > > > https://software.intel.com/sites/default/files/managed/cd/
> > >
> > > 92/Intel-VTune-Amp
> > >
> > > > lifierXE-2015-Product-Brief-072914.pdf ($900)
> > > > http://www.glowcode.com/summary.htm ($500)
> > > >
> > > >
> > > > On Wed Dec 03 2014 at 9:11:28 AM Stefan Karpinski <
> [email protected]>
> > > >
> > > > wrote:
> > > > > This seems nuts. There have to be good profilers on Windows – how
> do
> > >
> > > those
> > >
> > > > > work?
> > > > >
> > > > > On Tue, Dec 2, 2014 at 11:55 PM, Jameson Nash <[email protected]>
> > >
> > > wrote:
> > > > >> (I forgot to mention, that, to be fair, the windows machine that
> was
> > >
> > > used
> > >
> > > > >> to run this test was an underpowered dual-core hyperthreaded atom
> > > > >> processor, whereas the linux and mac machines were pretty
> comparable
> > >
> > > Xeon
> > >
> > > > >> and sandybridge machines, respectively. I only gave windows a
> factor
> > >
> > > of 2
> > >
> > > > >> advantage in the above computation in my accounting for this gap)
> > > > >>
> > > > >> On Tue Dec 02 2014 at 10:50:20 PM Tim Holy <[email protected]>
> > >
> > > wrote:
> > > > >>> Wow, those are pathetically-slow backtraces. Since most of us
> don't
> > >
> > > have
> > >
> > > > >>> machines with 500 cores, I don't see anything we can do.
> > > > >>>
> > > > >>> --Tim
> > > > >>>
> > > > >>> On Wednesday, December 03, 2014 03:14:02 AM Jameson Nash wrote:
> > > > >>> > you could copy the whole stack (typically only a few 100kb,
> max of
> > > > >>>
> > > > >>> 8MB),
> > > > >>>
> > > > >>> > then do the stack walk offline. if you could change the stack
> > >
> > > pages to
> > >
> > > > >>> > copy-on-write, it may even not be too expensive.
> > > > >>> >
> > > > >>> > but this is the real problem:
> > > > >>> >
> > > > >>> > ```
> > > > >>> >
> > > > >>> > |__/                   |  x86_64-linux-gnu
> > > > >>> >
> > > > >>> > julia> @time for i=1:10^4 backtrace() end
> > > > >>> > elapsed time: 2.789268693 seconds (3200320016 bytes allocated,
> > >
> > > 89.29%
> > >
> > > > >>> gc
> > > > >>>
> > > > >>> > time)
> > > > >>> > ```
> > > > >>> >
> > > > >>> > ```
> > > > >>> >
> > > > >>> > |__/                   |  x86_64-apple-darwin14.0.0
> > > > >>> >
> > > > >>> > julia> @time for i=1:10^4 backtrace() end
> > > > >>> > elapsed time: 2.586410216 seconds (6400480000 bytes allocated,
> > >
> > > 89.96%
> > >
> > > > >>> gc
> > > > >>>
> > > > >>> > time)
> > > > >>> > ```
> > > > >>> >
> > > > >>> > ```
> > > > >>> > jameson@julia:~/julia-win32$ ./usr/bin/julia.exe -E " @time
> for
> > > > >>>
> > > > >>> i=1:10^3
> > > > >>>
> > > > >>> > backtrace() end "
> > > > >>> > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol
> > >
> > > <0x00000000>
> > >
> > > > >>> > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol
> > >
> > > <0x00000000>
> > >
> > > > >>> > err:dbghelp_stabs:stabs_parse Unknown stab type 0x0a
> > > > >>> > elapsed time: 22.6314386 seconds (320032016 bytes allocated,
> 1.51%
> > >
> > > gc
> > >
> > > > >>> time)
> > > > >>>
> > > > >>> > ```
> > > > >>> >
> > > > >>> > ```
> > > > >>> >
> > > > >>> > |__/                   |  i686-w64-mingw32
> > > > >>> >
> > > > >>> > julia> @time for i=1:10^4 backtrace() end
> > > > >>> > elapsed time: 69.243275608 seconds (3200320800 bytes allocated,
> > >
> > > 13.16%
> > >
> > > > >>> gc
> > > > >>>
> > > > >>> > time)
> > > > >>> > ```
> > > > >>> >
> > > > >>> > And yes, those gc fractions are verifiably correct. With
> > >
> > > gc_disable(),
> > >
> > > > >>> they
> > > > >>>
> > > > >>> > execute in 1/10 of the time. So, that pretty much means you
> must
> > >
> > > take
> > >
> > > > >>> 1/100
> > > > >>>
> > > > >>> > of the samples if you want to preserve roughly the same slow
> down.
> > >
> > > On
> > >
> > > > >>> > linux, I find the slowdown to be in the range of 2-5x, and
> > > > >>> > consider
> > > > >>>
> > > > >>> that to
> > > > >>>
> > > > >>> > be pretty reasonable, especially for what you're getting. If
> you
> > >
> > > took
> > >
> > > > >>> the
> > > > >>>
> > > > >>> > same number of samples on windows, it would cause a 200-500x
> > >
> > > slowdown
> > >
> > > > >>> (give
> > > > >>>
> > > > >>> > or take a few percent). If you wanted to offload this work to
> > > > >>> > other
> > > > >>>
> > > > >>> cores
> > > > >>>
> > > > >>> > to get the same level of accuracy and no more slowdown than
> linux,
> > >
> > > you
> > >
> > > > >>> > would need a machine with 200-500 processors (give or take
> 2-5)!
> > > > >>> >
> > > > >>> > (I think I did those conversions correctly. However, since I
> just
> > >
> > > did
> > >
> > > > >>> them
> > > > >>>
> > > > >>> > for the purposes of this email, sans calculator, and as I was
> > >
> > > typing,
> > >
> > > > >>> let
> > > > >>>
> > > > >>> > me know if I made more than a factor of 2 error somewhere, or
> just
> > > > >>>
> > > > >>> have fun
> > > > >>>
> > > > >>> > reading https://what-if.xkcd.com/84/ instead)
> > > > >>> >
> > > > >>> > On Tue Dec 02 2014 at 6:23:07 PM Tim Holy <[email protected]>
> > >
> > > wrote:
> > > > >>> > > On Tuesday, December 02, 2014 10:24:43 PM Jameson Nash wrote:
> > > > >>> > > > You can't profile a moving target. The thread must be
> frozen
> > >
> > > first
> > >
> > > > >>> to
> > > > >>>
> > > > >>> > > > ensure the stack trace doesn't change while attempting to
> > >
> > > record
> > >
> > > > >>> > > > it
> > > > >>> > >
> > > > >>> > > Got it. I assume there's no good way to "make a copy and then
> > > > >>>
> > > > >>> discard if
> > > > >>>
> > > > >>> > > the
> > > > >>> > > copy is bad"?
> > > > >>> > >
> > > > >>> > > --Tim
> > > > >>> > >
> > > > >>> > > > On Tue, Dec 2, 2014 at 5:12 PM Tim Holy <
> [email protected]>
> > > > >>>
> > > > >>> wrote:
> > > > >>> > > > > If the work of walking the stack is done in the thread,
> why
> > >
> > > does
> > >
> > > > >>> it
> > > > >>>
> > > > >>> > > cause
> > > > >>> > >
> > > > >>> > > > > any
> > > > >>> > > > > slowdown of the main process?
> > > > >>> > > > >
> > > > >>> > > > > But of course the time it takes to complete the backtrace
> > >
> > > sets
> > >
> > > > >>> > > > > an
> > > > >>> > > > > upper
> > > > >>> > > > > limit
> > > > >>> > > > > on how frequently you can take a snapshot. In that case,
> > >
> > > though,
> > >
> > > > >>> > > couldn't
> > > > >>> > >
> > > > >>> > > > > you
> > > > >>> > > > > just have the thread always collecting backtraces?
> > > > >>> > > > >
> > > > >>> > > > > --Tim
> > > > >>> > > > >
> > > > >>> > > > > On Tuesday, December 02, 2014 09:54:17 PM Jameson Nash
> wrote:
> > > > >>> > > > > > That's essentially what we do now. (Minus the busy wait
> > >
> > > part).
> > >
> > > > >>> The
> > > > >>>
> > > > >>> > > > > overhead
> > > > >>> > > > >
> > > > >>> > > > > > is too high to run it any more frequently -- it already
> > >
> > > causes
> > >
> > > > >>> a
> > > > >>>
> > > > >>> > > > > > significant performance penalty on the system, even at
> the
> > > > >>> > > > > > much
> > > > >>> > > > > > lower
> > > > >>> > > > > > sample rate than linux. However, I suspect the
> truncated
> > > > >>>
> > > > >>> backtraces
> > > > >>>
> > > > >>> > > on
> > > > >>> > >
> > > > >>> > > > > > win32 were exaggerating the effect somewhat -- that
> should
> > >
> > > not
> > >
> > > > >>> be as
> > > > >>>
> > > > >>> > > > > > much
> > > > >>> > > > > > of an issue now.
> > > > >>> > > > > >
> > > > >>> > > > > > Sure, windows lets you snoop on (and modify) the
> address
> > >
> > > space
> > >
> > > > >>> of
> > > > >>>
> > > > >>> > > > > > any
> > > > >>> > > > > > process, you just need to find the right handle.
> > > > >>> > > > > >
> > > > >>> > > > > > On Tue, Dec 2, 2014 at 2:18 PM Tim Holy <
> > >
> > > [email protected]>
> > >
> > > > >>> wrote:
> > > > >>> > > > > > > On Windows, is there any chance that one could set
> up a
> > > > >>>
> > > > >>> separate
> > > > >>>
> > > > >>> > > > > > > thread
> > > > >>> > > > > > > for
> > > > >>> > > > > > > profiling and use busy-wait to do the timing? (I
> don't
> > >
> > > even
> > >
> > > > >>> know
> > > > >>>
> > > > >>> > > > > whether
> > > > >>> > > > >
> > > > >>> > > > > > > one
> > > > >>> > > > > > > thread can snoop on the execution state of another
> > >
> > > thread.)
> > >
> > > > >>> > > > > > > --Tim
> > > > >>> > > > > > >
> > > > >>> > > > > > > On Tuesday, December 02, 2014 06:22:39 PM Jameson
> Nash
> > >
> > > wrote:
> > > > >>> > > > > > > > Although, over thanksgiving, I pushed a number of
> > > > >>> > > > > > > > fixes
> > > > >>>
> > > > >>> which
> > > > >>>
> > > > >>> > > should
> > > > >>> > >
> > > > >>> > > > > > > > improve the quality of backtraces on win32 (and
> make
> > > > >>>
> > > > >>> sys.dll
> > > > >>>
> > > > >>> > > usable
> > > > >>> > >
> > > > >>> > > > > > > there)
> > > > >>> > > > > > >
> > > > >>> > > > > > > > On Tue, Dec 2, 2014 at 1:20 PM Jameson Nash <
> > > > >>>
> > > > >>> [email protected]>
> > > > >>>
> > > > >>> > > > > wrote:
> > > > >>> > > > > > > > > Correct. Windows imposes a much higher overhead
> on
> > >
> > > just
> > >
> > > > >>> about
> > > > >>>
> > > > >>> > > > > > > > > every
> > > > >>> > > > > > >
> > > > >>> > > > > > > aspect
> > > > >>> > > > > > >
> > > > >>> > > > > > > > > of doing profiling. Unfortunately, there isn't
> much
> > >
> > > we
> > >
> > > > >>> can do
> > > > >>>
> > > > >>> > > > > > > > > about
> > > > >>> > > > > > >
> > > > >>> > > > > > > this,
> > > > >>> > > > > > >
> > > > >>> > > > > > > > > other then to complain to Microsoft. (It doesn't
> > > > >>> > > > > > > > > have
> > > > >>>
> > > > >>> signals,
> > > > >>>
> > > > >>> > > so
> > > > >>> > >
> > > > >>> > > > > we
> > > > >>> > > > >
> > > > >>> > > > > > > must
> > > > >>> > > > > > >
> > > > >>> > > > > > > > > emulate them with a separate thread. The
> accuracy of
> > > > >>>
> > > > >>> windows
> > > > >>>
> > > > >>> > > > > timers is
> > > > >>> > > > >
> > > > >>> > > > > > > > > somewhat questionable. And the stack walk library
> > >
> > > (for
> > >
> > > > >>> > > recording
> > > > >>> > >
> > > > >>> > > > > the
> > > > >>> > > > >
> > > > >>> > > > > > > > > backtrace) is apparently just badly written and
> > > > >>> > > > > > > > > therefore
> > > > >>> > >
> > > > >>> > > insanely
> > > > >>> > >
> > > > >>> > > > > > > > > slow
> > > > >>> > > > > > > > > and
> > > > >>> > > > > > > > > memory hungry.)
> > > > >>> > > > > > > > >
> > > > >>> > > > > > > > > On Tue, Dec 2, 2014 at 12:59 PM Tim Holy <
> > > > >>>
> > > > >>> [email protected]>
> > > > >>>
> > > > >>> > > > > wrote:
> > > > >>> > > > > > > > >> I think it's just that Windows is bad at
> scheduling
> > > > >>>
> > > > >>> tasks
> > > > >>>
> > > > >>> > > > > > > > >> with
> > > > >>> > > > > > > > >> short-latency,
> > > > >>> > > > > > > > >> high-precision timing, but I am not the right
> > >
> > > person to
> > >
> > > > >>> > > > > > > > >> answer
> > > > >>> > > > >
> > > > >>> > > > > such
> > > > >>> > > > >
> > > > >>> > > > > > > > >> questions.
> > > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> --Tim
> > > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> On Tuesday, December 02, 2014 09:57:28 AM Peter
> > >
> > > Simon
> > >
> > > > >>> wrote:
> > > > >>> > > > > > > > >> > I have also experienced the inaccurate profile
> > > > >>>
> > > > >>> timings on
> > > > >>>
> > > > >>> > > > > Windows.
> > > > >>> > > > >
> > > > >>> > > > > > > Is
> > > > >>> > > > > > >
> > > > >>> > > > > > > > >> the
> > > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> > reason for the bad profiler performance on
> > > > >>> > > > > > > > >> > Windows
> > > > >>> > >
> > > > >>> > > understood?
> > > > >>> > >
> > > > >>> > > > > Are
> > > > >>> > > > >
> > > > >>> > > > > > > > >> there
> > > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> > plans for improvement?
> > > > >>> > > > > > > > >> >
> > > > >>> > > > > > > > >> > Thanks,
> > > > >>> > > > > > > > >> > --Peter
> > > > >>> > > > > > > > >> >
> > > > >>> > > > > > > > >> > On Tuesday, December 2, 2014 3:57:16 AM UTC-8,
> > > > >>> > > > > > > > >> > Tim
> > > > >>>
> > > > >>> Holy
> > > > >>>
> > > > >>> > > wrote:
> > > > >>> > > > > > > > >> > > By default, the profiler takes one sample
> per
> > > > >>> > >
> > > > >>> > > millisecond. In
> > > > >>> > >
> > > > >>> > > > > > > > >> practice,
> > > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> > > the
> > > > >>> > > > > > > > >> > > timing is quite precise on Linux, seemingly
> > >
> > > within
> > >
> > > > >>> > > > > > > > >> > > a
> > > > >>> > >
> > > > >>> > > factor
> > > > >>> > >
> > > > >>> > > > > > > > >> > > of
> > > > >>> > > > > > >
> > > > >>> > > > > > > twoish
> > > > >>> > > > > > >
> > > > >>> > > > > > > > >> on
> > > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> > > OSX,
> > > > >>> > > > > > > > >> > > and nowhere close on Windows. So at least on
> > >
> > > Linux
> > >
> > > > >>> you
> > > > >>>
> > > > >>> > > > > > > > >> > > can
> > > > >>> > > > >
> > > > >>> > > > > simply
> > > > >>> > > > >
> > > > >>> > > > > > > > >> > > read
> > > > >>> > > > > > > > >> > > samples
> > > > >>> > > > > > > > >> > > as milliseconds.
> > > > >>> > > > > > > > >> > >
> > > > >>> > > > > > > > >> > > If you want to visualize the relative
> > >
> > > contributions
> > >
> > > > >>> of
> > > > >>>
> > > > >>> > > each
> > > > >>> > >
> > > > >>> > > > > > > > >> statement, I
> > > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> > > highly recommend ProfileView. If you use
> > > > >>>
> > > > >>> LightTable, it's
> > > > >>>
> > > > >>> > > > > already
> > > > >>> > > > >
> > > > >>> > > > > > > > >> built-in
> > > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> > > via
> > > > >>> > > > > > > > >> > > the profile() command. The combination of
> > > > >>>
> > > > >>> ProfileView and
> > > > >>>
> > > > >>> > > > > > > > >> > > @profile
> > > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> is, in
> > > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> > > my
> > > > >>> > > > > > > > >> > > (extremely biased) opinion, quite powerful
> > >
> > > compared
> > >
> > > > >>> to
> > > > >>>
> > > > >>> > > tools
> > > > >>> > >
> > > > >>> > > > > > > > >> > > I
> > > > >>> > > > > > >
> > > > >>> > > > > > > used
> > > > >>> > > > > > >
> > > > >>> > > > > > > > >> > > previously
> > > > >>> > > > > > > > >> > > in other programming environments.
> > > > >>> > > > > > > > >> > >
> > > > >>> > > > > > > > >> > > Finally, there's IProfile.jl, which works
> via a
> > > > >>> > > > > > > > >> > > completely
> > > > >>> > > > > > >
> > > > >>> > > > > > > different
> > > > >>> > > > > > >
> > > > >>> > > > > > > > >> > > mechanism
> > > > >>> > > > > > > > >> > > but does report raw timings (with some
> pretty
> > >
> > > big
> > >
> > > > >>> > > caveats).
> > > > >>> > >
> > > > >>> > > > > > > > >> > > Best,
> > > > >>> > > > > > > > >> > > --Tim
> > > > >>> > > > > > > > >> > >
> > > > >>> > > > > > > > >> > > On Monday, December 01, 2014 10:13:16 PM
> > >
> > > Christoph
> > >
> > > > >>> Ortner
> > > > >>>
> > > > >>> > > > > wrote:
> > > > >>> > > > > > > > >> > > > How do you get timings from the Julia
> > >
> > > profiler,
> > >
> > > > >>> or even
> > > > >>>
> > > > >>> > > > > better,
> > > > >>> > > > >
> > > > >>> > > > > > > > >> %-es? I
> > > > >>> > > > > > > > >>
> > > > >>> > > > > > > > >> > > > guess one can convert from the numbers one
> > >
> > > gets,
> > >
> > > > >>> but it
> > > > >>>
> > > > >>> > > is
> > > > >>> > >
> > > > >>> > > > > > > > >> > > > a
> > > > >>> > > > > > > > >> > > > bit
> > > > >>> > > > > > > > >> > >
> > > > >>> > > > > > > > >> > > painful?
> > > > >>> > > > > > > > >> > >
> > > > >>> > > > > > > > >> > > > Christoph
>
>

Reply via email to