you could copy the whole stack (typically only a few 100kb, max of 8MB),
then do the stack walk offline. if you could change the stack pages to
copy-on-write, it may even not be too expensive.

but this is the real problem:

```
|__/                   |  x86_64-linux-gnu
julia> @time for i=1:10^4 backtrace() end
elapsed time: 2.789268693 seconds (3200320016 bytes allocated, 89.29% gc
time)
```

```
|__/                   |  x86_64-apple-darwin14.0.0
julia> @time for i=1:10^4 backtrace() end
elapsed time: 2.586410216 seconds (6400480000 bytes allocated, 89.96% gc
time)
```

```
jameson@julia:~/julia-win32$ ./usr/bin/julia.exe -E " @time for i=1:10^3
backtrace() end "
fixme:winsock:WS_EnterSingleProtocolW unknown Protocol <0x00000000>
fixme:winsock:WS_EnterSingleProtocolW unknown Protocol <0x00000000>
err:dbghelp_stabs:stabs_parse Unknown stab type 0x0a
elapsed time: 22.6314386 seconds (320032016 bytes allocated, 1.51% gc time)
```

```
|__/                   |  i686-w64-mingw32
julia> @time for i=1:10^4 backtrace() end
elapsed time: 69.243275608 seconds (3200320800 bytes allocated, 13.16% gc
time)
```

And yes, those gc fractions are verifiably correct. With gc_disable(), they
execute in 1/10 of the time. So, that pretty much means you must take 1/100
of the samples if you want to preserve roughly the same slow down. On
linux, I find the slowdown to be in the range of 2-5x, and consider that to
be pretty reasonable, especially for what you're getting. If you took the
same number of samples on windows, it would cause a 200-500x slowdown (give
or take a few percent). If you wanted to offload this work to other cores
to get the same level of accuracy and no more slowdown than linux, you
would need a machine with 200-500 processors (give or take 2-5)!

(I think I did those conversions correctly. However, since I just did them
for the purposes of this email, sans calculator, and as I was typing, let
me know if I made more than a factor of 2 error somewhere, or just have fun
reading https://what-if.xkcd.com/84/ instead)

On Tue Dec 02 2014 at 6:23:07 PM Tim Holy <[email protected]> wrote:

> On Tuesday, December 02, 2014 10:24:43 PM Jameson Nash wrote:
> > You can't profile a moving target. The thread must be frozen first to
> > ensure the stack trace doesn't change while attempting to record it
>
> Got it. I assume there's no good way to "make a copy and then discard if
> the
> copy is bad"?
>
> --Tim
>
> >
> > On Tue, Dec 2, 2014 at 5:12 PM Tim Holy <[email protected]> wrote:
> > > If the work of walking the stack is done in the thread, why does it
> cause
> > > any
> > > slowdown of the main process?
> > >
> > > But of course the time it takes to complete the backtrace sets an upper
> > > limit
> > > on how frequently you can take a snapshot. In that case, though,
> couldn't
> > > you
> > > just have the thread always collecting backtraces?
> > >
> > > --Tim
> > >
> > > On Tuesday, December 02, 2014 09:54:17 PM Jameson Nash wrote:
> > > > That's essentially what we do now. (Minus the busy wait part). The
> > >
> > > overhead
> > >
> > > > is too high to run it any more frequently -- it already causes a
> > > > significant performance penalty on the system, even at the much lower
> > > > sample rate than linux. However, I suspect the truncated backtraces
> on
> > > > win32 were exaggerating the effect somewhat -- that should not be as
> > > > much
> > > > of an issue now.
> > > >
> > > > Sure, windows lets you snoop on (and modify) the address space of any
> > > > process, you just need to find the right handle.
> > > >
> > > > On Tue, Dec 2, 2014 at 2:18 PM Tim Holy <[email protected]> wrote:
> > > > > On Windows, is there any chance that one could set up a separate
> > > > > thread
> > > > > for
> > > > > profiling and use busy-wait to do the timing? (I don't even know
> > >
> > > whether
> > >
> > > > > one
> > > > > thread can snoop on the execution state of another thread.)
> > > > >
> > > > > --Tim
> > > > >
> > > > > On Tuesday, December 02, 2014 06:22:39 PM Jameson Nash wrote:
> > > > > > Although, over thanksgiving, I pushed a number of fixes which
> should
> > > > > > improve the quality of backtraces on win32 (and make sys.dll
> usable
> > > > >
> > > > > there)
> > > > >
> > > > > > On Tue, Dec 2, 2014 at 1:20 PM Jameson Nash <[email protected]>
> > >
> > > wrote:
> > > > > > > Correct. Windows imposes a much higher overhead on just about
> > > > > > > every
> > > > >
> > > > > aspect
> > > > >
> > > > > > > of doing profiling. Unfortunately, there isn't much we can do
> > > > > > > about
> > > > >
> > > > > this,
> > > > >
> > > > > > > other then to complain to Microsoft. (It doesn't have signals,
> so
> > >
> > > we
> > >
> > > > > must
> > > > >
> > > > > > > emulate them with a separate thread. The accuracy of windows
> > >
> > > timers is
> > >
> > > > > > > somewhat questionable. And the stack walk library (for
> recording
> > >
> > > the
> > >
> > > > > > > backtrace) is apparently just badly written and therefore
> insanely
> > > > > > > slow
> > > > > > > and
> > > > > > > memory hungry.)
> > > > > > >
> > > > > > > On Tue, Dec 2, 2014 at 12:59 PM Tim Holy <[email protected]>
> > >
> > > wrote:
> > > > > > >> I think it's just that Windows is bad at scheduling tasks with
> > > > > > >> short-latency,
> > > > > > >> high-precision timing, but I am not the right person to answer
> > >
> > > such
> > >
> > > > > > >> questions.
> > > > > > >>
> > > > > > >> --Tim
> > > > > > >>
> > > > > > >> On Tuesday, December 02, 2014 09:57:28 AM Peter Simon wrote:
> > > > > > >> > I have also experienced the inaccurate profile timings on
> > >
> > > Windows.
> > >
> > > > > Is
> > > > >
> > > > > > >> the
> > > > > > >>
> > > > > > >> > reason for the bad profiler performance on Windows
> understood?
> > >
> > > Are
> > >
> > > > > > >> there
> > > > > > >>
> > > > > > >> > plans for improvement?
> > > > > > >> >
> > > > > > >> > Thanks,
> > > > > > >> > --Peter
> > > > > > >> >
> > > > > > >> > On Tuesday, December 2, 2014 3:57:16 AM UTC-8, Tim Holy
> wrote:
> > > > > > >> > > By default, the profiler takes one sample per
> millisecond. In
> > > > > > >>
> > > > > > >> practice,
> > > > > > >>
> > > > > > >> > > the
> > > > > > >> > > timing is quite precise on Linux, seemingly within a
> factor
> > > > > > >> > > of
> > > > >
> > > > > twoish
> > > > >
> > > > > > >> on
> > > > > > >>
> > > > > > >> > > OSX,
> > > > > > >> > > and nowhere close on Windows. So at least on Linux you can
> > >
> > > simply
> > >
> > > > > > >> > > read
> > > > > > >> > > samples
> > > > > > >> > > as milliseconds.
> > > > > > >> > >
> > > > > > >> > > If you want to visualize the relative contributions of
> each
> > > > > > >>
> > > > > > >> statement, I
> > > > > > >>
> > > > > > >> > > highly recommend ProfileView. If you use LightTable, it's
> > >
> > > already
> > >
> > > > > > >> built-in
> > > > > > >>
> > > > > > >> > > via
> > > > > > >> > > the profile() command. The combination of ProfileView and
> > > > > > >> > > @profile
> > > > > > >>
> > > > > > >> is, in
> > > > > > >>
> > > > > > >> > > my
> > > > > > >> > > (extremely biased) opinion, quite powerful compared to
> tools
> > > > > > >> > > I
> > > > >
> > > > > used
> > > > >
> > > > > > >> > > previously
> > > > > > >> > > in other programming environments.
> > > > > > >> > >
> > > > > > >> > > Finally, there's IProfile.jl, which works via a completely
> > > > >
> > > > > different
> > > > >
> > > > > > >> > > mechanism
> > > > > > >> > > but does report raw timings (with some pretty big
> caveats).
> > > > > > >> > >
> > > > > > >> > > Best,
> > > > > > >> > > --Tim
> > > > > > >> > >
> > > > > > >> > > On Monday, December 01, 2014 10:13:16 PM Christoph Ortner
> > >
> > > wrote:
> > > > > > >> > > > How do you get timings from the Julia profiler, or even
> > >
> > > better,
> > >
> > > > > > >> %-es? I
> > > > > > >>
> > > > > > >> > > > guess one can convert from the numbers one gets, but it
> is
> > > > > > >> > > > a
> > > > > > >> > > > bit
> > > > > > >> > >
> > > > > > >> > > painful?
> > > > > > >> > >
> > > > > > >> > > > Christoph
>
>

Reply via email to