(I forgot to mention, that, to be fair, the windows machine that was used to run this test was an underpowered dual-core hyperthreaded atom processor, whereas the linux and mac machines were pretty comparable Xeon and sandybridge machines, respectively. I only gave windows a factor of 2 advantage in the above computation in my accounting for this gap) On Tue Dec 02 2014 at 10:50:20 PM Tim Holy <[email protected]> wrote:
> Wow, those are pathetically-slow backtraces. Since most of us don't have > machines with 500 cores, I don't see anything we can do. > > --Tim > > On Wednesday, December 03, 2014 03:14:02 AM Jameson Nash wrote: > > you could copy the whole stack (typically only a few 100kb, max of 8MB), > > then do the stack walk offline. if you could change the stack pages to > > copy-on-write, it may even not be too expensive. > > > > but this is the real problem: > > > > ``` > > > > |__/ | x86_64-linux-gnu > > > > julia> @time for i=1:10^4 backtrace() end > > elapsed time: 2.789268693 seconds (3200320016 bytes allocated, 89.29% gc > > time) > > ``` > > > > ``` > > > > |__/ | x86_64-apple-darwin14.0.0 > > > > julia> @time for i=1:10^4 backtrace() end > > elapsed time: 2.586410216 seconds (6400480000 bytes allocated, 89.96% gc > > time) > > ``` > > > > ``` > > jameson@julia:~/julia-win32$ ./usr/bin/julia.exe -E " @time for i=1:10^3 > > backtrace() end " > > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol <0x00000000> > > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol <0x00000000> > > err:dbghelp_stabs:stabs_parse Unknown stab type 0x0a > > elapsed time: 22.6314386 seconds (320032016 bytes allocated, 1.51% gc > time) > > ``` > > > > ``` > > > > |__/ | i686-w64-mingw32 > > > > julia> @time for i=1:10^4 backtrace() end > > elapsed time: 69.243275608 seconds (3200320800 bytes allocated, 13.16% gc > > time) > > ``` > > > > And yes, those gc fractions are verifiably correct. With gc_disable(), > they > > execute in 1/10 of the time. So, that pretty much means you must take > 1/100 > > of the samples if you want to preserve roughly the same slow down. On > > linux, I find the slowdown to be in the range of 2-5x, and consider that > to > > be pretty reasonable, especially for what you're getting. If you took the > > same number of samples on windows, it would cause a 200-500x slowdown > (give > > or take a few percent). If you wanted to offload this work to other cores > > to get the same level of accuracy and no more slowdown than linux, you > > would need a machine with 200-500 processors (give or take 2-5)! > > > > (I think I did those conversions correctly. However, since I just did > them > > for the purposes of this email, sans calculator, and as I was typing, let > > me know if I made more than a factor of 2 error somewhere, or just have > fun > > reading https://what-if.xkcd.com/84/ instead) > > > > On Tue Dec 02 2014 at 6:23:07 PM Tim Holy <[email protected]> wrote: > > > On Tuesday, December 02, 2014 10:24:43 PM Jameson Nash wrote: > > > > You can't profile a moving target. The thread must be frozen first to > > > > ensure the stack trace doesn't change while attempting to record it > > > > > > Got it. I assume there's no good way to "make a copy and then discard > if > > > the > > > copy is bad"? > > > > > > --Tim > > > > > > > On Tue, Dec 2, 2014 at 5:12 PM Tim Holy <[email protected]> wrote: > > > > > If the work of walking the stack is done in the thread, why does it > > > > > > cause > > > > > > > > any > > > > > slowdown of the main process? > > > > > > > > > > But of course the time it takes to complete the backtrace sets an > > > > > upper > > > > > limit > > > > > on how frequently you can take a snapshot. In that case, though, > > > > > > couldn't > > > > > > > > you > > > > > just have the thread always collecting backtraces? > > > > > > > > > > --Tim > > > > > > > > > > On Tuesday, December 02, 2014 09:54:17 PM Jameson Nash wrote: > > > > > > That's essentially what we do now. (Minus the busy wait part). > The > > > > > > > > > > overhead > > > > > > > > > > > is too high to run it any more frequently -- it already causes a > > > > > > significant performance penalty on the system, even at the much > > > > > > lower > > > > > > sample rate than linux. However, I suspect the truncated > backtraces > > > > > > on > > > > > > > > > win32 were exaggerating the effect somewhat -- that should not > be as > > > > > > much > > > > > > of an issue now. > > > > > > > > > > > > Sure, windows lets you snoop on (and modify) the address space of > > > > > > any > > > > > > process, you just need to find the right handle. > > > > > > > > > > > > On Tue, Dec 2, 2014 at 2:18 PM Tim Holy <[email protected]> > wrote: > > > > > > > On Windows, is there any chance that one could set up a > separate > > > > > > > thread > > > > > > > for > > > > > > > profiling and use busy-wait to do the timing? (I don't even > know > > > > > > > > > > whether > > > > > > > > > > > > one > > > > > > > thread can snoop on the execution state of another thread.) > > > > > > > > > > > > > > --Tim > > > > > > > > > > > > > > On Tuesday, December 02, 2014 06:22:39 PM Jameson Nash wrote: > > > > > > > > Although, over thanksgiving, I pushed a number of fixes which > > > > > > should > > > > > > > > > > > improve the quality of backtraces on win32 (and make sys.dll > > > > > > usable > > > > > > > > > > there) > > > > > > > > > > > > > > > On Tue, Dec 2, 2014 at 1:20 PM Jameson Nash < > [email protected]> > > > > > > > > > > wrote: > > > > > > > > > Correct. Windows imposes a much higher overhead on just > about > > > > > > > > > every > > > > > > > > > > > > > > aspect > > > > > > > > > > > > > > > > of doing profiling. Unfortunately, there isn't much we can > do > > > > > > > > > about > > > > > > > > > > > > > > this, > > > > > > > > > > > > > > > > other then to complain to Microsoft. (It doesn't have > signals, > > > > > > so > > > > > > > > we > > > > > > > > > > > > must > > > > > > > > > > > > > > > > emulate them with a separate thread. The accuracy of > windows > > > > > > > > > > timers is > > > > > > > > > > > > > > somewhat questionable. And the stack walk library (for > > > > > > recording > > > > > > > > the > > > > > > > > > > > > > > backtrace) is apparently just badly written and therefore > > > > > > insanely > > > > > > > > > > > > slow > > > > > > > > > and > > > > > > > > > memory hungry.) > > > > > > > > > > > > > > > > > > On Tue, Dec 2, 2014 at 12:59 PM Tim Holy < > [email protected]> > > > > > > > > > > wrote: > > > > > > > > >> I think it's just that Windows is bad at scheduling tasks > > > > > > > > >> with > > > > > > > > >> short-latency, > > > > > > > > >> high-precision timing, but I am not the right person to > > > > > > > > >> answer > > > > > > > > > > such > > > > > > > > > > > > > >> questions. > > > > > > > > >> > > > > > > > > >> --Tim > > > > > > > > >> > > > > > > > > >> On Tuesday, December 02, 2014 09:57:28 AM Peter Simon > wrote: > > > > > > > > >> > I have also experienced the inaccurate profile timings > on > > > > > > > > > > Windows. > > > > > > > > > > > > Is > > > > > > > > > > > > > > > >> the > > > > > > > > >> > > > > > > > > >> > reason for the bad profiler performance on Windows > > > > > > understood? > > > > > > > > Are > > > > > > > > > > > > > >> there > > > > > > > > >> > > > > > > > > >> > plans for improvement? > > > > > > > > >> > > > > > > > > > >> > Thanks, > > > > > > > > >> > --Peter > > > > > > > > >> > > > > > > > > > >> > On Tuesday, December 2, 2014 3:57:16 AM UTC-8, Tim Holy > > > > > > wrote: > > > > > > > > >> > > By default, the profiler takes one sample per > > > > > > millisecond. In > > > > > > > > > > > >> practice, > > > > > > > > >> > > > > > > > > >> > > the > > > > > > > > >> > > timing is quite precise on Linux, seemingly within a > > > > > > factor > > > > > > > > > > > >> > > of > > > > > > > > > > > > > > twoish > > > > > > > > > > > > > > > >> on > > > > > > > > >> > > > > > > > > >> > > OSX, > > > > > > > > >> > > and nowhere close on Windows. So at least on Linux you > > > > > > > > >> > > can > > > > > > > > > > simply > > > > > > > > > > > > > >> > > read > > > > > > > > >> > > samples > > > > > > > > >> > > as milliseconds. > > > > > > > > >> > > > > > > > > > > >> > > If you want to visualize the relative contributions of > > > > > > each > > > > > > > > > > > >> statement, I > > > > > > > > >> > > > > > > > > >> > > highly recommend ProfileView. If you use LightTable, > it's > > > > > > > > > > already > > > > > > > > > > > > > >> built-in > > > > > > > > >> > > > > > > > > >> > > via > > > > > > > > >> > > the profile() command. The combination of ProfileView > and > > > > > > > > >> > > @profile > > > > > > > > >> > > > > > > > > >> is, in > > > > > > > > >> > > > > > > > > >> > > my > > > > > > > > >> > > (extremely biased) opinion, quite powerful compared to > > > > > > tools > > > > > > > > > > > >> > > I > > > > > > > > > > > > > > used > > > > > > > > > > > > > > > >> > > previously > > > > > > > > >> > > in other programming environments. > > > > > > > > >> > > > > > > > > > > >> > > Finally, there's IProfile.jl, which works via a > > > > > > > > >> > > completely > > > > > > > > > > > > > > different > > > > > > > > > > > > > > > >> > > mechanism > > > > > > > > >> > > but does report raw timings (with some pretty big > > > > > > caveats). > > > > > > > > > > > >> > > Best, > > > > > > > > >> > > --Tim > > > > > > > > >> > > > > > > > > > > >> > > On Monday, December 01, 2014 10:13:16 PM Christoph > Ortner > > > > > > > > > > wrote: > > > > > > > > >> > > > How do you get timings from the Julia profiler, or > even > > > > > > > > > > better, > > > > > > > > > > > > > >> %-es? I > > > > > > > > >> > > > > > > > > >> > > > guess one can convert from the numbers one gets, > but it > > > > > > is > > > > > > > > > > > >> > > > a > > > > > > > > >> > > > bit > > > > > > > > >> > > > > > > > > > > >> > > painful? > > > > > > > > >> > > > > > > > > > > >> > > > Christoph > >
