This seems nuts. There have to be good profilers on Windows – how do those work?
On Tue, Dec 2, 2014 at 11:55 PM, Jameson Nash <[email protected]> wrote: > (I forgot to mention, that, to be fair, the windows machine that was used > to run this test was an underpowered dual-core hyperthreaded atom > processor, whereas the linux and mac machines were pretty comparable Xeon > and sandybridge machines, respectively. I only gave windows a factor of 2 > advantage in the above computation in my accounting for this gap) > > On Tue Dec 02 2014 at 10:50:20 PM Tim Holy <[email protected]> wrote: > >> Wow, those are pathetically-slow backtraces. Since most of us don't have >> machines with 500 cores, I don't see anything we can do. >> >> --Tim >> >> On Wednesday, December 03, 2014 03:14:02 AM Jameson Nash wrote: >> > you could copy the whole stack (typically only a few 100kb, max of 8MB), >> > then do the stack walk offline. if you could change the stack pages to >> > copy-on-write, it may even not be too expensive. >> > >> > but this is the real problem: >> > >> > ``` >> > >> > |__/ | x86_64-linux-gnu >> > >> > julia> @time for i=1:10^4 backtrace() end >> > elapsed time: 2.789268693 seconds (3200320016 bytes allocated, 89.29% gc >> > time) >> > ``` >> > >> > ``` >> > >> > |__/ | x86_64-apple-darwin14.0.0 >> > >> > julia> @time for i=1:10^4 backtrace() end >> > elapsed time: 2.586410216 seconds (6400480000 bytes allocated, 89.96% gc >> > time) >> > ``` >> > >> > ``` >> > jameson@julia:~/julia-win32$ ./usr/bin/julia.exe -E " @time for >> i=1:10^3 >> > backtrace() end " >> > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol <0x00000000> >> > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol <0x00000000> >> > err:dbghelp_stabs:stabs_parse Unknown stab type 0x0a >> > elapsed time: 22.6314386 seconds (320032016 bytes allocated, 1.51% gc >> time) >> > ``` >> > >> > ``` >> > >> > |__/ | i686-w64-mingw32 >> > >> > julia> @time for i=1:10^4 backtrace() end >> > elapsed time: 69.243275608 seconds (3200320800 bytes allocated, 13.16% >> gc >> > time) >> > ``` >> > >> > And yes, those gc fractions are verifiably correct. With gc_disable(), >> they >> > execute in 1/10 of the time. So, that pretty much means you must take >> 1/100 >> > of the samples if you want to preserve roughly the same slow down. On >> > linux, I find the slowdown to be in the range of 2-5x, and consider >> that to >> > be pretty reasonable, especially for what you're getting. If you took >> the >> > same number of samples on windows, it would cause a 200-500x slowdown >> (give >> > or take a few percent). If you wanted to offload this work to other >> cores >> > to get the same level of accuracy and no more slowdown than linux, you >> > would need a machine with 200-500 processors (give or take 2-5)! >> > >> > (I think I did those conversions correctly. However, since I just did >> them >> > for the purposes of this email, sans calculator, and as I was typing, >> let >> > me know if I made more than a factor of 2 error somewhere, or just have >> fun >> > reading https://what-if.xkcd.com/84/ instead) >> > >> > On Tue Dec 02 2014 at 6:23:07 PM Tim Holy <[email protected]> wrote: >> > > On Tuesday, December 02, 2014 10:24:43 PM Jameson Nash wrote: >> > > > You can't profile a moving target. The thread must be frozen first >> to >> > > > ensure the stack trace doesn't change while attempting to record it >> > > >> > > Got it. I assume there's no good way to "make a copy and then discard >> if >> > > the >> > > copy is bad"? >> > > >> > > --Tim >> > > >> > > > On Tue, Dec 2, 2014 at 5:12 PM Tim Holy <[email protected]> wrote: >> > > > > If the work of walking the stack is done in the thread, why does >> it >> > > >> > > cause >> > > >> > > > > any >> > > > > slowdown of the main process? >> > > > > >> > > > > But of course the time it takes to complete the backtrace sets an >> > > > > upper >> > > > > limit >> > > > > on how frequently you can take a snapshot. In that case, though, >> > > >> > > couldn't >> > > >> > > > > you >> > > > > just have the thread always collecting backtraces? >> > > > > >> > > > > --Tim >> > > > > >> > > > > On Tuesday, December 02, 2014 09:54:17 PM Jameson Nash wrote: >> > > > > > That's essentially what we do now. (Minus the busy wait part). >> The >> > > > > >> > > > > overhead >> > > > > >> > > > > > is too high to run it any more frequently -- it already causes a >> > > > > > significant performance penalty on the system, even at the much >> > > > > > lower >> > > > > > sample rate than linux. However, I suspect the truncated >> backtraces >> > > >> > > on >> > > >> > > > > > win32 were exaggerating the effect somewhat -- that should not >> be as >> > > > > > much >> > > > > > of an issue now. >> > > > > > >> > > > > > Sure, windows lets you snoop on (and modify) the address space >> of >> > > > > > any >> > > > > > process, you just need to find the right handle. >> > > > > > >> > > > > > On Tue, Dec 2, 2014 at 2:18 PM Tim Holy <[email protected]> >> wrote: >> > > > > > > On Windows, is there any chance that one could set up a >> separate >> > > > > > > thread >> > > > > > > for >> > > > > > > profiling and use busy-wait to do the timing? (I don't even >> know >> > > > > >> > > > > whether >> > > > > >> > > > > > > one >> > > > > > > thread can snoop on the execution state of another thread.) >> > > > > > > >> > > > > > > --Tim >> > > > > > > >> > > > > > > On Tuesday, December 02, 2014 06:22:39 PM Jameson Nash wrote: >> > > > > > > > Although, over thanksgiving, I pushed a number of fixes >> which >> > > >> > > should >> > > >> > > > > > > > improve the quality of backtraces on win32 (and make sys.dll >> > > >> > > usable >> > > >> > > > > > > there) >> > > > > > > >> > > > > > > > On Tue, Dec 2, 2014 at 1:20 PM Jameson Nash < >> [email protected]> >> > > > > >> > > > > wrote: >> > > > > > > > > Correct. Windows imposes a much higher overhead on just >> about >> > > > > > > > > every >> > > > > > > >> > > > > > > aspect >> > > > > > > >> > > > > > > > > of doing profiling. Unfortunately, there isn't much we >> can do >> > > > > > > > > about >> > > > > > > >> > > > > > > this, >> > > > > > > >> > > > > > > > > other then to complain to Microsoft. (It doesn't have >> signals, >> > > >> > > so >> > > >> > > > > we >> > > > > >> > > > > > > must >> > > > > > > >> > > > > > > > > emulate them with a separate thread. The accuracy of >> windows >> > > > > >> > > > > timers is >> > > > > >> > > > > > > > > somewhat questionable. And the stack walk library (for >> > > >> > > recording >> > > >> > > > > the >> > > > > >> > > > > > > > > backtrace) is apparently just badly written and therefore >> > > >> > > insanely >> > > >> > > > > > > > > slow >> > > > > > > > > and >> > > > > > > > > memory hungry.) >> > > > > > > > > >> > > > > > > > > On Tue, Dec 2, 2014 at 12:59 PM Tim Holy < >> [email protected]> >> > > > > >> > > > > wrote: >> > > > > > > > >> I think it's just that Windows is bad at scheduling tasks >> > > > > > > > >> with >> > > > > > > > >> short-latency, >> > > > > > > > >> high-precision timing, but I am not the right person to >> > > > > > > > >> answer >> > > > > >> > > > > such >> > > > > >> > > > > > > > >> questions. >> > > > > > > > >> >> > > > > > > > >> --Tim >> > > > > > > > >> >> > > > > > > > >> On Tuesday, December 02, 2014 09:57:28 AM Peter Simon >> wrote: >> > > > > > > > >> > I have also experienced the inaccurate profile timings >> on >> > > > > >> > > > > Windows. >> > > > > >> > > > > > > Is >> > > > > > > >> > > > > > > > >> the >> > > > > > > > >> >> > > > > > > > >> > reason for the bad profiler performance on Windows >> > > >> > > understood? >> > > >> > > > > Are >> > > > > >> > > > > > > > >> there >> > > > > > > > >> >> > > > > > > > >> > plans for improvement? >> > > > > > > > >> > >> > > > > > > > >> > Thanks, >> > > > > > > > >> > --Peter >> > > > > > > > >> > >> > > > > > > > >> > On Tuesday, December 2, 2014 3:57:16 AM UTC-8, Tim Holy >> > > >> > > wrote: >> > > > > > > > >> > > By default, the profiler takes one sample per >> > > >> > > millisecond. In >> > > >> > > > > > > > >> practice, >> > > > > > > > >> >> > > > > > > > >> > > the >> > > > > > > > >> > > timing is quite precise on Linux, seemingly within a >> > > >> > > factor >> > > >> > > > > > > > >> > > of >> > > > > > > >> > > > > > > twoish >> > > > > > > >> > > > > > > > >> on >> > > > > > > > >> >> > > > > > > > >> > > OSX, >> > > > > > > > >> > > and nowhere close on Windows. So at least on Linux >> you >> > > > > > > > >> > > can >> > > > > >> > > > > simply >> > > > > >> > > > > > > > >> > > read >> > > > > > > > >> > > samples >> > > > > > > > >> > > as milliseconds. >> > > > > > > > >> > > >> > > > > > > > >> > > If you want to visualize the relative contributions >> of >> > > >> > > each >> > > >> > > > > > > > >> statement, I >> > > > > > > > >> >> > > > > > > > >> > > highly recommend ProfileView. If you use LightTable, >> it's >> > > > > >> > > > > already >> > > > > >> > > > > > > > >> built-in >> > > > > > > > >> >> > > > > > > > >> > > via >> > > > > > > > >> > > the profile() command. The combination of >> ProfileView and >> > > > > > > > >> > > @profile >> > > > > > > > >> >> > > > > > > > >> is, in >> > > > > > > > >> >> > > > > > > > >> > > my >> > > > > > > > >> > > (extremely biased) opinion, quite powerful compared >> to >> > > >> > > tools >> > > >> > > > > > > > >> > > I >> > > > > > > >> > > > > > > used >> > > > > > > >> > > > > > > > >> > > previously >> > > > > > > > >> > > in other programming environments. >> > > > > > > > >> > > >> > > > > > > > >> > > Finally, there's IProfile.jl, which works via a >> > > > > > > > >> > > completely >> > > > > > > >> > > > > > > different >> > > > > > > >> > > > > > > > >> > > mechanism >> > > > > > > > >> > > but does report raw timings (with some pretty big >> > > >> > > caveats). >> > > >> > > > > > > > >> > > Best, >> > > > > > > > >> > > --Tim >> > > > > > > > >> > > >> > > > > > > > >> > > On Monday, December 01, 2014 10:13:16 PM Christoph >> Ortner >> > > > > >> > > > > wrote: >> > > > > > > > >> > > > How do you get timings from the Julia profiler, or >> even >> > > > > >> > > > > better, >> > > > > >> > > > > > > > >> %-es? I >> > > > > > > > >> >> > > > > > > > >> > > > guess one can convert from the numbers one gets, >> but it >> > > >> > > is >> > > >> > > > > > > > >> > > > a >> > > > > > > > >> > > > bit >> > > > > > > > >> > > >> > > > > > > > >> > > painful? >> > > > > > > > >> > > >> > > > > > > > >> > > > Christoph >> >>
