I agree; an np.setnumthreads() to manage a numpy-global threadpool makes sense to me.
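To make that concrete, here is a minimal sketch of what such a knob
might look like at the C level. Every name in it (npy_num_threads,
PyArray_SetNumThreads, effective_num_threads) is hypothetical, not
existing numpy API:

    /* Hypothetical process-global thread count; -1 means "one thread
     * per core".  Neither this variable nor the setter exists in
     * numpy; this only sketches the proposed np.setnumthreads(). */
    static int npy_num_threads = -1;

    int
    PyArray_SetNumThreads(int n)
    {
        int old = npy_num_threads;
        npy_num_threads = (n > 0) ? n : -1;
        return old;  /* hand back the old value so callers can restore it */
    }

    /* Parallel kernels would then size their pools like this: */
    static int
    effective_num_threads(int num_cores)
    {
        return (npy_num_threads > 0) ? npy_num_threads : num_cores;
    }

Returning the previous value makes it cheap to scope an override to a
single call site and restore the default afterwards.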
Of course there are a great many cases where just spawning as many threads as cores is a sensible default, but if this kind of behavior could not be overridden, I could see it greatly reducing performance for some of my more complex projects.

On Fri, Jan 16, 2015 at 4:11 PM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote:

> On 01/16/2015 03:14 PM, Lars Buitinck wrote:
> > 2015-01-16 13:29 GMT+01:00 <numpy-discussion-requ...@scipy.org>:
> >> Date: Fri, 16 Jan 2015 12:43:43 +0100
> >> From: Julian Taylor <jtaylor.deb...@googlemail.com>
> >> Subject: Re: [Numpy-discussion] Sorting refactor
> >>
> >> On 16.01.2015 12:33, Lars Buitinck wrote:
> >>> * Quicksort has a quadratic time worst case. I think it should be
> >>> turned into an introsort [1] for O(n log n) worst case; we have the
> >>> heapsort needed to do that.
> >>
> >> This probably rarely happens in numeric data, and we do have
> >> guaranteed O(n log n) runtime algorithms available.
> >
> > It's no more likely or unlikely than in any other type of data
> > (AFAIK), but it's possible for an adversary to DOS any (web server)
> > code that uses np.sort.
>
> If you are using numpy where an arbitrary user is allowed to control
> the data passed to a non-isolated environment, you have a problem
> anyway. numpy is far from secure software and there are likely
> hundreds of ways to produce a DOS and dozens of ways to get code
> execution in any nontrivial numpy-using application.
>
> >> I was also thinking about this; an advantage of a sorting network is
> >> that it can be vectorized to be significantly faster than an
> >> insertion sort. Handling NaNs should also be directly possible.
> >
> > Tried that, and it didn't give any speedup, at least not without
> > explicit vector instructions.
> >
> > Just moving the NaNs aside didn't cost anything in my preliminary
> > benchmarks (without sorting nets); the cost of the operation was
> > almost exactly compensated by simpler comparisons.
>
> An SSE2 implementation of a 16-entry bitonic sort is available here:
> https://github.com/mischasan/sse2/blob/master/ssesort.c
> There is also a benchmark; on my machine it is 6 times faster than
> insertion sort. But again, this would only gain us a 5-10% improvement
> at best, as the partition step of quicksort is still the major
> time-consuming part.
>
> >> The issue is that it's probably too much complicated code for only a
> >> very small gain.
> >
> > Maybe. The thing is that the selection algorithms are optimized for
> > NaNs and seem to depend on the current comparison code. We'd need
> > distinct <TYPE>_LT and <TYPE>_LT_NONAN for each <TYPE>.
> >
> > The sorting nets themselves aren't complicated, just lengthy. My
> > branch has the length-optimal (not optimally parallel) ones for
> > n <= 16.
> >
> >> But I assume you will get the same trouble as with OpenMP; that
> >> needs testing. Also, cilk+ in gcc is not really production-ready
> >> yet, I got lots of crashes when I last tried it (it will be in 5.0
> >> though).
> >
> > The data-parallel constructs tend to crash the compiler, but task
> > spawning seems to be stable in 4.9.2. I've still to see how it
> > handles multiprocessing/fork.
> >
> > What do you mean by "will be in 5.0", did they do a big push?
>
> The gcc 5.0 changelog reports "full support for Cilk Plus".
> Also, all bugs I have filed have been fixed in 5.0.
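The introsort Lars mentions is worth spelling out, because it is a
small change to what we already have: quicksort runs as usual, but a
depth budget of about 2*log2(n) triggers a heapsort fallback, which
bounds the worst case at O(n log n). A minimal sketch, assuming
partition(), heapsort() and insertion_sort() helpers with the obvious
meanings (illustrative names, not numpy's internal ones):

    #include <stddef.h>

    /* Introsort sketch: quicksort with a depth budget.  When the
     * budget runs out we switch to heapsort, capping the worst case
     * at O(n log n).  hi is exclusive. */
    static void
    introsort_rec(double *a, ptrdiff_t lo, ptrdiff_t hi, int depth)
    {
        while (hi - lo > 16) {              /* small runs: insertion sort */
            if (depth-- == 0) {
                heapsort(a + lo, hi - lo);  /* O(n log n) fallback */
                return;
            }
            ptrdiff_t p = partition(a, lo, hi); /* quicksort pivot step */
            introsort_rec(a, lo, p, depth);     /* recurse on left part */
            lo = p + 1;                         /* tail-loop on right part */
        }
        insertion_sort(a + lo, hi - lo);
    }

    void
    introsort(double *a, ptrdiff_t n)
    {
        int depth = 0;
        for (ptrdiff_t m = n; m > 1; m >>= 1)
            depth++;                        /* depth ~ log2(n) */
        introsort_rec(a, 0, n, 2 * depth);
    }

Since the fallback only fires on pathological input, common-case
performance is exactly that of the current quicksort.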
> >> Date: Fri, 16 Jan 2015 13:28:56 +0100
> >> From: Daπid <davidmen...@gmail.com>
> >> Subject: Re: [Numpy-discussion] Sorting refactor
> >>
> >> On 16 January 2015 at 13:15, Eelco Hoogendoorn
> >> <hoogendoorn.ee...@gmail.com> wrote:
> >>> Perhaps an algorithm can be made faster, but often these multicore
> >>> algorithms are also less efficient, and a less data-dependent way
> >>> of putting my cores to good use would have been preferable.
> >>> Potentially, other code could slow down due to cache thrashing if
> >>> too many parallel tasks run in parallel. I'd rather be in charge
> >>> of such matters myself; but I imagine adding a keyword arg for
> >>> these matters would not be much trouble?
> >>
> >> As I understand it, that is where the strength of Cilk+ lies. It
> >> does not force parallelisation, just suggests it. Whether to
> >> actually spawn in parallel is decided at runtime depending on the
> >> load of the other cores.
> >
> > cilk+ guarantees that the amount of space used by a pool of P
> > threads is at most P times the stack space used by the sequential
> > version (+ a constant). The idea is that you can say
> >
> >     for (i = 0; i < 1000000; i++) {
> >         cilk_spawn f(a[i]);
> >     }
> >
> > and it will never create more than P work items in memory, rather
> > than 1e6, even if each f() spawns a bunch itself. Of course, it
> > won't guarantee that OpenMP will not also spawn P threads and/or
> > check that you're one of P processes cooperating on a task using
> > multiprocessing.
> >
> > Personally I'd rather have an np.setnumthreads() to turn this on or
> > off for a process and have the runtime distribute work for me
> > instead of having to do it myself.
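The runtime-managed pool Lars describes already has exactly this kind
of knob in the Cilk Plus runtime itself: __cilkrts_set_param("nworkers",
...) caps the worker pool for the whole process, much as the proposed
np.setnumthreads() would for numpy. A small self-contained example of
the pattern under discussion (assuming a cilk-enabled gcc, i.e. 4.9+
with -fcilkplus):

    #include <stdio.h>
    #include <cilk/cilk.h>
    #include <cilk/cilk_api.h>

    #define N 1000000

    static double a[N];

    int main(void)
    {
        /* Cap the worker pool at 4 threads; this must happen before
         * the first spawn.  It plays the role the proposed
         * np.setnumthreads() would play for numpy. */
        __cilkrts_set_param("nworkers", "4");

        /* The runtime load-balances the iterations over at most 4
         * workers and never materializes N work items at once. */
        cilk_for (long i = 0; i < N; i++) {
            a[i] = (double)i * (double)i;
        }

        printf("a[N-1] = %g\n", a[N - 1]);
        return 0;
    }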