I agree; an np.setnumthreads() to manage a numpy-global thread pool makes
sense to me.

Of course there are a great many cases where spawning as many threads as
there are cores is a sensible default, but if that behavior could not be
overridden, I could see it greatly reducing performance in some of my more
complex projects.
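
Just to make the proposal concrete, a hypothetical sketch of the API
surface, assuming a numpy-managed worker pool; numpy has no such function
today, and the names here are made up:

import multiprocessing

_num_threads = multiprocessing.cpu_count()  # default: one thread per core

def setnumthreads(n):
    # hypothetical np.setnumthreads: cap the global worker pool
    global _num_threads
    _num_threads = max(1, int(n))

def getnumthreads():
    return _num_threads

Parallel kernels would then consult getnumthreads() instead of
unconditionally spawning one thread per core.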

On Fri, Jan 16, 2015 at 4:11 PM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:

> On 01/16/2015 03:14 PM, Lars Buitinck wrote:
> > 2015-01-16 13:29 GMT+01:00  <numpy-discussion-requ...@scipy.org>:
> >> From: Julian Taylor <jtaylor.deb...@googlemail.com>
> >> Subject: Re: [Numpy-discussion] Sorting refactor
> >>
> >> On 16.01.2015 12:33, Lars Buitinck wrote:
> >>> * Quicksort has a quadratic time worst case. I think it should be
> >>> turned into an introsort [1] for O(n log n) worst case; we have the
> >>> heapsort needed to do that.
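> >>> The control flow is tiny; roughly this, as a Python sketch
> >>> (assuming the real thing would be an in-place C implementation):
> >>>
> >>> import heapq
> >>>
> >>> def introsort(a, depth=None):
> >>>     # quicksort with a recursion-depth cap; past the cap, fall
> >>>     # back to heapsort, bounding the worst case at O(n log n)
> >>>     if depth is None:
> >>>         depth = 2 * max(len(a), 1).bit_length()
> >>>     if len(a) <= 1:
> >>>         return a
> >>>     if depth == 0:
> >>>         heapq.heapify(a)  # heapsort fallback
> >>>         return [heapq.heappop(a) for _ in range(len(a))]
> >>>     pivot = a[len(a) // 2]
> >>>     lo = [x for x in a if x < pivot]
> >>>     eq = [x for x in a if x == pivot]
> >>>     hi = [x for x in a if x > pivot]
> >>>     return introsort(lo, depth - 1) + eq + introsort(hi, depth - 1)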
> >>
> >> This probably rarely happens with numeric data, and we do have
> >> guaranteed O(n log n) runtime algorithms available.
> >
> > It's no more likely or unlikely than in any other type of data
> > (AFAIK), but it's possible for an adversary to DoS any (web server)
> > code that uses np.sort.
>
> If you are using numpy in a non-isolated environment where an arbitrary
> user is allowed to control the data passed to it, you have a problem
> anyway. numpy is far from secure software; there are likely hundreds of
> ways to produce a DoS and dozens of ways to get code execution in any
> nontrivial numpy-using application.
>
> >
> >> I was also thinking about this; an advantage of a sorting network is
> >> that it can be vectorized to be significantly faster than an insertion
> >> sort. Handling NaNs should also be directly possible.
> >
> > Tried that, and it didn't give any speedup, at least not without
> > explicit vector instructions.
> >
> > Just moving the NaNs aside didn't cost anything in my preliminary
> > benchmarks (without sorting nets); the cost of the operation was
> > almost exactly compensated by the simpler comparisons.
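> >
> > At the numpy level the transformation is roughly this (just a sketch;
> > the real code would partition in place, in C, before the sort proper):
> >
> > import numpy as np
> >
> > def sort_nans_last(a):
> >     # partition NaNs to the end, then sort only the NaN-free part
> >     # with plain comparisons (no NaN checks in the hot loop)
> >     mask = np.isnan(a)
> >     clean = np.sort(a[~mask])
> >     n_nan = np.count_nonzero(mask)
> >     return np.concatenate([clean, np.full(n_nan, np.nan)])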
>
> An SSE2 implementation of a 16-entry bitonic sort is available here:
> https://github.com/mischasan/sse2/blob/master/ssesort.c
> There is also a benchmark; on my machine it's 6 times faster than
> insertion sort.
> But again, this would only gain us 5-10% improvement at best, as the
> partition part of quicksort is still the major time-consuming part.
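>
> To give the flavor: a sorting network is a fixed sequence of
> compare-exchange steps, so there is no data-dependent sequence of
> comparisons to get in the way of SIMD. The optimal 5-comparator network
> for n = 4, as a plain Python sketch (the SSE2 code performs the same
> exchanges on vector registers, with min/max instead of a branch):
>
> def sort4(a):
>     # the comparator sequence is fixed and data-independent; only
>     # the exchanges themselves depend on the values
>     for i, j in ((0, 1), (2, 3), (0, 2), (1, 3), (1, 2)):
>         if a[j] < a[i]:
>             a[i], a[j] = a[j], a[i]
>     return a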
>
> >
> >> The issue is that it's probably too much complicated code for only a
> >> very small gain.
> >
> > Maybe. The thing is that the selection algorithms are optimized for
> > NaNs and seem to depend on the current comparison code. We'd need
> > distinct <TYPE>_LT and <TYPE>_LT_NONAN for each <TYPE>.
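> >
> > In Python terms, the semantics of the pair would be something like
> > this (a sketch; the real ones would be C comparison macros per dtype):
> >
> > def LT(a, b):
> >     # NaN-aware less-than: non-NaN compares below NaN, so NaNs sort last
> >     return a < b or (b != b and a == a)
> >
> > def LT_NONAN(a, b):
> >     # plain less-than, valid once the NaNs have been moved aside
> >     return a < b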
> >
> > The sorting nets themselves aren't complicated, just lengthy. My
> > branch has the length-optimal (not optimally parallel) ones for n <=
> > 16.
> >
> >> But I assume you will get the same trouble as with OpenMP, though that
> >> needs testing. Also, Cilk Plus in gcc is not really production-ready
> >> yet; I got lots of crashes when I last tried it (it will be in 5.0,
> >> though).
> >
> > The data-parallel constructs tend to crash the compiler, but task
> > spawning seems to be stable in 4.9.2. I have yet to see how it handles
> > multiprocessing/fork.
> >
> > What do you mean by "will be in 5.0", did they do a big push?
>
> The gcc 5.0 changelog reports "full support for cilk plus".
> Also, all bugs I have filed have been fixed in 5.0.
>
> >
> >
> >> From: Daπid <davidmen...@gmail.com>
> >> Subject: Re: [Numpy-discussion] Sorting refactor
> >>
> >> On 16 January 2015 at 13:15, Eelco Hoogendoorn
> >> <hoogendoorn.ee...@gmail.com> wrote:
> >>> Perhaps an algorithm can be made faster, but often these multicore
> >>> algorithms are also less efficient, and a less data-dependent way of
> >>> putting my cores to good use would have been preferable. Potentially,
> >>> other code could slow down due to cache thrashing if too many tasks
> >>> run in parallel. I'd rather be in charge of such matters myself, but
> >>> I imagine adding a keyword arg for these matters would not be much
> >>> trouble?
> >>
> >> As I understand it, that is where the strength of Cilk Plus lies. It
> >> does not force parallelisation, it just suggests it. Whether to
> >> actually spawn in parallel is decided at runtime, depending on the
> >> load of the other cores.
> >
> > cilk+ guarantees that the amount of space used by a pool of P threads
> > is at most P times the stack space used by the sequential version (+ a
> > constant). The idea is that you can say
> >
> > for (i = 0; i < 1000000; i++) {
> >     cilk_spawn f(a[i]);
> > }
> >
> > and it will never create more than P work items in memory, rather than
> > 1e6, even if each f() spawns a bunch itself. Of course, it won't
> > guarantee that OpenMP will not also spawn P threads and/or check that
> > you're one of P processes cooperating on a task using multiprocessing.
> >
> > Personally I'd rather have an np.setnumthreads() to turn this on or
> > off for a process and have the runtime distribute work for me instead
> > of having to do it myself.
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
