On 11/08/2012 07:55 PM, Dag Sverre Seljebotn wrote:
> On 11/08/2012 06:59 PM, Francesc Alted wrote:
>> On 11/8/12 6:38 PM, Dag Sverre Seljebotn wrote:
>>> On 11/08/2012 06:06 PM, Francesc Alted wrote:
>>>> On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
>>>>> On 11/07/2012 08:41 PM, Neal Becker wrote:
>>>>>> Would you expect numexpr without MKL to give a significant boost?
>>>>> If you need higher performance than what numexpr can give without using
>>>>> MKL, you could look at code such as this:
>>>>>
>>>>> https://github.com/herumi/fmath/blob/master/fmath.hpp#L480
>>>> Hey, that's cool. I was a bit disappointed not finding this sort of
>>>> work in open space. It seems that this lacks threading support, but
>>>> that should be easy to implement by using OpenMP directives.
>>> IMO this is the wrong place to introduce threading; each thread should
>>> call expd_v on its chunks. (Which I think is how you said numexpr
>>> currently uses VML anyway.)
>>
>> Oh sure, but then you need a blocked engine for performing the
>> computations too. And yes, by default numexpr uses its own threading
>
> I just meant that you can use a chunked OpenMP for-loop wherever in your
> code that you call expd_v. A "five-line blocked engine", if you like :-)
>
> IMO that's the right location since entering/exiting OpenMP blocks takes
> some time.
>
>> code rather than the existing one in VML (but that can be changed by
>> playing with set_num_threads/set_vml_num_threads). It always struck
>> me as a little strange that the internal threading in numexpr was more
>> efficient than the VML one, but I suppose this is because the latter is
>> more optimized to deal with large blocks instead of those of medium size
>> (4K) in numexpr.
>
> I don't know enough about numexpr to understand this :-)
>
> I guess I just don't see the motivation to use VML threading or why it
> should be faster?
> If you pass a single 4K block to a threaded VML call
> then I could easily see lots of performance problems: a)
> starting/stopping threads or signalling the threads of a pool is a
> constant overhead per "parallel section", b) unless you're very careful
> to only have VML touch the data, and VML always schedules elements in
> the exact same way, you're going to have the cache lines of that 4K
> block shuffled between L1 caches of different cores for different
> operations...
c) Your "effective block size" is then 4KB/ncores. (Unless you scale the
block size by ncores).

DS
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
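
To make the chunking argument above concrete, here is a minimal sketch (not code from this thread) of the "each thread calls the kernel on its own chunk" pattern, together with the block-size arithmetic from point c). It uses only the Python standard library. All function names are hypothetical; `exp_block` merely stands in for a vectorized kernel such as `expd_v`. Under CPython's GIL this pure-Python loop only shows the structure, it will not give a real speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def exp_block(src, dst, start, stop):
    # Hypothetical stand-in for a vectorized kernel such as expd_v:
    # it only touches the half-open range [start, stop).
    for i in range(start, stop):
        dst[i] = math.exp(src[i])


def chunk_bounds(n, nthreads):
    # Split range(n) into nthreads contiguous chunks whose sizes
    # differ by at most one element.
    base, extra = divmod(n, nthreads)
    bounds = []
    start = 0
    for t in range(nthreads):
        stop = start + base + (1 if t < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def parallel_exp(src, nthreads=4):
    # One parallel section for the whole array; each worker calls the
    # kernel once on its own contiguous chunk.
    dst = [0.0] * len(src)
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        futures = [
            pool.submit(exp_block, src, dst, a, b)
            for a, b in chunk_bounds(len(src), nthreads)
        ]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return dst


def effective_block_size(block_size, ncores, scale_by_cores=False):
    # Point c): if a single block is split across ncores threads, each
    # core sees block_size / ncores elements, unless the block is first
    # scaled up by ncores.
    total = block_size * ncores if scale_by_cores else block_size
    return total // ncores
```

For example, `effective_block_size(4096, 4)` gives 1024, while `effective_block_size(4096, 4, scale_by_cores=True)` stays at 4096. Each worker keeps to one contiguous range, so the cache lines of a chunk are touched by one thread only.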