I was going to spend my evening writing more parallel c, but I decided it
would probably be more relaxing (if less exciting) to write some parallel
j instead. So here we are: some parallel reimplementations of primitives,
in increasing order of interest:
First, some basic definitions to start us off:
0&T.@'' ::] ^:_'' NB.eat all the cores
NPAR=. 1 T.''
Now, let's try to implement rank, the quintessence of parallel iteration
in j. Actually, let's just implement monadic "_1, which iterates in
parallel over the major cells of y. The obvious formulation is:
peach0=: >@:((t.'')"_1)
but this is almost certainly not what we want; virtual nouns cannot be
passed between threads, so this will incur copies which will be expensive
unless u is expensive and y's major cells are small. We also need y to
have a small number of cells, or else we will cause scheduler contention
(or, in the current implementation, we'll block everything on the main
thread).
peach0 is important as an archetype of all forms of parallel iteration,
because the limitations I mentioned are fundamental to the current
implementation. However, it's necessary to layer additional abstractions
on top of this primitive to get an actually performant parallel
implementation of "_1.
Here's a first testcase for peach0:
a=. i.4 1e8
timex'+/"_1 a'
0.225933
timex'+/peach0 a'
0.911939
The solution to this problem that I presented the other day is to pass a
task identifier to the task, and use it to construct the virtual noun
there instead of in the creating thread, viz:
peach1=: {{ > ([: u {&y) t.''"_1 i.#y }}
now we see an improvement!
timex'+/peach1 a'
0.0738621
peach1 is not completely general, though; it starts up a task for _every_
major cell of y, so if there are many small cells it will perform quite
badly. A better strategy, then, is to start up NPAR tasks, giving each
a contiguous slice of values, and using the primitive u"n to operate on
the cells within that slice. Here I use the dyads {. and }. to construct
that slice:
prankb=: {{ par=. NPAR <. #y
c=. (#y) <.@% par NB.chunk size
d=. c*i.par NB.drops, see also +/\par#c
t=. (c#~<:par) , c+par|#y NB.takes, last of which may be oversized
([: u"n {&t {. {&d }. y"_) t.''"_1 i.par }}
prank=: ;@:prankb
peach2=: prank _1
Another nice thing about prank is that, since it also uses the
primitive ", it can take advantage of IRS (integrated rank support)
once hardware parallelism runs out. Performance compares favourably
on a transposed a:
a=. i.1e8 4
timex'+/"_1 a'
0.458253
timex'+/peach2 a'
0.311498
but it is not quite so impressive as in the previous case. The majority
of the time is actually spent in ;, copying result cells out of boxes,
because there are so many results, and each is individually so cheap to
compute. It's for this reason that I also expose prankb, which
leaves its result chunks in their boxes; it may be possible to
profit by operating on the results without first unboxing them. If,
for instance, we try to sum up all the elements of the array, the
numbers get nicer:
timex'+/+/"_1 a'
0.529847
timex'+/+/&>+/prankb _1 a'
0.130087
we'll make this even faster later.
There is more work to be done on prank; aside from the obvious bug and the
lack of ambivalence, it assumes there are enough major cells to satisfy
all threads. In the event that there are fewer major cells than NPAR, and
n is deeper than _1, it could profit by subdividing along other
axes. I haven't bothered doing this yet.
I also made a parallel implementation of x u\ y, but it proved not
very interesting, so I won't reproduce it here. It is effectively the
same as prankb--including the metacircular reversion to the primitive
\--except that d and t need to be computed differently. I think the
template used there is a general one; it is much the same as how you
would write it in c.
Next: reductions.
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm