I'm experimenting with 'groupby-like' operations across columns of an
'inverted table'.
NB. Inverted table has three columns: x, y, z
x =: 1 1 2 2 3 3 4 4
y =: 1 1 1 2 2 2 3 3
z =: 1 2 3 4 5 6 7 8
sum =: +/
xy =: x ,. y NB. I want to groupby x and y
(('x' , ' ', 'y'); 'sum z') ,: (~. xy) ; ,. (i.~ xy) sum/. z NB.
Sum z for distinct x, y pairs
Which yields this table:
----T-----┐
│x y│sum z│
+---+-----+
│1 1│ 3 │
│2 1│ 3 │
│2 2│ 4 │
│3 2│11 │
│4 3│15 │
L---+------
Thrilled that I can at least produce right answers, I now want to improve
the performance.
So, I've been experiementing with:
x =: ? 10000000 $ 999
y =: ? 10000000 $ 999
z =: ? 10000000 $ 999
Cutting away the formatting fluff from the table-forming expression above,
the CPU consuming core is this:
(~. xy) ; (i.~ xy) sum/. z
Any suggestions on how to do this more efficiently (faster)? For a start,
I feel as if my
approach must be calculating the nub of xy twice. Also, perhaps xy, as I've
created it, is a poor choice
of structure to work with (I found the nub of x was massively faster to
calculate than the nub of xy):
Ts '~. x'
0.21264 6784
Ts '~. xy'
9.13063 2.68436e8
Note: to simplify the explanation above, I used 'sum' but actually I want
to 'collect' partitions of z:
collect =: <@,
(i.~ xy) collect/. z
----T-T-T---T---┐
│1 2│3│4│5 6│7 8│
L---+-+-+---+----
Finally, I'm keen to have a generalised form of this groupby available to
me. Ie. Group by an arbitrary number of columns, not just two.
Many thanks for any insights,
Mike
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm