[Jprogramming] Neophyte Performance Question - GROUPBY

Mike Thompson Fri, 08 Feb 2008 03:44:52 -0800

I'm experimenting with 'groupby-like' operations across columns of an
'inverted table'.


   NB.  Inverted table has three columns: x,  y,  z
   x =: 1 1 2 2 3 3 4 4
   y =: 1 1 1 2 2 2 3 3
   z =: 1 2 3 4 5 6 7 8

   sum =: +/

   xy =:  x ,. y      NB.  I want to groupby x and y
   (('x' , ' ', 'y'); 'sum z') ,: (~. xy) ; ,.  (i.~ xy) sum/. z     NB.
Sum z for distinct x, y pairs

Which yields this table:

----T-----┐
│x y│sum z│
+---+-----+
│1 1│ 3   │
│2 1│ 3   │
│2 2│ 4   │
│3 2│11   │
│4 3│15   │
L---+------

Thrilled that I can at least produce right answers, I now want to improve
the performance.

So, I've been experiementing with:

    x =: ? 10000000 $ 999
    y =: ? 10000000 $ 999
    z =: ? 10000000 $ 999

Cutting away the formatting fluff from the table-forming expression above,
the CPU consuming core is this:
     
      (~. xy) ;  (i.~ xy) sum/. z

Any suggestions on how to do this more efficiently (faster)?   For a start,
I feel as if my 
approach must be calculating the nub of xy twice.  Also, perhaps xy, as I've
created it, is a poor choice
of structure to work with (I found the nub of x was massively faster to
calculate than the nub of xy):

     Ts '~. x'
0.21264 6784

   Ts '~. xy' 
9.13063 2.68436e8


Note:  to simplify the explanation above, I used 'sum' but actually I want
to 'collect' partitions of z:

collect =: <@,  

(i.~ xy) collect/. z

----T-T-T---T---┐
│1 2│3│4│5 6│7 8│
L---+-+-+---+----

Finally, I'm keen to have a generalised form of this groupby available to
me.  Ie. Group by an arbitrary number of columns, not just two. 

Many thanks for any insights,
Mike


----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

[Jprogramming] Neophyte Performance Question - GROUPBY

Reply via email to