Assume you have a list of metrics and a list of identities.  Each metric 
partitions the set of identities into subsets whose members are identical 
under that metric.

For example, say the metrics are height, weight (first digit), age (decade), 
sex, and hair color, and you have the following identities and aspects:

        ID      Height  Weight  Age     Sex     Hair  
        John    5'10"   170#    34      M       Black 
        Joe     5'10"   190#    25      M       Black 
        Sally   5'3"    110#    23      F       Blonde
        Sammy   6'1"    210#    37      M       Blonde

Then the metrics would bucket the identities like this:

        Height: John Joe;Sally;Sammy
        Weight: John Joe Sally;Sammy
        Age:    John Sammy;Joe Sally
        Sex:    John Joe Sammy;Sally
        Hair:   John Joe;Sally Sammy
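
As a minimal sketch of how one metric produces its buckets, the key adverb 
can group identity numbers by their value under that metric.  The nouns  
height ,  decade ,  hair  and  ids  are made up for this illustration; only 
the values come from the table above.

```j
NB. Hypothetical raw values, one item per identity, in the order
NB. John, Joe, Sally, Sammy (ids 0 1 2 3).
height =: 70 70 63 73             NB. inches
decade =: <. 34 25 23 37 % 10     NB. age reduced to its decade: 3 2 2 3
hair   =: 0 0 1 1                 NB. 0 = Black, 1 = Blonde
ids    =: i. 4

NB. x </. y  boxes the items of y that share a key in x,
NB. in order of each key's first appearance.
height </. ids     NB. buckets  0 1 ; 2 ; 3
decade </. ids     NB. buckets  0 3 ; 1 2
hair   </. ids     NB. buckets  0 1 ; 2 3
```

Each result is one metric in the boxed form used further down, with the 
identities replaced by their index numbers.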

Given such a set of categorizations, your job is to find the "correlation" of 
the identities.  The word "correlation" is in quotes because I don't mean it in 
the statistical sense (I'm not even sure what the definition is there).

What I mean is, I want to know how related each identity is to every other, 
where the relatedness of a pair is the number of metrics that bucket the two 
together (scaled by the total number of metrics).  That is, a pair of 
identities is equivalent if they are bucketed together by every metric (i.e. 
there is no test which can tell them apart).  A pair of identities is 
completely different if they never appear together in a bucket.  A pair of 
identities is partially related if they appear together in some buckets, but 
not others (i.e. their identities overlap a bit).
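
To make the definition concrete for a single pair, here is a small sketch.  
The names  john ,  joe ,  sally ,  sammy  and  related  are hypothetical; each 
noun lists the position of that person's bucket under the five metrics, read 
off the bucketing above.

```j
NB. Bucket index of each identity under the five metrics
NB. (height, weight, age, sex, hair).
john  =: 0 0 0 0 0
joe   =: 0 0 1 0 0
sally =: 1 0 1 1 1
sammy =: 2 1 0 0 1

NB. Fraction of metrics that put both identities in the same bucket.
related =: [: (+/ % #) =

john related joe      NB. 0.8  (together in all but age)
john related sally    NB. 0.2  (together only in weight)
john related john     NB. 1    (nothing tells an identity from itself)
```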

I already have code to do this, as a proof of concept.  But I'm looking for 
better, faster, leaner, cleaner code.  
Put another way, I'm looking for a better way to express the following 
algorithm.  Can you solve this puzzle? 

        NB.  A dyadic verb to generate some test data.  
        NB.  x is the number of metrics, y is the number of identities.
        NB.  EG:    (3 test_data 5)  might produce 
        NB.  EG:    (0 3 2;1 4) ; (1;0 3 2;4) ; < (2;3;0 4;1)
        test_data =: (?~ <;.1~ i. e. 0 , ? ? ])&.>@:# 
        
        NB.  A monadic verb that finds all connections.
        NB.  That is, given a list  M  , produce all possible 
        NB.  pairings of the items in  M  (self-pairs included).
        NB.  EG:  (all_pairs  i. 2)   would produce
        NB.  EG:  4 2$0 0 0 1 1 0 1 1
        all_pairs =. ,@:{@:(2 # ])&.<
        
        NB.  Monadic verb to turn the pairings 
        NB.  into a connection table.  Stolen from 
        NB.  the Dictionary page  "20.  Directed graphs."
        NB.  EG:  (arcs2mat  0 1,1 2,:2 0 )  would produce
        NB.  EG:  3 3$0 1 0 0 0 1 1 0 0
        max_point =. >:@:(>./)@:,        
        convert   =. #. e.~ [: i. [ , [
        arcs2mat  =. convert~ max_point
        
        NB.  Makes a connection table for each metric
        cxn_mats  =. arcs2mat @: ; @: (all_pairs&.>) 
        
        NB.  Sums the connection tables, resulting in 
        NB.  the "correlation" of each pair of identities.
        cxn_mat   =. [: +/ cxn_mats&>
        
        NB.  Scaled connections.
        NB.  By definition a square matrix of numbers, with
        NB.  shape  ,~ # ; > {. y  with   1   at every position
        NB.  on the diagonal, and symmetric about the diagonal.
        scale     =. ] % >./@:,
        correl    =: scale@:cxn_mat f.

Given the John/Joe/Sally/Sammy example above, the bucketing and the resulting 
"correlation" table could be represented as:

           jjss   =:  (<0 1;(,2);,3),(<0 1 2;,3),(<0 3;1 2),(<0 1 3;,2),<0 1;2 3
           correl jjss
          1 0.8 0.2 0.4
        0.8   1 0.4 0.2
        0.2 0.4   1 0.2
        0.4 0.2 0.2   1
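
One way to cross-check a table like this is to skip the pair lists entirely: 
give every identity the index of its bucket under each metric, compare those 
indexes pairwise, and average over the metrics.  The verbs  labels  and  
correl2  below are hypothetical names for this sketch, not part of the code 
above, and they assume every metric mentions each of the ids  0 .. n-1  
exactly once.

```j
NB. Hypothetical helper: bucket index of every identity under one metric.
NB. y is one metric, a list of boxed buckets covering ids 0 .. n-1.
labels  =: 3 : '((#&> y) # i. # y) /: ; y'

NB. Hypothetical alternative to correl: build one equality table per
NB. metric from its labels, then average the tables.
correl2 =: 3 : '(+/ % #) =/~"1 labels&> y'

jjss =: (<0 1;(,2);,3),(<0 1 2;,3),(<0 3;1 2),(<0 1 3;,2),<0 1;2 3
labels&> jjss     NB. 5 4 table, one row of bucket indexes per metric
correl2 jjss      NB. intended to agree with  correl jjss
```

The averaging step divides by the number of metrics rather than by the 
largest entry; the two agree because the diagonal always counts every metric.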

Other compatible datasets can be generated using the dyadic verb  test_data  
where  x  is the number of metrics and  y  is the number of identities.
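
As a sketch of what "compatible" means here, the check below repeats the  
test_data  definition from above and confirms that each generated metric is 
a partition of the identities.  The noun  d  is just a name for this example, 
and the data differs on every run.

```j
NB. Definition repeated from the listing above so the example stands alone.
test_data =: (?~ <;.1~ i. e. 0 , ? ? ])&.>@:#

d =: 3 test_data 5     NB. 3 metrics over 5 identities, random buckets

# d                            NB. 3, one box per metric
(i. 5) -:"1 /:~@;&> d          NB. 1 1 1 when every metric uses each id once
```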

-Dan
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
