Dan:

A suggestion that may make it easier.

Represent each identity as a vector att0 att1 att2 ... where each att is
the numerical value of one attribute, which you can get from the nub index
(i.~) of each column.  In your terminology, att is a bucket number.  Then

c=:[: (+/%#) =

is a dyad giving the "correlation" of two vectors in your sense, i.e. the
fraction of attributes on which they agree, e.g.

   1 2 3 4 c 1 1 3 4
0.75
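
Here is a minimal sketch of the whole idea on your four people.  The names
H W A S C and ids are just mine for illustration, and I converted the
heights to inches and bucketed weight and age by hand:

   c   =: [: (+/ % #) =
   H   =: i.~ 70 70 63 73                      NB. height in inches
   W   =: i.~ 1 1 1 2                          NB. weight, first digit
   A   =: i.~ 3 2 2 3                          NB. age, decade
   S   =: i.~ 'MMFM'                           NB. sex
   C   =: i.~ 'Black';'Black';'Blonde';'Blonde'
   ids =: |: H , W , A , S ,: C                NB. one row per identity
   ids
0 0 0 0 0
0 0 1 0 0
2 0 1 2 2
3 3 0 0 2
   c"1/~ ids
  1 0.8 0.2 0.4
0.8   1 0.4 0.2
0.2 0.4   1 0.2
0.4 0.2 0.2   1

The nub index does not give consecutive bucket numbers, but that does not
matter here since c only tests for equality.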

Best wishes,

John

Dan Bron wrote:
> Assume you have a list of metrics, and a list of identities.  Each metric
> classifies the set of identities into sub-sets identical under that
> metric.
>
> For example, say the metrics are height, weight (first digit), age
> (decade), sex, hair color, and you have the following identities and
> aspects:
>
>       ID      Height  Weight  Age     Sex     Hair
>       John    5'10"   170#    34      M       Black
>       Joe     5'10"   190#    25      M       Black
>       Sally   5'3"    110#    23      F       Blonde
>       Sammy   6'1"    210#    37      M       Blonde
>
> Then the metrics would bucket the identities like this:
>
>       Height: John Joe;Sally;Sammy
>       Weight: John Joe Sally;Sammy
>       Age:    John Sammy;Joe Sally
>       Sex:    John Joe Sammy;Sally
>       Hair:   John Joe;Sally Sammy
>
> Given such a set of categorizations, your job is to find the "correlation"
> of the identities.  The word "correlation" is in quotes because I don't
> mean it in the statistical sense (I'm not even sure what the definition is
> there).
>
> What I mean is, I want to know how related each identity is to every other,
> where a pair is related by the number of buckets (categories) in which
> they appear together (scaled by the total number of buckets).  That is, a
> pair of identities is equivalent if they are bucketed together by every
> metric (i.e. there is no test which can tell them apart).  A pair of
> identities is completely different if they never appear together in a
> bucket.  A pair of identities is partially related if they appear together
> in some buckets, but not others (i.e. their identities overlap a bit).
>
> I already have code to do this, as a proof of concept.  But I'm looking
> for better, faster, leaner, cleaner code.
> Put another way, I'm looking for a better way to express the following
> algorithm.  Can you solve this puzzle?
>
>       NB.  A dyadic verb to generate some test data.
>       NB.  x is the number of metrics, y is the number of identities.
>       NB.  EG:    (3 test_data 5)  might produce
>       NB.  EG:    (0 3 2;1 4) ; (1;0 3 2;4) ; < (2;3;0 4;1)
>       test_data =: (?~ <;.1~ i. e. 0 , ? ? ])&.>@:#
>
>       NB.  A monadic verb that finds all connections.
>       NB.  That is, given a list  M  , produce all possible
>       NB.  pairings of the items in  M  .
>       NB.  EG:  (all_pairs  i. 2)   would produce
>       NB.  EG:  4 2$0 0 0 1 1 0 1 1
>       all_pairs =. ,@:{@:(2 # ])&.<
>
>       NB.  Monadic verb to turn the pairings
>       NB.  into a connection table.  Stolen from
>       NB.  the Dictionary page  "20.  Directed graphs."
>       NB.  EG:  (arcs2mat  0 1,1 2,:2 0 )  would produce
>       NB.  EG:  3 3$0 1 0 0 0 1 1 0 0
>       max_point =. >:@:(>./)@:,
>       convert   =. #. e.~ [: i. [ , [
>       arcs2mat  =. convert~ max_point
>
>       NB.  Makes a connection table for each metric
>       cxn_mats  =. arcs2mat @: ; @: (all_pairs&.>)
>
>       NB.  Sums the connection tables, resulting in
>       NB.  the "correlation" of each pair of identities.
>       cxn_mat   =. [: +/ cxn_mats&>
>
>       NB.  Scaled connections.
>       NB.  By definition a square matrix of numbers, with
>       NB.  shape  ,~ # ; > {. y  with   1   at every position
>       NB.  on the diagonal, and symmetric about the diagonal.
>       scale     =. ] % >./@:,
>       correl    =: scale@:cxn_mat f.
>
> Given the John/Joe/Sally/Sammy example above, the final categorization
> could be represented as:
>
>          jjss   =:  (<0 1;(,2);,3),(<0 1 2;,3),(<0 3;1 2),(<0 1 3;,2),<0 1;2 3
>          correl jjss
>         1 0.8 0.2 0.4
>       0.8   1 0.4 0.2
>       0.2 0.4   1 0.2
>       0.4 0.2 0.2   1
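
The bucket lists in jjss can also be turned into attribute vectors directly.
This is only a sketch: it assumes every metric mentions each identity
exactly once, and the names labels and b2v are mine, not part of your code.

   c      =: [: (+/ % #) =
   jjss   =: (<0 1;(,2);,3),(<0 1 2;,3),(<0 3;1 2),(<0 1 3;,2),<0 1;2 3
   labels =: #&> # i.@#       NB. bucket number of each id, in razed order
   b2v    =: /:@; { labels    NB. put those bucket numbers in identity order
   b2v 0 3;1 2
0 1 1 0
   c"1/~ |: b2v&> jjss
  1 0.8 0.2 0.4
0.8   1 0.4 0.2
0.2 0.4   1 0.2
0.4 0.2 0.2   1

Since the razed buckets form a permutation of the identities, grading them
gives the inverse permutation, which is all b2v needs to reorder the labels.
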
>
> Other compatible datasets can be generated using the dyadic verb
> test_data  where  x  is the number of metrics and  y  is the number of
> identities.
>
> -Dan
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>

