Dan:
A suggestion that may make it easier.
Represent each identity as a vector att0 att1 att2 ..., where each att is
the numerical value of an attribute, which you can get from the nub index
of each column. In your terminology, att is a bucket number. Then
   c=:[: (+/%#) =
is a dyad giving the "correlation" of two vectors, i.e. the fraction of
positions at which they agree, e.g.
   1 2 3 4 c 1 1 3 4
0.75
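A rough sketch of how this might be applied to the John/Joe/Sally/Sammy
example. The table raw, the names nubidx and att, and the choice to store
each attribute already reduced to its bucketing key (height in inches,
first digit of weight, decade of age) are assumptions made for
illustration; they are not part of Dan's code.

```j
NB. Hypothetical raw table: one row per identity, one boxed column per metric.
raw =: 4 5 $ 70;1;3;'M';'Black'; 70;1;2;'M';'Black'; 63;1;2;'F';'Blonde'; 73;2;3;'M';'Blonde'

NB. Nub index of each column: transpose, index each row in its own nub, transpose back.
nubidx =: (~. i. ])"1&.|:
att =: nubidx raw        NB. 4 5 table of bucket numbers, one row per identity

NB. John's dyad: fraction of positions at which two vectors agree.
c =: [: (+/%#) =

NB. Apply c between every pair of rows to get the whole table at once.
tbl =: c"1/~ att
```

Here att should come out with rows 0 0 0 0 0, 0 0 1 0 0, 1 0 1 1 1 and
2 1 0 0 1, and tbl should equal the result of correl jjss in the quoted
message below.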
Best wishes,
John
Dan Bron wrote:
> Assume you have a list of metrics, and a list of identities. Each metric
> classifies the set of identities into sub-sets identical under that
> metric.
>
> For example, say the metrics are height, weight (first digit), age
> (decade), sex, hair color, and you have the following identities and
> aspects:
>
> ID     Height  Weight  Age  Sex  Hair
> John   5'10"   170#    34   M    Black
> Joe    5'10"   190#    25   M    Black
> Sally  5'3"    110#    23   F    Blonde
> Sammy  6'1"    210#    37   M    Blonde
>
> Then the metrics would bucket the identities like this:
>
> Height: John Joe;Sally;Sammy
> Weight: John Joe Sally;Sammy
> Age: John Sammy;Joe Sally
> Sex: John Joe Sammy;Sally
> Hair: John Joe;Sally Sammy
>
> Given such a set of categorizations, your job is to find the "correlation"
> of the identities. The word "correlation" is in quotes because I don't
> mean it in the statistical sense (I'm not even sure what the definition is
> there).
>
> What I mean is, I want to know how related each identity is to each other,
> where a pair is related by the number of buckets (categories) in which
> they appear together (scaled by the total number of buckets). That is, a
> pair of identities is equivalent if they are bucketed together by every
> metric (i.e. there is no test which can tell them apart). A pair of
> identities is completely different if they never appear together in a
> bucket. A pair of identities is partially related if they appear together
> in some buckets, but not others (i.e. their identities overlap a bit).
>
> I already have code to do this, as a proof of concept. But I'm looking
> for better, faster, leaner, cleaner code.
> Put another way, I'm looking for a better way to express the following
> algorithm. Can you solve this puzzle?
>
> NB. A dyadic verb to generate some test data.
> NB. x is the number of metrics, y is the number of identities.
> NB. EG: (3 test_data 5) might produce
> NB. EG: (0 3 2;1 4) ; (1;0 3 2;4) ; < (2;3;0 4;1)
> test_data =: (?~ <;.1~ i. e. 0 , ? ? ])&.>@:#
>
> NB. A monadic verb that finds all connections.
> NB. That is, given a list M, produce all possible
> NB. pairings of the items in M .
> NB. EG: (all_pairs i. 2) would produce
> NB. EG: 4 2$0 0 0 1 1 0 1 1
> all_pairs =. ,@:{@:(2 # ])&.<
>
> NB. Monadic verb to turn the pairings
> NB. into a connection table. Stolen from
> NB. the Dictionary page "20. Directed graphs."
> NB. EG: (arcs2mat 0 1,1 2,:2 0 ) would produce
> NB. EG: 3 3$0 1 0 0 0 1 1 0 0
> max_point =. >:@:(>./)@:,
> convert =. #. e.~ [: i. [ , [
> arcs2mat =. convert~ max_point
>
> NB. Makes a connection table for each metric
> cxn_mats =. arcs2mat @: ; @: (all_pairs&.>)
>
> NB. Sums the connection tables, resulting in
> NB. the "correlation" of each pair of identities.
> cxn_mat =. [: +/ cxn_mats&>
>
> NB. Scaled connections.
> NB. By definition a square matrix of numbers, with
> NB. shape ,~ # ; > {. y with 1 at every position
> NB. on the diagonal, and symmetric about the diagonal.
> scale =. ] % >./@:,
> correl =: scale@:cxn_mat f.
>
> Given the John/Joe/Sally/Sammy example above, the final categorization
> could be represented as:
>
>    jjss =: (<0 1;(,2);,3),(<0 1 2;,3),(<0 3;1 2),(<0 1 3;,2),<0 1;2 3
>    correl jjss
>   1 0.8 0.2 0.4
> 0.8   1 0.4 0.2
> 0.2 0.4   1 0.2
> 0.4 0.2 0.2   1
>
> Other compatible datasets can be generated using the dyadic verb
> test_data where x is the number of metrics and y is the number of
> identities.
>
> -Dan
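
A minimal sketch of how the vector representation might be derived
directly from data in the boxed format above, so that the c dyad can
reproduce the correl jjss table. The names bucket, att and tbl are
hypothetical, and the sketch assumes that each metric's buckets partition
i. n, i.e. every identity appears exactly once per metric, as in the jjss
example.

```j
jjss =: (<0 1;(,2);,3),(<0 1 2;,3),(<0 3;1 2),(<0 1 3;,2),<0 1;2 3

NB. For one metric (a list of boxed buckets): the bucket number of each identity.
NB. Repeat each bucket number once per member, then order by identity number.
bucket =: 3 : '((#&> y) # i. # y) /: ; y'

NB. One row per identity, one column per metric.
att =: |: bucket&> jjss

NB. Fraction of metrics on which two identities share a bucket.
c =: [: (+/%#) =

tbl =: c"1/~ att
```

For jjss, tbl should match the 4 by 4 table shown for correl jjss. Note
that c divides by the number of metrics, whereas scale divides by the
largest entry of the summed table; the two agree here because the diagonal
of that table always equals the number of metrics.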
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm