Assume you have a list of metrics and a list of identities. Each metric
partitions the set of identities into subsets whose members are identical
under that metric. For example, say the metrics are height, weight (first
digit), age (decade), sex, and hair color, and you have the following
identities and aspects:
ID     Height  Weight  Age  Sex  Hair
John   5'10"   170#    34   M    Black
Joe    5'10"   190#    25   M    Black
Sally  5'3"    110#    23   F    Blonde
Sammy  6'1"    210#    37   M    Blonde
Then the metrics would bucket the identities like this:
Height: John Joe;Sally;Sammy
Weight: John Joe Sally;Sammy
Age: John Sammy;Joe Sally
Sex: John Joe Sammy;Sally
Hair: John Joe;Sally Sammy
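As a rough sketch of how such buckets could be derived from the raw aspects (the name bucket is a hypothetical helper, not part of the code further down, and heights are converted to inches by hand), Key can group identity indices by value:

```j
NB. Hypothetical helper: box together the indices of equal items.
bucket =: </. i.@#

bucket 70 70 63 73                  NB. height in inches -> 0 1 ; (,2) ; ,3
bucket <. 170 190 110 210 % 100     NB. weight, first digit -> 0 1 2 ; ,3
bucket <. 34 25 23 37 % 10          NB. age, decade -> 0 3 ; 1 2
bucket 'MMFM'                       NB. sex -> 0 1 3 ; ,2
bucket 'Black';'Black';'Blonde';'Blonde'   NB. hair -> 0 1 ; 2 3
```

The indices 0 1 2 3 stand for John, Joe, Sally, and Sammy, in table order.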
Given such a set of categorizations, your job is to find the "correlation" of
the identities. The word "correlation" is in quotes because I don't mean it in
the statistical sense (I'm not even sure what the definition is there).
What I mean is, I want to know how related each identity is to every other,
where a pair is related by the number of buckets (categories) in which they
appear together, scaled by the number of metrics (a pair can share at most
one bucket per metric). That is, a pair of identities is equivalent if they
are bucketed together by every metric (i.e. there is no test which can tell
them apart). A pair of identities is completely different if they never
appear together in a bucket. A pair of identities is partially related if
they appear together in some buckets, but not others (i.e. their identities
overlap a bit).
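To make the definition concrete, here is a small sketch that works it out for single pairs. The names lab and related are made up for illustration, and the table of bucket labels was typed in by hand from the John/Joe/Sally/Sammy bucketing:

```j
NB. Rows are the metrics Height, Weight, Age, Sex, Hair; columns are
NB. John, Joe, Sally, Sammy. Each entry is a bucket label within its row.
lab =: 5 4 $ 0 0 1 2   0 0 0 1   0 1 1 0   0 0 1 0   0 0 1 1

NB. Hypothetical helper: fraction of metrics that put x and y together.
related =: 4 : '(+/ % #) (x {"1 lab) = y {"1 lab'

0 related 1   NB. John and Joe share 4 of 5 metrics: 0.8
0 related 2   NB. John and Sally share only Weight: 0.2
2 related 2   NB. an identity compared with itself: 1
```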
I already have code to do this, as a proof of concept. But I'm looking for
better, faster, leaner, cleaner code.
Put another way, I'm looking for a better way to express the following
algorithm. Can you solve this puzzle?
NB. A dyadic verb to generate some test data.
NB. x is the number of metrics, y is the number of identities.
NB. EG: (3 test_data 5) might produce
NB. EG: (0 3 2;1 4) ; (1;0 3 2;4) ; < (2;3;0 4;1)
test_data =: (?~ <;.1~ i. e. 0 , ? ? ])&.>@:#
NB. A monadic verb that finds all connections.
NB. That is, given a list M, produce all possible
NB. pairings of the items in M .
NB. EG: (all_pairs i. 2) would produce
NB. EG: 4 2$0 0 0 1 1 0 1 1
all_pairs =. ,@:{@:(2 # ])&.<
NB. Monadic verb to turn the pairings
NB. into a connection table. Stolen from
NB. the Dictionary page "20. Directed graphs."
NB. EG: (arcs2mat 0 1,1 2,:2 0 ) would produce
NB. EG: 3 3$0 1 0 0 0 1 1 0 0
max_point =. >:@:(>./)@:,
convert =. #. e.~ [: i. [ , [
arcs2mat =. convert~ max_point
NB. Makes a connection table for each metric
cxn_mats =. arcs2mat @: ; @: (all_pairs&.>)
NB. Sums the connection tables, resulting in
NB. the "correlation" of each pair of identities.
cxn_mat =. [: +/ cxn_mats&>
NB. Scaled connections.
NB. By definition a square matrix of numbers, with
NB. shape ,~ # ; > {. y with 1 at every position
NB. on the diagonal, and symmetric about the diagonal.
scale =. ] % >./@:,
correl =: scale@:cxn_mat f.
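For anyone tracing the pipeline, here is a walkthrough of the intermediate steps for a single metric. It assumes the definitions above were entered at the console (or changed from =. to =:) so the helper names are still visible; m and pairs are names introduced only for this illustration:

```j
m =: 0 1 ; (,2) ; ,3            NB. the Height metric from the example
pairs =: ; all_pairs&.> m       NB. 6 2 table: 0 0, 0 1, 1 0, 1 1, 2 2, 3 3
4 #. pairs                      NB. each pair as a base-4 number: 0 1 4 5 10 15
(i. 4 4) e. 4 #. pairs          NB. mark those positions in a 4 by 4 table
(arcs2mat pairs) -: (i. 4 4) e. 4 #. pairs   NB. should agree with arcs2mat
```

The base is 4 because max_point is one more than the largest identity index.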
Given the John/Joe/Sally/Sammy example above, the final categorization could be
represented as:
   jjss =: (<0 1;(,2);,3),(<0 1 2;,3),(<0 3;1 2),(<0 1 3;,2),<0 1;2 3
   correl jjss
  1 0.8 0.2 0.4
0.8   1 0.4 0.2
0.2 0.4   1 0.2
0.4 0.2 0.2   1
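One possible alternative formulation, offered only as a sketch and not as the definitive answer: instead of enumerating pairs, give each identity a bucket label per metric, compare the labels with an equality table, and average over the metrics. The names labels, same, and correl2 are invented here, and the sketch assumes every identity appears exactly once in every metric:

```j
NB. Bucket label of each identity, ordered by identity number.
labels =: (#&> # i.@#) /: ;
NB. 1 where two identities carry the same label under one metric.
same =: =/~@labels
NB. Mean of the per-metric tables.
correl2 =: [: (+/ % #) same&>

labels 0 3 ; 1 2      NB. 0 1 1 0
correl2 jjss          NB. should reproduce the table printed by correl jjss
```

Dividing by the number of metrics replaces the division by the maximum in scale; the two coincide because the diagonal always equals the number of metrics.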
Other compatible datasets can be generated using the dyadic verb test_data
where x is the number of metrics and y is the number of identities.
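A few sanity checks that could be run against random data, assuming correl and test_data are defined as above (the name c is only for this illustration, and the values differ from run to run):

```j
c =: correl 3 test_data 5
$ c                       NB. 5 5, one row and column per identity
c -: |: c                 NB. 1 if the table is symmetric
*./ 1 = (<0 1) |: c       NB. 1 if every diagonal entry is 1
```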
-Dan
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm