----- Original Message -----
From: "Richard Lynch" <[EMAIL PROTECTED]>
To: <php-general@lists.php.net>
Sent: Thursday, January 11, 2007 11:29 PM
Subject: [PHP] Variance Function

Any advice?

Anybody got a good "variance" function to do what I'm trying to do?


Hey,
I've seen you solve many questions on this list, and I feel honoured to be able to try and help :)

Well, the solution that pops into my head is clustering. Since you have a set of numbers and one or more of them may be abnormal, you can cluster them into one or more groups of similar values.

I quickly read up on clustering and coded a function to do something you might find useful.

---- cluster.php ----
<?php

function mean($arr) {
return array_sum($arr) / count($arr);
}

function find_k_clusters($arr, $k) {

if ($k <= 1)
 return array($arr);

// Setup n clusters (and their means)
$cluster = array();
$clusterMean = array();
foreach ($arr as $a) {
 $cluster[] = array($a);
 $clusterMean[] = $a;
}

//populate an array of all the differences between pairs
$diff = array();
foreach ($clusterMean as $i => $c1) {
 $diff[$i] = array();
 foreach ($clusterMean as $j => $c2) {
  // Only fill entries with j < i, so each pair is stored once
  if ($i <= $j)
   break;
  $diff[$i][$j] = abs( $c1 - $c2 );
 }
}

while ( count($cluster) > $k ) {

 // find the smallest value (hence the closest pair)
 $p1 = false;
 $p2 = false;

 foreach ($diff as $i => $diffi) {
  foreach ($diffi as $j => $d) {
   if ($p1 === false || $d < $diff[$p1][$p2]) {
    $p1 = $i;
    $p2 = $j;
   }
  }
 }

 // Debug output: uncomment to trace which pair is merged at each step
 //echo "$p1 $p2\n";
 //print_r($cluster);

 // Add the 2nd cluster to the first, and remove the 2nd
 $cluster[ $p1 ] = array_merge ($cluster[ $p1 ], $cluster[ $p2 ]);
 $clusterMean[$p1] = mean( $cluster[ $p1 ] );
 unset( $cluster[ $p2 ] );
 unset( $clusterMean[ $p2 ] );

 // Now recalc any diffs that would have changed
 unset( $diff[ $p2 ] ); // Remove the $p2 row

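 // Remove the p2 col
 foreach( $diff as $i => &$ds ) {
  if ( $i > $p2 ) {
   unset($ds[$p2]);
  }
 }
 unset($ds); // break the reference left over from the foreach

 // recalc the full p1 row
 foreach ($diff[$p1] as $j => $d) {
  $diff[$p1][$j] = abs( $clusterMean[$p1] - $clusterMean[$j] );
 }

 // recalc the p1 col too: rows after p1 also hold a distance to p1
 foreach ($diff as $i => $row) {
  if ( $i > $p1 ) {
   $diff[$i][$p1] = abs( $clusterMean[$i] - $clusterMean[$p1] );
  }
 }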

}

return array_values( $cluster );
}

$a = array( 1132565342 , 0, 1132565360, 1000000, 1132565359, 1132565360, 1 );

print_r ( find_k_clusters($a, 2) ) ;


?>
-------------

Now you pass the function an array of values and the number of clusters you wish to find. So, for example, entering the array
1132565342 , 0, 1132565360, 1000000, 1132565359, 1132565360, 1
will return 2 clusters like so (the order of values within each cluster may differ):
[0] = 1132565342 , 1132565360, 1132565359, 1132565360
[1] = 0, 1000000, 1

It works by putting each value in its own cluster, and then finding the two "closest" clusters again and again until you are left with $k clusters. I haven't used the concept of variance.

Now it's just up to you to figure out which cluster is correct, and voila, you can throw away (or correct) the bad cluster values.

The problem might get more complex if you have, for example, dates such as 1970, 1990, 2006... because then 1990 will be nearer to 2006 and be clustered in the "good" cluster. If you have values like this, you might want to change the function so that instead of creating k clusters, it only clusters values within a suitable distance of each other (for example within 72 hours of each other, which is a maximum acceptable time for an email to be bounced around).
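
Here is a rough sketch of that distance-based variant, in case it helps. It assumes the values are Unix timestamps in seconds. The function name cluster_by_gap and the $maxGap parameter are made up for illustration and are not part of cluster.php. It sorts the values and starts a new cluster whenever the gap to the previous value exceeds the limit. Because it only compares neighbours, a long chain of closely spaced values could still span more than 72 hours in total.

```php
<?php

// Hypothetical helper (not part of cluster.php): sort the values, then start
// a new cluster whenever the gap to the previous value is larger than $maxGap.
function cluster_by_gap(array $values, $maxGap) {
    sort($values);
    $clusters = array();
    $current = array();
    foreach ($values as $v) {
        // Compare against the last value placed in the current cluster
        if ($current && $v - $current[count($current) - 1] > $maxGap) {
            $clusters[] = $current;
            $current = array();
        }
        $current[] = $v;
    }
    if ($current) {
        $clusters[] = $current;
    }
    return $clusters;
}

$maxGap = 72 * 3600; // 72 hours in seconds (assumes Unix timestamps)
$a = array( 1132565342 , 0, 1132565360, 1000000, 1132565359, 1132565360, 1 );
print_r( cluster_by_gap($a, $maxGap) );
```

With the sample array this gives three clusters rather than two: 0 and 1 together, 1000000 on its own, and the four values near 1132565360 together.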

I hope this helps in some way. If not, it was fun quickly coding up a clustering algorithm :)

On reflection, it might be a lot easier to not use clustering and instead just look at today's date, and throw away any value more than X days out.
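
Something like the following would do that, as a minimal sketch assuming the values are Unix timestamps. The name filter_recent and its parameters are invented for this example, and the $now argument is only there so the sample can use a fixed "today" instead of time().

```php
<?php

// Hypothetical helper: keep only the timestamps that lie within $maxDays
// of $now (in either direction). $now defaults to the current time.
function filter_recent(array $timestamps, $maxDays, $now = null) {
    if ($now === null) {
        $now = time();
    }
    $limit = $maxDays * 86400; // 86400 seconds per day
    $kept = array();
    foreach ($timestamps as $t) {
        if (abs($now - $t) <= $limit) {
            $kept[] = $t;
        }
    }
    return $kept;
}

// For the sample, pretend "today" is just after the plausible values.
$now = 1132565400;
$a = array( 1132565342 , 0, 1132565360, 1000000, 1132565359, 1132565360, 1 );
print_r( filter_recent($a, 3, $now) );
```

With X = 3 days this keeps the four values near 1132565360 and drops 0, 1 and 1000000.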

Andrew Brampton
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
