Re: [PHP] Variance Function

Andrew Brampton Thu, 11 Jan 2007 20:22:10 -0800

----- Original Message -----
From: "Richard Lynch" <[EMAIL PROTECTED]>
To: <php-general@lists.php.net>
Sent: Thursday, January 11, 2007 11:29 PM
Subject: [PHP] Variance Function

Any advice?

Anybody got a good "variance" function to do what I'm trying to do?


Hey,

I've seen you solve many questions on this list, and I feel honoured to be able to try and help :)

Well, the solution that pops into my head is clustering. Since you have a set of numbers and one or more of them may be abnormal, you can cluster them into one or more groups of similar values.

I quickly read up on clustering and coded a function to do something you might find useful.


---- cluster.php ----
<?php

function mean($arr) {
return array_sum($arr) / count($arr);
}

function find_k_clusters($arr, $k) {

if ($k <= 1)
 return array($arr);

// Setup n clusters (and their means)
$cluster = array();
$clusterMean = array();
foreach ($arr as $a) {
 $cluster[] = array($a);
 $clusterMean[] = $a;
}

//populate an array of all the differences between pairs
$diff = array();
foreach ($clusterMean as $i => $c1) {
 $diff[$i] = array();
 foreach ($clusterMean as $j => $c2) {
  // Stop once j reaches i, so each pair is only stored once
  if ($i <= $j)
   break;
  $diff[$i][$j] = abs( $c1 - $c2 );
 }
}

while ( count($cluster) > $k ) {

 // find the smallest value (hence the closest pair)
 $p1 = false;
 $p2 = false;

 foreach ($diff as $i => $diffi) {
  foreach ($diffi as $j => $d) {
   if ($p1 === false || $d < $diff[$p1][$p2]) {
    $p1 = $i;
    $p2 = $j;
   }
  }
 }

 //echo "$p1 $p2\n"; // debug: the pair of clusters about to be merged
 //print_r($cluster);

 // Add the 2nd cluster to the first, and remove the 2nd
 $cluster[ $p1 ] = array_merge ($cluster[ $p1 ], $cluster[ $p2 ]);
 $clusterMean[$p1] = mean( $cluster[ $p1 ] );
 unset( $cluster[ $p2 ] );
 unset( $clusterMean[ $p2 ] );

 // Now recalc any diffs that would have changed
 unset( $diff[ $p2 ] ); // Remove the $p2 row

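 // Remove the p2 col
 foreach( $diff as $i => &$ds ) {
  if ( $i > $p2 ) {
   unset($ds[$p2]);
  }
 }
 unset( $ds ); // drop the reference left over from the foreach

 // recalc the full p1 row
 foreach ($diff[$p1] as $j => $d) {
  $diff[$p1][$j] = abs( $clusterMean[$p1] - $clusterMean[$j] );
 }

 // recalc the p1 col too (later rows also store a diff against p1)
 foreach ($diff as $i => $diffi) {
  if ( isset($diffi[$p1]) ) {
   $diff[$i][$p1] = abs( $clusterMean[$i] - $clusterMean[$p1] );
  }
 }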

}

return array_values( $cluster );
}

$a = array( 1132565342 , 0, 1132565360, 1000000, 1132565359, 1132565360,1 );


print_r ( find_k_clusters($a, 2) ) ;


?>
-------------

Now you pass the function an array of values and the number of clusters you wish to find. So, for example, entering the array

1132565342 , 0, 1132565360, 1000000, 1132565359, 1132565360, 1
will return 2 clusters like so (the order of the values within each cluster may differ):
[0] = 1132565342 , 1132565360, 1132565359, 1132565360
[1] = 0, 1000000, 1

It works by putting each value in its own cluster, and then finding the two "closest" clusters (the pair whose means differ the least) and merging them, again and again, until you are left with $k clusters. I haven't used the concept of variance.

Now it's just up to you to figure out which cluster is correct, and voila, you can throw away (or correct) the bad cluster values.
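
One possible way to pick the "good" cluster, as a rough sketch only: assume the bad values are the minority and keep the biggest cluster. The helper name largest_cluster() is made up for this example, and the assumption may well not hold for your data.

```php
<?php

// Sketch only: largest_cluster() is a hypothetical helper. It assumes the
// bad values are the minority, so the biggest cluster is the "good" one.
function largest_cluster(array $clusters) {
    $best = array();
    foreach ($clusters as $c) {
        // Keep the cluster with the most values seen so far
        if (count($c) > count($best)) {
            $best = $c;
        }
    }
    return $best;
}

// The two clusters from the example above
$clusters = array(
    array(1132565342, 1132565360, 1132565359, 1132565360),
    array(0, 1000000, 1),
);
print_r( largest_cluster($clusters) );
```

If two clusters are the same size this keeps the first one, so a tie needs some other rule (for example, closeness to today's date).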

The problem might get more complex if you have, for example, dates such as 1970, 1990, 2006... because then the 1990 will be nearer to the 2006 and be clustered in the "good" cluster. If you have values such as this, you might want to change this so that instead of creating k clusters, it only clusters values within a suitable distance of each other (for example within 72 hours of each other, which is a max acceptable time for an email to be bounced around).
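
A minimal sketch of that distance-based variant, assuming the values are Unix timestamps in seconds. The names cluster_by_gap() and $maxGap are invented for this example. It sorts the values and starts a new cluster whenever the jump to the next value is bigger than $maxGap.

```php
<?php

// Sketch only: cluster_by_gap() and $maxGap are hypothetical names,
// not part of cluster.php.
function cluster_by_gap(array $values, $maxGap) {
    sort($values);            // in sorted order, neighbours are the closest pairs
    $clusters = array();
    $current = array();

    foreach ($values as $v) {
        // Start a new cluster when the jump from the previous value is too big
        if ($current && $v - end($current) > $maxGap) {
            $clusters[] = $current;
            $current = array();
        }
        $current[] = $v;
    }
    if ($current) {
        $clusters[] = $current;
    }
    return $clusters;
}

$stamps = array(1132565342, 0, 1132565360, 1000000, 1132565359, 1132565360, 1);
print_r( cluster_by_gap($stamps, 72 * 3600) );   // 72 hours, in seconds
```

Note that only neighbouring values are compared, so a long chain of values each less than $maxGap apart still ends up in one cluster, even if its two ends are further apart than that.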

I hope this helps in some way. If not, it was fun quickly coding up a clustering algorithm :)

On reflection, it might be a lot easier not to use clustering and instead just look at today's date and throw away any value more than X days out.
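
That simpler approach might look something like this, again assuming Unix timestamps in seconds. The function name filter_near_now() and its parameters are invented for this sketch; $now defaults to the current time but can be passed in for testing.

```php
<?php

// Sketch only: filter_near_now(), $maxDays and $now are hypothetical names.
function filter_near_now(array $stamps, $maxDays, $now = null) {
    if ($now === null) {
        $now = time();
    }
    $limit = $maxDays * 86400;   // 86400 seconds in a day
    $kept = array();
    foreach ($stamps as $s) {
        // Keep values no more than $maxDays before or after $now
        if (abs($s - $now) <= $limit) {
            $kept[] = $s;
        }
    }
    return $kept;
}

// Pretend "today" is just after the plausible values in the sample data
$stamps = array(1132565342, 0, 1132565360, 1000000, 1132565359, 1132565360, 1);
print_r( filter_near_now($stamps, 3, 1132565400) );
```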

Andrew Brampton

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
