[ 
https://issues.apache.org/jira/browse/MAHOUT-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174953#comment-13174953
 ] 

Jeff Eastman edited comment on MAHOUT-846 at 12/22/11 8:41 PM:
---------------------------------------------------------------

The following code refactors the pdf() function to sum the exponents and 
perform the exponentiation only once. It produces equivalent results to the old 
version for DisplayDirichlet and the math seems solid. I introduced two helper 
methods to make it all clearer. I believe it will handle wide vectors better 
than the original but have yet to verify that.

{code}
  @Override
  public double pdf(VectorWritable vw) {
    Vector x = vw.get();
    Vector m = getCenter();
    Vector s = getRadius().plus(0.0000001); // add a small prior to avoid 
divide by zero
    return Math.exp(-(divideSquareAndSum(x.minus(m), s) / 2)) / zProdSqt2Pi(s);
  }
  
  private double zProdSqt2Pi(Vector s) {
    double prod = 1;
    for (int i = 0; i < s.size(); i++) {
      prod *= s.getQuick(i) * UncommonDistributions.SQRT2PI;
    }
    return prod;
  }
  
  private double divideSquareAndSum(Vector numerator, Vector denominator) {
    double result = 0;
    for (Iterator<Element> it = denominator.iterateNonZero(); it.hasNext();) {
      Element denom = it.next();
      double quotient = numerator.getQuick(denom.index()) / denom.get();
      result += quotient * quotient;
    }
    return result;
  }
{code}
                
      was (Author: jeastman):
    The following code refactors the pdf() function to sum the exponents and 
perform the exponentiation only once. It produces equivalent results to the old 
version for DisplayDirichlet and the math seems solid. I introduced two helper 
methods to make it all clearer. I believe it will handle wide vectors better 
than the original but have yet to verify that.

{code}
  @Override
  public double pdf(VectorWritable vw) {
    Vector x = vw.get();
    Vector m = getCenter();
    Vector s = getRadius().plus(0.0000001); // add a small prior to avoid divide
                                            // by zero
    return Math.exp(-(divideSquareAndSum(x.minus(m), s) / 2))
        / zProd(s.times(UncommonDistributions.SQRT2PI));
  }
  
  private double zProd(Vector s) {
    double prod = 1;
    for (int i = 0; i < s.size(); i++) {
      prod *= s.getQuick(i);
    }
    return prod;
  }
  
  private double divideSquareAndSum(Vector numerator, Vector denominator) {
    double result = 0;
    for (Iterator<Element> it = denominator.iterateNonZero(); it.hasNext();) {
      Element denom = it.next();
      double quotient = numerator.getQuick(denom.index()) / denom.get();
      result += quotient * quotient;
    }
    return result;
  }
{code}
                  
> Improve Scalability of Gaussian Cluster For Wide Vectors
> --------------------------------------------------------
>
>                 Key: MAHOUT-846
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-846
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.6
>
>
> The pdf() implementation in GaussianCluster is pretty lame. It is computing a 
> running product of the element pdfs which, for wide input vectors (Reuters is 
> 41,807), always underflows and returns 0. Here's the code:
> {noformat}
>   public double pdf(VectorWritable vw) {
>     Vector x = vw.get();
>     // return the product of the component pdfs
>     // TODO: is this reasonable? correct? It seems to work in some cases.
>     double pdf = 1;
>     for (int i = 0; i < x.size(); i++) {
>       // small prior on stdDev to avoid numeric instability when stdDev==0
>       pdf *= UncommonDistributions.dNorm(x.getQuick(i),
>           getCenter().getQuick(i), getRadius().getQuick(i) + 0.000001);
>     }
>     return pdf;
> {noformat}
>   }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to