[
https://issues.apache.org/jira/browse/MAHOUT-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174953#comment-13174953
]
Jeff Eastman edited comment on MAHOUT-846 at 12/22/11 6:43 PM:
---------------------------------------------------------------
The following code refactors the pdf() function to sum the exponents and
perform the exponentiation only once. It produces equivalent results to the old
version for DisplayDirichlet and the math seems solid. I introduced two helper
methods to make it all clearer. I believe it will handle wide vectors better
than the original but have yet to verify that.
{code}
@Override
public double pdf(VectorWritable vw) {
Vector x = vw.get();
Vector m = getCenter();
Vector s = getRadius().plus(0.0000001); // add a small prior to avoid divide
// by zero
return Math.exp(-(divideSquareAndSum(x.minus(m), s) / 2))
/ zProd(s.times(UncommonDistributions.SQRT2PI));
}
private double zProd(Vector s) {
double prod = 1;
for (int i = 0; i < s.size(); i++) {
prod *= s.getQuick(i);
}
return prod;
}
private double divideSquareAndSum(Vector numerator, Vector denominator) {
double result = 0;
for (Iterator<Element> it = denominator.iterateNonZero(); it.hasNext();) {
Element denom = it.next();
double quotient = numerator.getQuick(denom.index()) / denom.get();
result += quotient * quotient;
}
return result;
}
{code}
was (Author: jeastman):
The following code refactors the pdf() function to sum the exponents and
perform the exponentiation only once. It produces equivalent results to the old
version for DisplayDirichlet and the math seems solid. I introduced two helper
methods to make it all clearer. I believe it will handle wide vectors better
than the original but have yet to verify that.
{code}
public double pdf(VectorWritable vw) {
Vector x = vw.get();
Vector m = getCenter();
Vector s = getRadius().plus(0.0000001); // add a small prior to avoid
divide by zero
Vector xms2 = divideAndSquare(x.minus(m), s);
return Math.exp(-(xms2.zSum() / 2))
/ zProd(s.times(UncommonDistributions.SQRT2PI));
}
private double zProd(Vector s) {
double prod = 1;
for (int i = 0; i < s.size(); i++) {
prod *= s.getQuick(i);
}
return prod;
}
private Vector divideAndSquare(Vector numerator, Vector denominator) {
Vector result = numerator.like();
for (Iterator<Element> it = denominator.iterateNonZero(); it.hasNext();) {
Element denom = it.next();
double quotient = numerator.getQuick(denom.index()) / denom.get();
result.setQuick(denom.index(), quotient * quotient);
}
return result;
}
{code}
> Improve Scalability of Gaussian Cluster For Wide Vectors
> --------------------------------------------------------
>
> Key: MAHOUT-846
> URL: https://issues.apache.org/jira/browse/MAHOUT-846
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.5
> Reporter: Jeff Eastman
> Assignee: Jeff Eastman
> Fix For: 0.6
>
>
> The pdf() implementation in GaussianCluster is pretty lame. It is computing a
> running product of the element pdfs which, for wide input vectors (Reuters is
> 41,807), always underflows and returns 0. Here's the code:
> {noformat}
> public double pdf(VectorWritable vw) {
> Vector x = vw.get();
> // return the product of the component pdfs
> // TODO: is this reasonable? correct? It seems to work in some cases.
> double pdf = 1;
> for (int i = 0; i < x.size(); i++) {
> // small prior on stdDev to avoid numeric instability when stdDev==0
> pdf *= UncommonDistributions.dNorm(x.getQuick(i),
> getCenter().getQuick(i), getRadius().getQuick(i) + 0.000001);
> }
> return pdf;
> {noformat}
> }
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira