[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hao Ren updated SPARK-18581:
----------------------------
    Description:
When training GaussianMixtureModel, I found some probabilities much larger than 1. That led me to the fact that MultivariateGaussian.pdf can return values such as 10^5.

After reviewing the code, I found that the problem lies in the computation of the determinant of the covariance matrix.

The computation is simplified by using the pseudo-determinant of a positive definite matrix. However, if the eigenvalues are all between 0 and 1, log(pseudo-determinant) will be a negative number such as -50. As a result, the logpdf becomes positive (pdf > 1).

The related code is the following:

// In function: MultivariateGaussian.calculateCovarianceConstants()
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

d is the eigenvalue vector here. If many of its elements are between 0 and 1, then logPseudoDetSigma could be negative.

  was:
When training GaussianMixtureModel, I found some probabilities much larger than 1. That led me to the fact that MultivariateGaussian.pdf can return values such as 10^5.

After reviewing the code, I found that the problem lies in the computation of the determinant of the covariance matrix.

The computation is simplified by using the pseudo-determinant of a positive definite matrix. However, if the eigenvalues are all between 0 and 1, log(pseudo-determinant) will be a negative number such as -50. As a result, the logpdf becomes positive (pdf > 1).

The related code is the following:

// In function: MultivariateGaussian.calculateCovarianceConstants()
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

d is the eigenvalue vector here. If many of its elements are between 0 and 1, then logPseudoDetSigma could be negative.

Maybe we should just use the breeze 'det' operation on sigma to get the right but slow answer instead of a quick, wrong one.
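The arithmetic described above can be reproduced with a small standalone sketch. This is not the Spark implementation: it uses plain Scala collections instead of breeze, and the object name, the ten eigenvalues of 0.01, and the tolerance of 1e-12 are all assumptions made for illustration. It applies the same filter-and-sum as the quoted line, then plugs the result into the standard Gaussian normalization term evaluated at the mean, where the quadratic term is zero.

```scala
object LogPseudoDetSketch {
  // Sum of the logs of the eigenvalues above the tolerance,
  // mirroring the expression quoted in the description.
  def logPseudoDet(eigenvalues: Seq[Double], tol: Double): Double =
    eigenvalues.filter(_ > tol).map(math.log).sum

  // Log-density at the mean of a Gaussian whose covariance has these eigenvalues.
  // Assumes every eigenvalue exceeds tol, so the dimension equals eigenvalues.size.
  // Standard formula: -0.5 * (k * log(2 * pi) + log det(sigma)).
  def logPdfAtMean(eigenvalues: Seq[Double], tol: Double): Double =
    -0.5 * (eigenvalues.size * math.log(2.0 * math.Pi) + logPseudoDet(eigenvalues, tol))

  def main(args: Array[String]): Unit = {
    val d = Seq.fill(10)(0.01) // hypothetical eigenvalues, all between 0 and 1
    val tol = 1e-12            // hypothetical tolerance
    val logDet = logPseudoDet(d, tol)
    val logPdf = logPdfAtMean(d, tol)
    println(f"logPseudoDetSigma = $logDet%.3f") // negative for these inputs
    println(f"logpdf at mean    = $logPdf%.3f") // positive for these inputs
    println(s"pdf at mean > 1: ${math.exp(logPdf) > 1.0}")
  }
}
```

With these inputs every log term is negative, so the sum is negative and large enough in magnitude to outweigh the k * log(2 * pi) term, which makes the log-density at the mean positive.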
> MultivariateGaussian does not check if covariance matrix is invertible
> ----------------------------------------------------------------------
>
>                 Key: SPARK-18581
>                 URL: https://issues.apache.org/jira/browse/SPARK-18581
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.2, 2.0.2
>            Reporter: Hao Ren