GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/19029

    [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSummarizer.variance 
generate negative result

    ## What changes were proposed in this pull request?
    
    Because of numerical error, MultivariateOnlineSummarizer.variance can
    return a negative variance.
    **This is a serious bug because many algorithms in MLlib compute stddev
    as sqrt(variance), which yields NaN for a negative input and crashes
    the whole algorithm.**
    The bug can be reproduced with the following code:
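    ```
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

        // Four single-sample summarizers, all holding the same value (3.0)
        // but with different weights.
        val summarizer1 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.7)
        val summarizer2 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.4)
        val summarizer3 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.5)
        val summarizer4 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.4)
    
        val summarizer = summarizer1
          .merge(summarizer2)
          .merge(summarizer3)
          .merge(summarizer4)
    
        // The variance of identical values should never be negative.
        println(summarizer.variance(0))
    ```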
    This PR fixes the bug in `mllib.stat.MultivariateOnlineSummarizer.variance`
    and `ml.stat.SummarizerBuffer.variance` (the latter is newly added and
    has similar logic).
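    
    To illustrate why even a tiny negative variance is harmful, here is a
    minimal, self-contained sketch. The `VarianceGuard` object and its
    `safeStd` helper are hypothetical names used only for illustration. They
    are not part of Spark, and clamping at zero is shown as one common guard,
    not necessarily the exact change made in this PR.
    
    ```scala
    object VarianceGuard {
      // Hypothetical helper (not part of Spark): treat a tiny negative
      // variance caused by rounding as zero before taking the square root.
      def safeStd(variance: Double): Double =
        math.sqrt(math.max(variance, 0.0))

      def main(args: Array[String]): Unit = {
        // Stand-in for a rounding artifact; an assumed value, not Spark output.
        val noisyVariance = -1.0e-16
        println(math.sqrt(noisyVariance).isNaN) // unguarded sqrt yields NaN
        println(safeStd(noisyVariance))         // guarded sqrt yields 0.0
      }
    }
    ```
    
    Clamping affects only values below zero, so any non-negative variance
    passes through unchanged.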
    
    ## How was this patch tested?
    
    Test cases added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark fix_summarizer_var_bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19029.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19029
    
----
commit 9c92730bc3588596b348932ea285b12c5a4a77ce
Author: WeichenXu <[email protected]>
Date:   2017-08-23T10:52:56Z

    init pr

----


