[ https://issues.apache.org/jira/browse/SPARK-16561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weichen Xu updated SPARK-16561:
-------------------------------
    Description:
The min/max methods of `MultivariateOnlineSummarizer` use the check `nnz(i) < weightSum`, which can cause numerical problems and make the result unstable.

For example, add two vectors:
[10, -10] with weight 1e10
[0, 0] with weight 1e-10

Using MultivariateOnlineSummarizer.min/max we get
minVector = [10, -10]
maxVector = [10, -10]
but the correct result is
minVector = [0, -10]
maxVector = [10, 0]

The cause of the bug is that
(1e10 + 1e-10) == 1e10
because of floating-point rounding.

In addition, a different accumulation or merge order may produce a different result. For example:
[10, -10] with weight 1e10
[0, 0] with weight 1e-7
....
(100 rows of [0, 0] with weight 1e-7)

With the input in the order listed above, the result is:
minVector = [10, -10]
maxVector = [10, -10]

But if the input order is as follows:
[0, 0] with weight 1e-7
....
(100 rows of [0, 0] with weight 1e-7)
[10, -10] with weight 1e10

then the result is:
minVector = [0, -10]
maxVector = [10, 0]

That is because
1e10 + 1e-7 + ... + 1e-7 (added 100 times) == 1e10
but
1e-7 + ... + 1e-7 (added 100 times) + 1e10 = 1.000000000000001E10 != 1e10

> Potential numerical problem in MultivariateOnlineSummarizer min/max
> -------------------------------------------------------------------
>
> Key: SPARK-16561
> URL: https://issues.apache.org/jira/browse/SPARK-16561
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.0.0, 2.0.1, 2.1.0
> Reporter: Weichen Xu
> Assignee: Apache Spark
> Original Estimate: 24h
> Remaining Estimate: 24h
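
The rounding behaviour reported in this issue can be reproduced outside Spark. The following is a minimal sketch in Python, whose `float` is an IEEE 754 double like Scala's `Double`. The class `WeightedMinMax` and the helper `summarize` are hypothetical names introduced only for illustration: they mimic the `nnz(i) < weightSum` check described in the issue and are not Spark's actual implementation.

```python
# Hypothetical, simplified model of the check described in SPARK-16561.
# It is NOT Spark's code; it only tracks what is needed for min/max.

class WeightedMinMax:
    def __init__(self, dim):
        self.weight_sum = 0.0                 # total weight of all rows seen
        self.nnz = [0.0] * dim                # per-dimension weight of non-zero entries
        self.cur_min = [float("inf")] * dim   # min over non-zero entries only
        self.cur_max = [float("-inf")] * dim  # max over non-zero entries only

    def add(self, vector, weight):
        for i, v in enumerate(vector):
            if v != 0.0:
                self.nnz[i] += weight
                self.cur_min[i] = min(self.cur_min[i], v)
                self.cur_max[i] = max(self.cur_max[i], v)
        self.weight_sum += weight
        return self

    def min(self):
        # A zero is taken into account only if nnz(i) < weightSum.
        return [min(m, 0.0) if n < self.weight_sum else m
                for m, n in zip(self.cur_min, self.nnz)]

    def max(self):
        return [max(m, 0.0) if n < self.weight_sum else m
                for m, n in zip(self.cur_max, self.nnz)]


def summarize(rows):
    """Feed (vector, weight) pairs in the given order; return (min, max)."""
    s = WeightedMinMax(2)
    for vector, weight in rows:
        s.add(vector, weight)
    return s.min(), s.max()


# The small weight is absorbed by the large one.
absorbed = (1e10 + 1e-10 == 1e10)

# First example from the issue: the zero row is effectively ignored.
two_rows = summarize([([10.0, -10.0], 1e10), ([0.0, 0.0], 1e-10)])

# Second example: same data, two different input orders.
big = [([10.0, -10.0], 1e10)]
small = [([0.0, 0.0], 1e-7)] * 100
big_first = summarize(big + small)
small_first = summarize(small + big)

print(absorbed)      # True
print(two_rows)      # ([10.0, -10.0], [10.0, -10.0])
print(big_first)     # ([10.0, -10.0], [10.0, -10.0])
print(small_first)   # ([0.0, -10.0], [10.0, 0.0])
```

With the large weight added first, every later `1e-7` is smaller than half the spacing between adjacent doubles near `1e10` and is rounded away, so `weight_sum` never exceeds `nnz(i)`. With the small weights accumulated first, their sum is large enough to survive the final addition, and the check succeeds.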