[GitHub] [spark] MaxGekk commented on a change in pull request #20793: [SPARK-23643][CORE][SQL][ML] Shrinking the buffer in hashSeed up to size of the seed parameter

GitBox Wed, 20 Mar 2019 00:24:09 -0700

MaxGekk commented on a change in pull request #20793: 
[SPARK-23643][CORE][SQL][ML] Shrinking the buffer in hashSeed up to size of the 
seed parameter
URL: https://github.com/apache/spark/pull/20793#discussion_r267209690

##########
File path:
mllib/src/test/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringSuite.scala
##########
@@ -48,7 +48,7 @@ class PowerIterationClusteringSuite extends SparkFunSuite
with MLlibTestSparkCon
// Generate two circles following the example in the PIC paper.
val r1 = 1.0
val n1 = 10
- val r2 = 4.0
+ val r2 = 14.0

Review comment:
I don't know how to pass it as is. I played with PIC slightly in master
(without the changes) by modifying seed
[there](https://github.com/apache/spark/blob/25bcf59b3b566b77bfc8a40a4f4253b81f340aa4/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala#L318)
like:
```
val random = new XORShiftRandom(part + 1234)
```
The test can pass maybe in 10% of seeds I put there. Not sure we can expect
deterministic behaviour of test. Maybe there are some issues in implementation
of PIC.

Just in case, in general, I suppose in the PR that existing implementations
of algorithms are correct and tolerant to the seed of pseudo-random generator.
I don't think I should fix the implementations here. I would propose to open
JIRA tickets, and fix issues there.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] MaxGekk commented on a change in pull request #20793: [SPARK-23643][CORE][SQL][ML] Shrinking the buffer in hashSeed up to size of the seed parameter

Reply via email to