Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
I decided to provide a few more benchmarks in order to alleviate some of the
concerns raised by @srowen.
To reproduce the results, add the following snippet to the ALSSuite class:
```scala
test("Speed difference") {
  // Small synthetic explicit-feedback dataset, reused across all runs.
  val (training, test) =
    genExplicitTestData(numUsers = 200, numItems = 400, rank = 2,
      noiseStd = 0.01)
  val runs = 1000
  var totalTime = 0.0
  println(s"Performing $runs runs")
  println("Run Id,Time (secs)")
  for (i <- 0 until runs) {
    // Wall-clock time of one testALS call, in seconds.
    val t0 = System.currentTimeMillis
    testALS(training, test, maxIter = 1, rank = 2, regParam = 0.01,
      targetRMSE = 0.1)
    val secs = (System.currentTimeMillis - t0) / 1000.0
    println(s"$i,$secs")
    totalTime += secs
  }
  println(s"AVG Execution Time: ${totalTime / runs} secs")
}
```
To test both solutions, I collected 1000 samples for each (took ~1 hour for
each). Here you can see the detailed output for the
[original](http://pastebin.com/ys9Vejs9) and the
[proposed](http://pastebin.com/dCpkyMGc) code.
| Code | Mean Execution Time (secs) | Std (secs) |
| --- | --- | --- |
| Original | 4.75521 | 0.81237 |
| Proposed | 4.56276 | 0.72790 |
Using an unpaired t-test to compare the two means, we find that the proposed
code is faster and that the difference is statistically significant
(p-value < .0001).
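
For readers who want to check the arithmetic, here is a minimal sketch that recomputes the t statistic from the summary numbers in the table. The names `TTestSketch`, `Summary` and `tStatistic` are hypothetical and not part of Spark or the PR. It uses the unequal-variance (Welch) form of the standard error; with equal sample sizes, as here, the pooled-variance form gives the same statistic.

```scala
object TTestSketch {
  // Summary statistics for one benchmark variant (hypothetical helper type).
  final case class Summary(mean: Double, std: Double, n: Int)

  // t statistic for two independent samples:
  // difference of means divided by the standard error of that difference.
  def tStatistic(a: Summary, b: Summary): Double = {
    val standardError =
      math.sqrt(a.std * a.std / a.n + b.std * b.std / b.n)
    (a.mean - b.mean) / standardError
  }

  def main(args: Array[String]): Unit = {
    // Values copied from the results table; 1000 runs per variant.
    val original = Summary(mean = 4.75521, std = 0.81237, n = 1000)
    val proposed = Summary(mean = 4.56276, std = 0.72790, n = 1000)
    println(f"t = ${tStatistic(original, proposed)}%.2f")
  }
}
```

With these inputs the statistic comes out at roughly 5.6, which with about 2000 observations is in line with the reported p-value.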
Below I summarize why I believe the original code needs to change:
1. Casting user and item ids to double and then to integer is a hacky and
indirect way to validate that the ids are numerical and within integer range.
The proposed code covers all the corner cases in a clear and direct way. As an
added bonus, it handles Doubles and Floats with a fractional part.
2. Given that the ALS implementation requires ids with int values, it is
expected that the majority of users encode their ids as Integer. The proposed
solution avoids any casting in that case while reducing the casting in all the
other cases. This avoids putting unnecessary strain on the garbage collector,
something that you can observe if you profile the execution on a large dataset.
3. The proposed solution is not slower than the original; if anything, it
is slightly faster.
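
The kind of direct check described in points 1 and 2 could be sketched roughly as follows. This is a standalone illustration rather than the code in the PR: `IdCheckSketch` and `checkedCast` are hypothetical names, the function works on plain values instead of Spark columns, and rejecting fractional values with an exception is an assumption about how such ids are handled.

```scala
object IdCheckSketch {
  // Returns the id as an Int, or throws if it cannot be represented as one.
  def checkedCast(id: Any): Int = id match {
    // Already an Int: return it without any casting.
    case v: Int => v
    // Any other numeric type: accept it only if converting to Int loses nothing.
    case v: Number =>
      val intV = v.intValue
      // False for out-of-range values, NaN, and values with a fractional part.
      if (v.doubleValue == intV) intV
      else throw new IllegalArgumentException(
        s"Id $v is out of Integer range or has a fractional part.")
    // Non-numeric input (assumed invalid in this sketch).
    case other =>
      throw new IllegalArgumentException(s"Id $other is not numeric.")
  }
}
```

For example, `IdCheckSketch.checkedCast(7.0)` returns `7`, while `IdCheckSketch.checkedCast(1.5)` and `IdCheckSketch.checkedCast(Int.MaxValue.toLong + 1)` both throw an `IllegalArgumentException`.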