Github user datumbox commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103550608
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    @imatiach-msft I am aware of the implications of floating point precision 
and I understand your concerns. 
    
    Having said that, allowing user and item ids to be double/float at all 
is not a good idea; I guess we keep it only for backwards compatibility. Also 
note that the current implementation in Spark 2.1 will actually take your 
0.9999999999999996 value and silently cast it to Int (so it becomes 0)! In my 
view the only permitted types should have been Integer, Long and BigInteger.
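
To illustrate the silent truncation, here is a minimal standalone sketch (no Spark needed). `oldCast` is a hypothetical name that just mirrors the body of the removed `checkedCast` lambda from the diff above; it is not code from the PR itself.

```scala
object TruncationDemo {
  // Hypothetical helper mirroring the removed lambda: range check, then toInt.
  def oldCast(n: Double): Int =
    if (n > Int.MaxValue || n < Int.MinValue) {
      throw new IllegalArgumentException(s"Value $n was out of Integer range.")
    } else {
      n.toInt // truncates toward zero, so any fractional part is dropped
    }

  def main(args: Array[String]): Unit = {
    val almostOne = 0.9999999999999996
    println(oldCast(almostOne))                   // prints 0
    // The comparison used in the new code notices the lost fraction.
    println(almostOne == almostOne.toInt.toDouble) // prints false
  }
}
```

The range check passes because the value is well inside the Int range, so the id quietly becomes 0.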
    
    I don't have strong opinions about refactoring anything in the Number 
case, as performance-wise it simply does not matter. The point of this PR is to 
optimize the common case where the id is an Int, because the casting in the 
current approach generates twice as much data as the original dataset (of 
course it is GCed, but at a cost). 
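
For reference, a rough sketch of the idea as a plain function rather than a UDF. The quoted hunk is cut off after the `if`, so the `else` branch, the fallback case and the error messages below are my assumptions, not the actual code in the PR.

```scala
object CheckedCastSketch {
  def checkedCast(n: Any): Int = n match {
    case v: Int => v // fast path: ids that are already Int are returned as-is
    case v: Number =>
      val intV = v.intValue()
      // Equal only if the value is within Int range and has no fractional part.
      if (v.doubleValue() == intV.toDouble) {
        intV
      } else {
        // Assumed behaviour: reject instead of silently truncating.
        throw new IllegalArgumentException(
          s"Value $n was out of Integer range or not an integer.")
      }
    case other =>
      // Assumed fallback for non-numeric input.
      throw new IllegalArgumentException(s"Value $other is not numeric.")
  }
}
```

With this shape, `checkedCast(7)` takes the first case, `checkedCast(7.0)` and `checkedCast(7L)` go through the Number case and return 7, and `checkedCast(0.9999999999999996)` or `checkedCast(3000000000L)` throw.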
    
    @MLnick @srowen It's your call.

