Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/17059#discussion_r103547149
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
@@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
    * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
    * out of integer range.
    */
-  protected val checkedCast = udf { (n: Double) =>
-    if (n > Int.MaxValue || n < Int.MinValue) {
-      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
-        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
-    } else {
-      n.toInt
+  protected val checkedCast = udf { (n: Any) =>
+    n match {
+      case v: Int => v // Avoid unnecessary casting
+      case v: Number =>
+        val intV = v.intValue()
+        // Checks if number within Int range and has no fractional part.
+        if (v.doubleValue == intV) {
--- End diff --
sorry, one more note with regard to the fractional-part check: sums like
0.764553836902861 + 0.0701367068431045 + 0.165309456254034 do not compare equal to 1
because of how IEEE-754 floating point represents the result, so I strongly encourage
using a small threshold when checking floating point values. These ids could be coming
from anywhere, e.g. they might have been stored in a SQL database or some other external
source, and when we load them the exact-equality check can produce surprising errors.
For example, the following test case used to work with the previous code:
test("verify can run on double values") {
val spark = this.spark
import spark.implicits._
val als = new ALS().setMaxIter(1).setRank(1)
val df = Seq(
(0D, 0.764553836902861 + 0.0701367068431045 + 0.165309456254034, 3.0),
(0D, 0D, 2.0),
(0.764553836902861 + 0.0701367068431045 + 0.165309456254034, 0D, 5.0)
).toDF("user", "item", "rating")
df.show()
val model = als.fit(df)
}
With the new code it fails with the following error:

    Caused by: java.lang.IllegalArgumentException: ALS only supports values in Integer range
      for columns user and item. Value 0.9999999999999996 was out of Integer range.
This value is very close to 1, however, and mathematically
0.764553836902861 + 0.0701367068431045 + 0.165309456254034 sums to exactly 1; it is only
the IEEE-754 floating point representation that makes the computed result fall just short.
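For reference, the representation issue, and how a small tolerance handles it, can be shown
with a couple of lines of plain Scala (no Spark needed); the 1e-9 tolerance here is just an
illustrative value, not a proposal for a specific constant:

    val d = 0.764553836902861 + 0.0701367068431045 + 0.165309456254034
    // d evaluates to 0.9999999999999996, the value reported in the error above
    d == 1.0                            // false: exact comparison rejects it
    math.abs(d - math.round(d)) < 1e-9  // true: it is within a tiny tolerance of an Int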
If we do insist on breaking existing users by removing the rounding, we should at least give
them an escape hatch to specify the precision/tolerance, for example along the lines of the
sketch below.
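To make the suggestion concrete, here is a rough sketch of a tolerance-based checkedCast, in
the same trait and with the same imports as the diff above. This is not the actual patch: the
hard-coded 1e-9 epsilon (which could instead be exposed as a Param to provide the escape
hatch) and the error message wording are only placeholders.

    protected val checkedCast = udf { (n: Any) =>
      n match {
        case v: Int => v // no cast needed
        case v: Number =>
          val d = v.doubleValue
          val rounded = math.round(d)
          // Accept the value if it is within Int range and within a small
          // epsilon of an integer, rather than requiring an exact match.
          if (rounded >= Int.MinValue && rounded <= Int.MaxValue &&
              math.abs(d - rounded) < 1e-9) {
            rounded.toInt
          } else {
            throw new IllegalArgumentException(s"ALS only supports values in Integer range " +
              s"for columns ${$(userCol)} and ${$(itemCol)}. Value $n was invalid.")
          }
        case _ =>
          throw new IllegalArgumentException(s"ALS only supports numeric values for columns " +
            s"${$(userCol)} and ${$(itemCol)}. Got $n.")
      }
    }

With a check like this, the 0.9999999999999996 id above would be cast to 1, while genuinely
fractional ids such as 0.5 would still be rejected.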