Re: [PR] [SPARK-45547][ML] Validate Vectors with built-in function [spark]

via GitHub Tue, 17 Oct 2023 01:57:47 -0700


zhengruifeng commented on PR #43380:
URL: https://github.com/apache/spark/pull/43380#issuecomment-1765976728


   ```
   scala> import org.apache.spark.ml.linalg._
        |
        | val df = Seq.range(0, 1000000).map(i => (i,Vectors.dense(Array.fill(256)(1.0)))).toDF("i", "vec")
        |
   import org.apache.spark.ml.linalg._
   val df: org.apache.spark.sql.DataFrame = [i: int, vec: vector]
   
   scala> df.count()
   23/10/17 16:28:31 WARN TaskSetManager: Stage 0 contains a task of very large size (1391 KiB). The maximum recommended task size is 1000 KiB.
   val res0: Long = 1000000
   
   scala> val validateUDF = udf { vector: Vector =>
        |     vector match {
        |       case dv: DenseVector =>
        |         dv.values.forall(v => !v.isNaN && !v.isInfinity)
        |       case sv: SparseVector =>
        |         sv.values.forall(v => !v.isNaN && !v.isInfinity)
        |     }
        |   }
   val validateUDF: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$4177/0x000000b00198fa00@17640714,BooleanType,List(Some(class[value[0]: vector])),Some(class[value[0]: boolean]),None,false,true)
   
   scala> val validatedCol = forall(unwrap_udt(col("vec")).getField("values"), v => not(v.isNaN) && abs(v) =!= expr("double('inf')"))
   val validatedCol: org.apache.spark.sql.Column = forall(unwrap_udt(vec)[values], lambdafunction(and(`!`(isNaN(x_0)), `!`(`=`(abs(x_0), double(inf)))), x_0))
   
   scala> val start = System.currentTimeMillis; df.select(bool_and(validateUDF(col("vec")))).head(); System.currentTimeMillis - start
   23/10/17 16:28:47 WARN TaskSetManager: Stage 3 contains a task of very large size (176683 KiB). The maximum recommended task size is 1000 KiB.
   val start: Long = 1697531323779
   val res1: Long = 4562
   
   scala>
   
   scala> val start = System.currentTimeMillis; df.select(bool_and(validatedCol)).head(); System.currentTimeMillis - start; System.currentTimeMillis - start
   23/10/17 16:28:52 WARN TaskSetManager: Stage 6 contains a task of very large size (176683 KiB). The maximum recommended task size is 1000 KiB.
   val start: Long = 1697531329903
   val res2: Long = 4637
   ```
   
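   I did a quick test, and there seems to be no significant performance 
difference: 4562 ms with the UDF vs. 4637 ms with the built-in expression. 
This matters even less because the validation is only performed once.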
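   
   Each variant above was timed with a single run, so the numbers are only a 
rough indication. A minimal sketch of a helper that repeats the measurement is 
shown below; `Timing` and `timeMs` are hypothetical names introduced here for 
illustration and are not part of Spark or of this PR.
   
   ```scala
   // Hypothetical helper (not a Spark API): times a block several times.
   object Timing {
     // Evaluates `body` `runs` times and returns the wall-clock
     // duration of each run in milliseconds.
     def timeMs[A](runs: Int)(body: => A): Seq[Long] =
       (1 to runs).map { _ =>
         val start = System.nanoTime()
         body // result is discarded; only the elapsed time is kept
         (System.nanoTime() - start) / 1000000L
       }
   }
   
   // Possible use in the same spark-shell session as the transcript:
   //   Timing.timeMs(5)(df.select(bool_and(validatedCol)).head())
   ```
   
   Looking at several runs rather than one would make it easier to tell whether 
a gap of a few tens of milliseconds is just noise.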
   
   Using built-in functions helps simplify the code and may also benefit from 
SQL optimization.
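   
   The two checks in the transcript are intended to accept exactly the same 
values. A plain-Scala sketch of that equivalence, without Spark, is below; 
`FiniteCheck`, `udfStyle`, `builtinStyle` and `samples` are hypothetical names 
used only for illustration, and `builtinStyle` is a scalar analogue of the 
column expression, not the expression itself.
   
   ```scala
   // Sketch only: plain Scala, no Spark dependency.
   object FiniteCheck {
     // Per-element predicate used inside the UDF in the transcript.
     def udfStyle(v: Double): Boolean = !v.isNaN && !v.isInfinity
   
     // Scalar analogue of the built-in form: not(isNaN) && abs(v) =!= inf.
     def builtinStyle(v: Double): Boolean =
       !v.isNaN && math.abs(v) != Double.PositiveInfinity
   
     // A few ordinary and edge-case values to compare the predicates on.
     val samples: Seq[Double] = Seq(
       0.0, -0.0, 1.0, -1.5,
       Double.MaxValue, Double.MinValue, Double.MinPositiveValue,
       Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity
     )
   
     // True when both predicates give the same answer for every sample.
     def agree: Boolean = samples.forall(v => udfStyle(v) == builtinStyle(v))
   }
   ```
   
   Taking the absolute value first lets a single comparison against positive 
infinity cover both infinities.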


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

