WeichenXu123 commented on a change in pull request #27522: [WIP][SPARK-30762]
Add dtype=float32 support to vector_to_array UDF
URL: https://github.com/apache/spark/pull/27522#discussion_r377038754
##########
File path: mllib/src/main/scala/org/apache/spark/ml/functions.scala
##########
@@ -21,21 +21,33 @@ import org.apache.spark.annotation.Since
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.{Vector => OldVector}
import org.apache.spark.sql.Column
+import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.functions.udf
// scalastyle:off
@Since("3.0.0")
object functions {
// scalastyle:on
- private val vectorToArrayUdf = udf { vec: Any =>
- vec match {
- case v: Vector => v.toArray
- case v: OldVector => v.toArray
- case v => throw new IllegalArgumentException(
- "function vector_to_array requires a non-null input argument and input type must be " +
- "`org.apache.spark.ml.linalg.Vector` or `org.apache.spark.mllib.linalg.Vector`, " +
- s"but got ${ if (v == null) "null" else v.getClass.getName }.")
+ private val vectorToArrayUdf = udf { (vec: Any, dtype: String) => {
+ val new_vec =
+ vec match {
+ case v: Vector => v.toArray
+ case v: OldVector => v.toArray
+ case v => throw new IllegalArgumentException(
+ "function vector_to_array requires a non-null input argument and input type must be " +
+ "`org.apache.spark.ml.linalg.Vector` or `org.apache.spark.mllib.linalg.Vector`, " +
+ s"but got ${ if (v == null) "null" else v.getClass.getName }.")
+ }
+ if (dtype == "float64") {
+ new_vec
+ } else if (dtype == "float32") {
+ new_vec.map(_.toFloat)
Review comment:
For the sparse vector + float32 case, this first converts the vector to a
dense double array (`toArray`) and then maps that to a float array, so two
full-size arrays are allocated.
We should do a simple optimization:
1. Create the float array first.
2. Use `SparseVector.foreachActive` to fill this array.
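
A minimal sketch of the two steps above, assuming `spark-mllib-local` is on the classpath. The object and method names (`VectorToFloatArray`, `toFloatArray`) are hypothetical and only for illustration; this is not the code in the PR.

```scala
import org.apache.spark.ml.linalg.{SparseVector, Vector}

// Hypothetical helper, for illustration only.
object VectorToFloatArray {
  def toFloatArray(vec: Vector): Array[Float] = vec match {
    case sv: SparseVector =>
      // Step 1: allocate the float array up front. JVM arrays are
      // zero-initialized, so inactive entries are already 0.0f.
      val data = new Array[Float](sv.size)
      // Step 2: write only the active (stored) entries.
      sv.foreachActive { (index, value) => data(index) = value.toFloat }
      data
    case other =>
      // Other vectors (e.g. dense): convert element by element.
      other.toArray.map(_.toFloat)
  }
}
```

With this, the sparse path never builds the intermediate double array; for example, `Vectors.sparse(4, Array(1, 3), Array(2.0, 4.5))` would yield `Array(0.0f, 2.0f, 0.0f, 4.5f)`.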