zhengruifeng opened a new pull request, #40432:
URL: https://github.com/apache/spark/pull/40432

   ### What changes were proposed in this pull request?
   Implement the ML functions `array_to_vector` and `vector_to_array` in the Spark Connect Python client.
   
   
   ### Why are the changes needed?
   Function parity between Spark Connect and the existing PySpark ML functions.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, it adds new functions.
   
   ### How was this patch tested?
   Added unit tests and checked manually:
   
   ```
   (spark_dev) ➜  spark git:(connect_ml_functions) ✗ bin/pyspark --remote "local[*]"
   Python 3.9.16 (main, Mar  8 2023, 04:29:24) 
   Type 'copyright', 'credits' or 'license' for more information
   IPython 8.11.0 -- An enhanced Interactive Python. Type '?' for help.
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   23/03/15 11:56:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /__ / .__/\_,_/_/ /_/\_\   version 3.5.0.dev0
         /_/
   
   Using Python version 3.9.16 (main, Mar  8 2023 04:29:24)
   Client connected to the Spark Connect server at localhost
   SparkSession available as 'spark'.
   
   In [1]: query = """
      ...:             SELECT * FROM VALUES
      ...:             (1, 4, ARRAY(1.0, 2.0, 3.0)),
      ...:             (1, 2, ARRAY(-1.0, -2.0, -3.0))
      ...:             AS tab(a, b, c)
      ...:             """
   
   In [2]: cdf = spark.sql(query)
   
   In [3]: from pyspark.sql.connect.ml import functions as CF
   
   In [4]: cdf1 = cdf.select("a", CF.array_to_vector(cdf.c).alias("d"))
   
   In [5]: cdf1.show()
   +---+----------------+
   |  a|               d|
   +---+----------------+
   |  1|   [1.0,2.0,3.0]|
   |  1|[-1.0,-2.0,-3.0]|
   +---+----------------+
   
   
   In [6]: cdf1.schema
   Out[6]: StructType([StructField('a', IntegerType(), False), StructField('d', VectorUDT(), True)])
   
   In [7]: cdf1.select(CF.vector_to_array(cdf1.d))
   Out[7]: DataFrame[UDF(d): array<double>]
   
   In [8]: cdf1.select(CF.vector_to_array(cdf1.d)).show()
   +------------------+
   |            UDF(d)|
   +------------------+
   |   [1.0, 2.0, 3.0]|
   |[-1.0, -2.0, -3.0]|
   +------------------+
   
   
   In [9]: cdf1.select(CF.vector_to_array(cdf1.d)).schema
   Out[9]: StructType([StructField('UDF(d)', ArrayType(DoubleType(), False), False)])
   
   ```
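
For reference, the manual check above can be condensed into a small helper. This is only a sketch, not part of the patch: `ROWS`, `round_trip`, and `demo` are hypothetical names, and the `fns` argument is assumed to be any module that exposes `array_to_vector` and `vector_to_array` (for example the `CF` module imported in the session above). It also assumes `spark.createDataFrame` accepts a DDL schema string in the session being used.

```python
# Same rows as the SQL VALUES clause in the session above.
ROWS = [
    (1, 4, [1.0, 2.0, 3.0]),
    (1, 2, [-1.0, -2.0, -3.0]),
]


def round_trip(df, fns):
    """Convert array column `c` to a vector column `d`, then back to an array.

    `fns` is assumed to be a module providing array_to_vector and
    vector_to_array; it is passed in rather than imported so the helper
    does not depend on one particular import path.
    """
    vec_df = df.select("a", fns.array_to_vector(df.c).alias("d"))
    return vec_df.select("a", fns.vector_to_array(vec_df.d).alias("c"))


def demo(spark, fns):
    """Build the test DataFrame, run the round trip, and collect column `c`."""
    df = spark.createDataFrame(ROWS, schema="a int, b int, c array<double>")
    return [row.c for row in round_trip(df, fns).collect()]
```

In the shell session above, `demo(spark, CF)` would be expected to give back the arrays from `ROWS`, since the conversion to a vector and back should not change the values.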


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

