zhengruifeng opened a new pull request, #40432: URL: https://github.com/apache/spark/pull/40432
### What changes were proposed in this pull request?
Implement the ML functions `array_to_vector` and `vector_to_array` in Spark Connect (`pyspark.sql.connect.ml.functions`).

### Why are the changes needed?
Function parity with the existing `pyspark.ml.functions` API.

### Does this PR introduce _any_ user-facing change?
Yes, it adds new functions.

### How was this patch tested?
Added unit tests and checked manually:

```
(spark_dev) ➜ spark git:(connect_ml_functions) ✗ bin/pyspark --remote "local[*]"
Python 3.9.16 (main, Mar 8 2023, 04:29:24)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.11.0 -- An enhanced Interactive Python. Type '?' for help.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/15 11:56:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0.dev0
      /_/

Using Python version 3.9.16 (main, Mar 8 2023 04:29:24)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.

In [1]: query = """
   ...: SELECT * FROM VALUES
   ...: (1, 4, ARRAY(1.0, 2.0, 3.0)),
   ...: (1, 2, ARRAY(-1.0, -2.0, -3.0))
   ...: AS tab(a, b, c)
   ...: """

In [2]: cdf = spark.sql(query)

In [3]: from pyspark.sql.connect.ml import functions as CF

In [4]: cdf1 = cdf.select("a", CF.array_to_vector(cdf.c).alias("d"))

In [5]: cdf1.show()
+---+----------------+
|  a|               d|
+---+----------------+
|  1|   [1.0,2.0,3.0]|
|  1|[-1.0,-2.0,-3.0]|
+---+----------------+

In [6]: cdf1.schema
Out[6]: StructType([StructField('a', IntegerType(), False), StructField('d', VectorUDT(), True)])

In [7]: cdf1.select(CF.vector_to_array(cdf1.d))
Out[7]: DataFrame[UDF(d): array<double>]

In [8]: cdf1.select(CF.vector_to_array(cdf1.d)).show()
+------------------+
|            UDF(d)|
+------------------+
|   [1.0, 2.0, 3.0]|
|[-1.0, -2.0, -3.0]|
+------------------+

In [9]: cdf1.select(CF.vector_to_array(cdf1.d)).schema
Out[9]: StructType([StructField('UDF(d)', ArrayType(DoubleType(), False), False)])
```
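
For readers without a Spark Connect server at hand, the round trip in the session above can be modelled in plain Python. This is only a sketch of the observable semantics, not the implementation in this PR: `DenseVec` and `SparseVec` are hypothetical stand-ins for MLlib vectors, and the two functions below are local models that merely share names with the Spark functions. The sparse case is an assumption based on the documented behaviour of `pyspark.ml.functions.vector_to_array`, which returns dense arrays.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Union


@dataclass
class DenseVec:
    """Hypothetical stand-in for an MLlib dense vector."""
    values: List[float]


@dataclass
class SparseVec:
    """Hypothetical stand-in for an MLlib sparse vector."""
    size: int
    entries: Dict[int, float]  # index -> non-zero value


def array_to_vector(values: Sequence[float]) -> DenseVec:
    # Local model: every element is coerced to a double in a dense vector.
    return DenseVec([float(v) for v in values])


def vector_to_array(vec: Union[DenseVec, SparseVec]) -> List[float]:
    # Local model: the result is always a dense list of doubles.
    if isinstance(vec, SparseVec):
        # Indices absent from a sparse vector are expanded to 0.0.
        return [vec.entries.get(i, 0.0) for i in range(vec.size)]
    return list(vec.values)


# Same rows as column `c` in the manual check.
rows = [[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]]
round_tripped = [vector_to_array(array_to_vector(r)) for r in rows]
print(round_tripped)
print(vector_to_array(SparseVec(4, {1: 5.0})))
```

The round trip returns the original arrays unchanged, which matches what `In [5]` and `In [8]` show for the Connect functions.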