Henry Davidge created SPARK-28140:
-------------------------------------
Summary: Pyspark API to create spark.mllib RowMatrix from DataFrame
Key: SPARK-28140
URL: https://issues.apache.org/jira/browse/SPARK-28140
Project: Spark
Issue Type: Improvement
Components: MLlib, PySpark
Affects Versions: 3.0.0
Reporter: Henry Davidge
Since many functions are only implemented in spark.mllib, it is often necessary
to convert DataFrames of spark.ml vectors to spark.mllib distributed matrix
formats. The first step, converting the spark.ml vectors to their spark.mllib
equivalents, is straightforward. However, to the best of my knowledge, it is
not possible to convert the resulting DataFrame to a RowMatrix without using a
Python lambda function, which can incur a significant performance penalty. In
my recent use case, SVD took 3.5 minutes using the Scala API but 12 minutes
using Python.
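For illustration, the current workaround described above might look roughly
like the sketch below. The helper name df_to_row_matrix and the default column
name "features" are hypothetical and not part of Spark; the sketch assumes the
DataFrame holds spark.ml vectors in that column.

```python
def df_to_row_matrix(df, vector_col="features"):
    """Build a spark.mllib RowMatrix from a DataFrame column of spark.ml vectors.

    Hypothetical helper for illustration only; not part of the PySpark API.
    """
    # Imported inside the function so the helper can be defined
    # without a Spark installation being loaded at import time.
    from pyspark.mllib.linalg.distributed import RowMatrix
    from pyspark.mllib.util import MLUtils

    # Step 1: convert the spark.ml vectors to spark.mllib vectors.
    # This conversion is done on the JVM side and is the straightforward part.
    mllib_df = MLUtils.convertVectorColumnsFromML(df.select(vector_col), vector_col)

    # Step 2: pull the vector out of each Row with a Python lambda.
    # Every row passes through the Python workers here, which is the
    # step the issue identifies as the performance bottleneck.
    rows = mllib_df.rdd.map(lambda row: row[0])

    return RowMatrix(rows)
```

With a DataFrame df in hand, the SVD from the use case above would then be
something like df_to_row_matrix(df).computeSVD(k), where k is the number of
singular values to keep.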

To get around this performance hit, I propose adding a constructor to the
PySpark RowMatrix class that accepts a DataFrame with a single column of
spark.mllib vectors. I'd be happy to add an equivalent API for IndexedRowMatrix
if there is demand.
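As a rough sketch of how the proposed API might be used: the call
RowMatrix(mllib_df) below assumes the DataFrame-accepting constructor proposed
in this issue exists, so it should be read as an illustration of the proposal
rather than of current behavior. The helper name and the default column name
are hypothetical.

```python
def df_to_row_matrix_proposed(df, vector_col="features"):
    """Sketch of the proposed usage; assumes RowMatrix accepts a DataFrame.

    Hypothetical helper for illustration only; not part of the PySpark API.
    """
    # Imported inside the function so the helper can be defined
    # without a Spark installation being loaded at import time.
    from pyspark.mllib.linalg.distributed import RowMatrix
    from pyspark.mllib.util import MLUtils

    # Convert the spark.ml vectors to spark.mllib vectors, keeping
    # only the single vector column the proposed constructor expects.
    mllib_df = MLUtils.convertVectorColumnsFromML(df.select(vector_col), vector_col)

    # ASSUMPTION: this is the constructor proposed in this issue.
    # The DataFrame is handed over directly, with no Python lambda
    # and therefore no per-row round trip through Python workers.
    return RowMatrix(mllib_df)
```

The point of the design is that the rows never need to be mapped in Python, so
the conversion cost should be closer to that of the Scala API.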