[ https://issues.apache.org/jira/browse/SPARK-49907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot reassigned SPARK-49907:
--------------------------------------
Assignee: (was: Apache Spark)
> [Connect] Support spark.ml on Connect
> -------------------------------------
>
> Key: SPARK-49907
> URL: https://issues.apache.org/jira/browse/SPARK-49907
> Project: Spark
> Issue Type: New Feature
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Bobby Wang
> Priority: Major
> Labels: pull-request-available
>
> Since Apache Spark 3.4, Spark has supported Spark Connect, which introduced
> a decoupled client-server architecture that allows remote connectivity to
> Spark clusters using the DataFrame API, with unresolved logical plans as the
> protocol. The separation between client and server allows Spark and its open
> ecosystem to be used from anywhere: it can be embedded in modern data
> applications, IDEs, notebooks, and programming languages.
> However, Spark Connect currently supports only Spark SQL, which means Spark
> ML cannot run training or inference through Spark Connect. This will probably
> cost Spark some ML users.
> So I would like to propose a way to support Spark ML on Connect. Users
> would not need to change their code to run Spark ML workloads over Connect.
> Relevant links:
> Design doc: [Support spark.ml on
> Connect|https://docs.google.com/document/d/1EUvSZuI-so83cxb_fTVMoz0vUfAaFmqXt39yoHI-D9I/edit?usp=sharing]
>
> Draft PR: https://github.com/wbo4958/spark/pull/5
> Example code:
> {code:python}
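> from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
> from pyspark.ml.linalg import Vectors
> from pyspark.sql import SparkSession
>
> # Connect to a remote Spark Connect server.
> spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
>
> # Tiny binary-classification training set.
> df = spark.createDataFrame([
>     (Vectors.dense([1.0, 2.0]), 1),
>     (Vectors.dense([2.0, -1.0]), 1),
>     (Vectors.dense([-3.0, -2.0]), 0),
>     (Vectors.dense([-1.0, -2.0]), 0),
> ], schema=['features', 'label'])
>
> lr = LogisticRegression()
> lr.setMaxIter(30)
> model: LogisticRegressionModel = lr.fit(df)
> assert model.getMaxIter() == 30
>
> # Raw prediction for a single feature vector.
> x = model.predictRaw(Vectors.dense([1.0, 2.0]))
> print(f"predictRaw {x}")
>
> # Training summary metrics.
> summary = model.summary
> summary.roc.show()
> print(summary.weightedRecall)
> print(summary.recallByLabel)
>
> # Fitted parameters and batch inference.
> print(model.coefficients)
> print(model.intercept)
> model.transform(df).show()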
> {code}
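
As an aside on the architecture described in the issue: the client only builds a description of the computation, and the server executes it. A toy sketch of that idea, in plain Python with no Spark dependency, is below. Everything in it (`ToyClientFrame`, `toy_server_execute`, the JSON wire format, the `points` table) is hypothetical and for illustration only; Spark Connect itself encodes plans as protobuf messages sent over gRPC.

```python
import json


class ToyClientFrame:
    """Hypothetical client-side frame: records operations, never touches data."""

    def __init__(self, plan):
        self.plan = plan

    @classmethod
    def table(cls, name):
        return cls({"op": "read", "table": name})

    def filter(self, column, minimum):
        # Wrap the current plan in a new node; nothing is evaluated here.
        return ToyClientFrame(
            {"op": "filter", "column": column, "min": minimum, "child": self.plan}
        )

    def select(self, *columns):
        return ToyClientFrame(
            {"op": "select", "columns": list(columns), "child": self.plan}
        )

    def to_wire(self):
        # JSON is an assumption made for readability, not the real wire format.
        return json.dumps(self.plan)


def toy_server_execute(wire, tables):
    """Hypothetical server side: decode the plan and evaluate it bottom-up."""

    def run(node):
        if node["op"] == "read":
            return tables[node["table"]]
        rows = run(node["child"])
        if node["op"] == "filter":
            return [r for r in rows if r[node["column"]] >= node["min"]]
        if node["op"] == "select":
            return [{c: r[c] for c in node["columns"]} for r in rows]
        raise ValueError(f"unknown op: {node['op']}")

    return run(json.loads(wire))


# Data lives only on the "server"; the "client" sends just the plan string.
tables = {"points": [{"x": 1.0, "label": 1}, {"x": -3.0, "label": 0}]}
request = ToyClientFrame.table("points").filter("x", 0.0).select("label").to_wire()
result = toy_server_execute(request, tables)
print(result)
```

Because the client side only ever produces a serialized plan, it needs no execution engine of its own. The issue proposes extending this client-server split from the DataFrame API to spark.ml.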