Hi all,
There are ways to do this through the `addArtifacts` API in an existing
session, but for that we need to have the dependencies properly packaged
(zipped/gzipped) up front. And when the client and server are on a different
kernel/OS, I believe it won't work either, at least for packages with native
extensions. What I am interested in is doing some sort of `pip install
<package>` on the cluster from my client.
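
For reference, here is roughly what I mean by the `addArtifacts` route (just
a minimal sketch; the remote URL and package names are placeholders, and it
only really helps for pure-Python dependencies):

```python
# Sketch: ship pre-packaged pure-Python deps via addArtifacts.
# Anything with compiled extensions built on my client would still
# break if the server runs a different kernel/OS, which is my concern.
import shutil
import subprocess
import sys

from pyspark.sql import SparkSession

# Placeholder Spark Connect URL.
spark = SparkSession.builder.remote("sc://host:15002").getOrCreate()

# Install the dependencies into a local folder and zip them up.
subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "pandas==1.5.3", "--target", "deps"]
)
shutil.make_archive("deps", "zip", "deps")

# Ship the zip to the session; pyfile=True puts it on the Python
# workers' import path on the cluster.
spark.addArtifacts("deps.zip", pyfile=True)
```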

I came across this Databricks video, Dependency management in Spark Connect
<https://youtu.be/PbvIak6Z8eI?feature=shared&t=679>, which mentions the
following functionality, but I don't see it in the master branch
<https://github.com/apache/spark/blob/master/python/pyspark/sql/connect/udf.py>.
Is it only supported in Databricks, with no plans to open source it in the
near future?

```
@udf(packages=["pandas==1.5.3", "pyarrow"])
def myudf():
    import pandas
    return pandas.__version__
```
-----
I had another question about extending the Spark Connect client (and server)
itself, in case I want to add a new Spark Connect gRPC API. Is there a way to
add an additional proto in my own package (one that extends SparkSession from
pyspark)? I looked into Spark Connect plugins, but they only allow modifying
the plan and so on, not adding a new API.
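
To make the question concrete, something along these lines is what I imagine
(purely a sketch; `my_extension_pb2*` and the `InstallPackages` RPC are
hypothetical stubs generated from my own proto, served by a custom gRPC
service next to the Spark Connect server, not anything in Spark today):

```python
import grpc
from pyspark.sql.connect.session import SparkSession as ConnectSession

# Hypothetical modules generated from my own proto, e.g. via:
#   python -m grpc_tools.protoc -I protos --python_out=. --grpc_python_out=. \
#       protos/my_extension.proto
import my_extension_pb2
import my_extension_pb2_grpc


class MySparkSession:
    """Delegates normal work to the Connect session and adds one extra
    RPC against a custom service (not part of Spark Connect itself)."""

    def __init__(self, connect_url: str, extension_target: str):
        self._spark = ConnectSession.builder.remote(connect_url).getOrCreate()
        self._channel = grpc.insecure_channel(extension_target)
        self._stub = my_extension_pb2_grpc.MyExtensionStub(self._channel)

    def __getattr__(self, name):
        # Fall through to the regular Connect SparkSession for everything else.
        return getattr(self._spark, name)

    def install_packages(self, packages):
        # Custom RPC; the server side would have to implement this itself.
        request = my_extension_pb2.InstallPackagesRequest(packages=packages)
        return self._stub.InstallPackages(request)
```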

Regards,
Deependra
