HyukjinKwon opened a new pull request, #41278:
URL: https://github.com/apache/spark/pull/41278
### What changes were proposed in this pull request?
This PR proposes to add support for Python file artifacts (`.zip`, `.py`, `.jar`, and
`.egg` files) in `SparkSession.addArtifacts`.
### Why are the changes needed?
So that end users can add their Python dependencies from the Python Spark
Connect client.
### Does this PR introduce _any_ user-facing change?
Yes, it adds support for Python file artifacts (`.zip`, `.py`, `.jar`, and `.egg` files)
in `SparkSession.addArtifacts`.
### How was this patch tested?
Manually tested against a `local-cluster` master:
```bash
./sbin/start-connect-server.sh --jars `ls
connector/connect/server/target/**/spark-connect*SNAPSHOT.jar` --master
"local-cluster[2,2,1024]"
./bin/pyspark --remote "sc://localhost:15002"
```
```python
import os
import tempfile
from pyspark.sql.functions import udf
import shutil

with tempfile.TemporaryDirectory() as d:
    # Create a package directory containing a single __init__.py.
    package_path = os.path.join(d, "my_zipfile")
    os.mkdir(package_path)
    pyfile_path = os.path.join(package_path, "__init__.py")
    with open(pyfile_path, "w") as f:
        _ = f.write("my_func = lambda: 5")

    # Zip the package so it can be shipped as an artifact.
    shutil.make_archive(package_path, 'zip', d, "my_zipfile")

    @udf("long")
    def func(x):
        import my_zipfile
        return my_zipfile.my_func()

    spark.addArtifacts(f"{package_path}.zip", pyfile=True)
    spark.range(1).select(func("id")).show()
```
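For reference, the packaging step above can be exercised without Spark at all: `shutil.make_archive(base_name, format, root_dir, base_dir)` produces a zip whose entries are rooted at the package directory, so the archive itself can be placed on `sys.path` and imported as a package. This is the property `pyfile=True` relies on when the artifact is distributed to executors. A minimal standard-library-only sketch (reusing the `my_zipfile` name from the example above):

```python
import os
import shutil
import sys
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Build a package directory with an __init__.py, as in the example above.
    package_path = os.path.join(d, "my_zipfile")
    os.mkdir(package_path)
    with open(os.path.join(package_path, "__init__.py"), "w") as f:
        f.write("my_func = lambda: 5")

    # make_archive writes my_zipfile.zip with entries rooted at my_zipfile/,
    # so the zip file itself is importable via Python's zipimport machinery.
    archive = shutil.make_archive(package_path, "zip", d, "my_zipfile")

    # Simulate what happens on an executor once the zip is on sys.path.
    sys.path.insert(0, archive)
    import my_zipfile
    result = my_zipfile.my_func()
    print(result)  # 5
```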
Also added a couple of unit tests.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]