SemyonSinchenko commented on issue #1079: URL: https://github.com/apache/datafusion-comet/issues/1079#issuecomment-2474442442
The problem here is that `spark.plugins` is a static config, so all plugins must be specified before the PySpark SparkSession is created. If users run a Comet job by submitting a Python script with spark-submit, they have to specify the JAR locations and the plugin manually, and the process is then the same as submitting a JVM Spark job. In my experience it is very rare for users to create a SparkSession inside a Python script and run it as a plain Python file rather than via spark-submit.

The same story applies to a Databricks (or Databricks-like) Python notebook: a SparkSession already exists there, and because `spark.plugins` is static it is impossible to add Comet to the already running session. The only way to run Comet on Databricks is via the so-called `init scripts`, which are executed while the cluster is starting up, before Spark itself starts. In that case users must manually add the Comet JAR to the Spark jars folder and set the plugin config.

Another case is PySpark Connect, but there the Comet JARs have to be present on the server side, so a pip-installable package cannot help anyway...
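For illustration, a minimal sketch of the "set everything before the session exists" case in plain PySpark; the JAR path is hypothetical, and the plugin class name `org.apache.spark.CometPlugin` and the `spark.comet.enabled` flag are taken from the Comet documentation:

```python
from pyspark.sql import SparkSession

# Hypothetical local path to the Comet JAR; adjust for your environment.
comet_jar = "/path/to/comet-spark.jar"

spark = (
    SparkSession.builder
    .appName("comet-example")
    # spark.plugins is a static config: it must be set before the
    # session is created and cannot be added to a running session.
    .config("spark.jars", comet_jar)
    .config("spark.plugins", "org.apache.spark.CometPlugin")
    .config("spark.comet.enabled", "true")
    .getOrCreate()
)
```

In a notebook where the session is pre-created (Databricks and similar), the builder above has no effect, which is exactly why the init-script route is the only option there.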
