rvesse commented on pull request #13599: URL: https://github.com/apache/spark/pull/13599#issuecomment-826717299
Sure, but no approach is going to be perfect. If you want dynamic package resolution, your best option is to run in containers and build that capability into your container entry points somehow. Even then this has drawbacks: every driver/executor needs to download and install packages at startup, which can significantly increase startup time and/or cause application failure if the driver and executors take different amounts of time to start up, leading to connection timeouts. The dynamic approach also fails entirely in air-gapped environments (very common in my $dayjob).

With the "official" Spark approach you only pay that cost once, and you can do so somewhere with the network connectivity needed to download all the packages you require. Once your environment is packaged, that cost is paid and you can reuse it as many times as you need. The only real gotcha is that the OS environment where you build the environment needs to match the OS environment where you run it, or you can find you have OS-dependent packages that won't work.
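For illustration, a rough sketch of that package-once workflow using conda-pack and `spark-submit --archives` (names, versions and paths here are placeholders, not anything from this PR, and the exact env vars to set depend on your deploy mode):

```bash
# Build and pack the environment once, on a machine with network access.
conda create -y -n pyspark_env -c conda-forge python=3.9 numpy pandas conda-pack
conda activate pyspark_env
conda pack -f -o pyspark_env.tar.gz

# Ship the packed tarball with the job; executors unpack it as ./environment
# and use its Python interpreter instead of the system one.
export PYSPARK_DRIVER_PYTHON=python            # local driver (client mode)
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_env.tar.gz#environment app.py
```

The packed tarball bundles native libraries as built, which is exactly why the build OS needs to match the runtime OS as noted above.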
