rvesse commented on pull request #13599: URL: https://github.com/apache/spark/pull/13599#issuecomment-826717299
Sure, but no approach is going to be perfect. If you want dynamic package resolution, your best option is to run in containers and build that capability into your container entry points somehow. Even then this has drawbacks: every driver/executor needs to download and install packages at startup, which can significantly increase startup time and/or cause application failure if the driver and executors take different amounts of time to start up, leading to connection timeouts. The dynamic approach also fails entirely in air-gapped environments (very common in my $dayjob).

With the "official" Spark approach you only pay that cost once, and you can do so somewhere with the network connectivity needed to download all the packages you require. Once your environment is packaged, that cost is paid and you can reuse it as many times as you need. The only real gotcha is that the OS environment where you build the environment needs to match the OS environment where you run it, or you can find you have OS-dependent packages that won't work.
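For illustration, a rough sketch of that package-once workflow using conda-pack and `spark-submit --archives` (names, versions and paths here are placeholders, not anything from this PR, and the exact env vars to set depend on your deploy mode):

```bash
# Build and pack the environment once, on a machine with network access.
conda create -y -n pyspark_env -c conda-forge python=3.9 numpy pandas conda-pack
conda activate pyspark_env
conda pack -f -o pyspark_env.tar.gz

# Ship the packed tarball with the job; executors unpack it as ./environment
# and use its Python interpreter instead of the system one.
export PYSPARK_DRIVER_PYTHON=python            # local driver (client mode)
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_env.tar.gz#environment app.py
```

The packed tarball bundles native libraries as built, which is exactly why the build OS needs to match the runtime OS as noted above.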
