Hi y'all,

Over the past few years a lot of people have told me about their difficulty managing dependencies with PySpark. Some folks and I (not as part of the official Spark project or anything official like that) put together a quick proof-of-concept library called "coffee-boat" to make this a bit easier, and we'd love your early feedback.
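For anyone unfamiliar with the underlying trick: you build an archive of your Python dependencies up front and ship it to the executors (e.g. via SparkContext.addPyFile or --py-files), and Python can then import straight from that archive. Here's a minimal, Spark-free sketch of that idea — the package and function names are made up for illustration, and this is not coffee-boat's actual API:

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny "dependency" package in a temp dir (a stand-in for
# the packages you'd normally install with pip/conda).
build_dir = tempfile.mkdtemp()
pkg_dir = os.path.join(build_dir, "mydep")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("def greet():\n    return 'hello from mydep'\n")

# Package it into a single archive -- the artifact you'd ship to
# executors with sc.addPyFile(archive) or spark-submit --py-files.
archive = os.path.join(build_dir, "deps.zip")
with zipfile.ZipFile(archive, "w") as zf:
    for root, _, files in os.walk(pkg_dir):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, os.path.relpath(path, build_dir))

# On the executor side, Python imports straight from the zip once
# it is on sys.path (this is what addPyFile arranges for you).
sys.path.insert(0, archive)
import mydep
print(mydep.greet())  # hello from mydep
```

The "one last package you forgot" path has to do this after the workers are already running, which is why it's the less efficient route.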
It supports packaging all of your dependencies in advance (the efficient path), as well as adding that one last package you forgot in the middle of your notebook (less efficiently). This _should_ work regardless of the cluster manager (with the exception of local mode, which has very different addFile behaviour), but I've only tested* it on standalone and YARN. My limited testing shows it to be resilient to worker resets/restarts, but I'm sure the real world will come up with more ways to make things fail. It is more than a little hacky (e.g. it depends on sed).

The repo is at https://github.com/nteract/coffee_boat, and we have some starter issues if anyone is looking to contribute: https://github.com/nteract/coffee_boat/issues. If this looks like an OK path forward and we work out some of the kinks, I'll send this over to user@ in a while.

Cheers,

Holden

* For a very loose version of the word "tested"

--
Twitter: https://twitter.com/holdenkarau