Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13599#discussion_r160071888

    --- Diff: python/pyspark/context.py ---
    @@ -980,6 +996,33 @@ def getConf(self):
             conf.setAll(self._conf.getAll())
             return conf

    +    def install_packages(self, packages, install_driver=True):
    +        """
    +        install python packages on all executors and driver through pip
    +        :param packages: string for single package or a list of string for multiple packages
    +        :param install_driver: whether to install packages in client
    +        """
    +        if self._conf.get("spark.pyspark.virtualenv.enabled") != "true":
    +            raise Exception("install_packages can only use called when "
    +                            "spark.pyspark.virtualenv.enabled set as true")
    +        if isinstance(packages, basestring):
    +            packages = [packages]
    +        num_executors = int(self._conf.get("spark.executor.instances"))
    +        dummyRDD = self.parallelize(range(num_executors), num_executors)
    --- End diff --

Right, even without dynamic execution this depends on us continuing to do a uniform distribution of data with `parallelize`, which I don't think is guaranteed (and we have no test that would catch this breaking).
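The concern can be illustrated with a small sketch. The function below (`slice_evenly`, a hypothetical simplification for illustration, not Spark's actual implementation) mimics the even-slicing behavior that `parallelize` currently exhibits: `num_executors` elements split into `num_executors` slices yields one element per partition, hence one install task per partition. The catch the comment raises is that even then, nothing pins each task to a distinct executor.

```python
# Hypothetical sketch of parallelize-style even slicing -- NOT Spark's actual
# code, just an illustration of the assumption install_packages relies on.

def slice_evenly(seq, num_slices):
    """Split seq into num_slices roughly equal contiguous slices."""
    n = len(seq)
    return [seq[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]

# With num_executors elements and num_executors slices, each partition holds
# exactly one element, so there is exactly one pip-install task per partition.
partitions = slice_evenly(list(range(4)), 4)
print(partitions)  # [[0], [1], [2], [3]]

# But one-task-per-partition is not one-task-per-executor: a fast executor can
# pick up several of these tasks while another runs none, so some executors
# may never run the install. Task placement is a scheduler detail, not a
# contract of the parallelize API -- and no test would catch it changing.
```

This is why relying on a dummy RDD of `num_executors` elements is fragile: the code would silently under-install if either the slicing or the task-placement behavior shifted.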