Re: PySpark on PyPi
Auberon, can you also post this to the Jupyter Google Group?

On Wed, Aug 19, 2015 at 12:23 PM, Auberon Lopez auberon.lo...@gmail.com wrote: Hi all, I've created an updated PR for this based off of the previous work of @prabinb: https://github.com/apache/spark/pull/8318 I am not very familiar with python packaging; feedback is appreciated. -Auberon
Re: PySpark on PyPi
On Aug 20, 2015 4:57 PM, Justin Uang wrote: One other question: Do we have consensus on publishing the pip-installable source distribution to PyPI? If so, is that something that the maintainers need to add to the process that they use to publish releases?

A setup.py, .travis.yml, tox.ini (e.g. from a cookiecutter template)?
https://github.com/audreyr/cookiecutter-pypackage
https://wrdrd.com/docs/tools/#python-packages
* scripts=[]
* package_data / MANIFEST.in
* entry_points
* console_scripts
* https://pythonhosted.org/setuptools/setuptools.html#eggsecutable-scripts
...
https://wrdrd.com/docs/consulting/knowledge-engineering#spark
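For concreteness, the checklist items above could map onto a setup.py roughly like the following sketch; the package name, version, Py4J pin, and the console_scripts entry are placeholders rather than anything the project has settled on:

    from setuptools import find_packages, setup

    setup(
        name="pyspark",
        version="1.5.0",  # would need to track the Spark release exactly
        packages=find_packages(),
        # ship the Py4J zip next to the Python sources (package_data / MANIFEST.in)
        package_data={"pyspark": ["lib/py4j-*-src.zip"]},
        install_requires=["py4j==0.8.2.1"],
        entry_points={
            "console_scripts": [
                # hypothetical wrapper that would locate and exec $SPARK_HOME/bin/pyspark
                "pyspark = pyspark.launcher:main",
            ],
        },
    )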
Re: PySpark on PyPi
I would start with just the plain python package without the JAR and then see if it makes sense to add the JAR over time.

On Thu, Aug 20, 2015 at 12:27 PM, Auberon Lopez auberon.lo...@gmail.com wrote: Hi all, I wanted to bubble up a conversation from the PR to this discussion to see if there is support for the idea of including a Spark assembly JAR in a PyPI release of pyspark. @holdenk recommended this as she already does so in the Sparkling Pandas package. Is this something people are interested in pursuing? -Auberon
Re: PySpark on PyPi
I would prefer to just do it without the jar first as well. My hunch is that to run spark the way it is intended, we need the wrapper scripts, like spark-submit. Does anyone know authoritatively if that is the case?

On Thu, Aug 20, 2015 at 4:54 PM Olivier Girardot o.girar...@lateral-thoughts.com wrote: +1 But just to improve the error logging, would it be possible to add some warn logging in pyspark when the SPARK_HOME env variable is pointing to a Spark distribution with a different version from the pyspark package ? Regards, Olivier.
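A rough sketch of the warning Olivier is asking for (not existing Spark code; it assumes the PyPI package would expose pyspark.__version__ and that SPARK_HOME points at a binary distribution containing a RELEASE file):

    import os
    import re
    import warnings

    import pyspark


    def warn_on_version_mismatch():
        spark_home = os.environ.get("SPARK_HOME")
        if not spark_home:
            return
        release_file = os.path.join(spark_home, "RELEASE")
        if not os.path.exists(release_file):
            return
        # the RELEASE file of a binary distribution starts with e.g. "Spark 1.5.0 ..."
        with open(release_file) as f:
            match = re.search(r"Spark (\S+)", f.read())
        if match and match.group(1) != pyspark.__version__:
            warnings.warn("pyspark %s does not match the Spark %s distribution at %s"
                          % (pyspark.__version__, match.group(1), spark_home))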
Re: PySpark on PyPi
One other question: Do we have consensus on publishing the pip-installable source distribution to PyPI? If so, is that something that the maintainers need to add to the process that they use to publish releases? On Thu, Aug 20, 2015 at 5:44 PM Justin Uang justin.u...@gmail.com wrote: I would prefer to just do it without the jar first as well. My hunch is that to run spark the way it is intended, we need the wrapper scripts, like spark-submit. Does anyone know authoritatively if that is the case? On Thu, Aug 20, 2015 at 4:54 PM Olivier Girardot o.girar...@lateral-thoughts.com wrote: +1 But just to improve the error logging, would it be possible to add some warn logging in pyspark when the SPARK_HOME env variable is pointing to a Spark distribution with a different version from the pyspark package ? Regards, Olivier. 2015-08-20 22:43 GMT+02:00 Brian Granger elliso...@gmail.com: I would start with just the plain python package without the JAR and then see if it makes sense to add the JAR over time. On Thu, Aug 20, 2015 at 12:27 PM, Auberon Lopez auberon.lo...@gmail.com wrote: Hi all, I wanted to bubble up a conversation from the PR to this discussion to see if there is support the idea of including a Spark assembly JAR in a PyPI release of pyspark. @holdenk recommended this as she already does so in the Sparkling Pandas package. Is this something people are interesting in pursuing? -Auberon On Thu, Aug 20, 2015 at 10:03 AM, Brian Granger elliso...@gmail.com wrote: Auberon, can you also post this to the Jupyter Google Group? On Wed, Aug 19, 2015 at 12:23 PM, Auberon Lopez auberon.lo...@gmail.com wrote: Hi all, I've created an updated PR for this based off of the previous work of @prabinb: https://github.com/apache/spark/pull/8318 I am not very familiar with python packaging; feedback is appreciated. -Auberon On Mon, Aug 10, 2015 at 12:45 PM, MinRK benjami...@gmail.com wrote: On Mon, Aug 10, 2015 at 12:28 PM, Matt Goodman meawo...@gmail.com wrote: I would tentatively suggest also conda packaging. A conda package has the advantage that it can be set up without 'installing' the pyspark files, while the PyPI packaging is still being worked out. It can just add a pyspark.pth file pointing to pyspark, py4j locations. But I think it's a really good idea to package with conda. -MinRK http://conda.pydata.org/docs/ --Matthew Goodman = Check Out My Website: http://craneium.net Find me on LinkedIn: http://tinyurl.com/d6wlch On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu dav...@databricks.com wrote: I think so, any contributions on this are welcome. On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger elliso...@gmail.com wrote: Sorry, trying to follow the context here. Does it look like there is support for the idea of creating a setup.py file and pypi package for pyspark? Cheers, Brian On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu dav...@databricks.com wrote: We could do that after 1.5 released, it will have same release cycle as Spark in the future. On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: +1 (once again :) ) 2015-07-28 14:51 GMT+02:00 Justin Uang justin.u...@gmail.com: // ping do we have any signoff from the pyspark devs to submit a PR to publish to PyPI? On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman freeman.jer...@gmail.com wrote: Hey all, great discussion, just wanted to +1 that I see a lot of value in steps that make it easier to use PySpark as an ordinary python library. 
You might want to check out this (https://github.com/minrk/findspark), started by Jupyter project devs, that offers one way to facilitate this stuff. I’ve also cced them here to join the conversation. Also, @Jey, I can also confirm that at least in some scenarios (I’ve done it in an EC2 cluster in standalone mode) it’s possible to run PySpark jobs just using `from pyspark import SparkContext; sc = SparkContext(master=“X”)` so long as the environmental variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly on *both* workers and driver. That said, there’s definitely additional configuration / functionality that would require going through the proper submit scripts. On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: I agree with everything Justin just said. An additional advantage of publishing PySpark's Python code in a standards-compliant way is the fact that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way that pip can use.
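As a concrete sketch of the pyspark.pth approach MinRK mentions above: lines of a .pth file dropped into site-packages are appended to sys.path at interpreter startup, so a small script like the following (paths resolved from SPARK_HOME; the py4j zip name varies by Spark version) makes the distribution's pyspark importable without copying anything:

    import glob
    import os
    import site

    spark_home = os.environ["SPARK_HOME"]
    spark_python = os.path.join(spark_home, "python")
    # the py4j zip that ships under python/lib/ in the Spark distribution
    py4j = glob.glob(os.path.join(spark_python, "lib", "py4j-*-src.zip"))[0]

    # write a pyspark.pth into site-packages; each line becomes a sys.path entry
    site_packages = site.getsitepackages()[0]
    with open(os.path.join(site_packages, "pyspark.pth"), "w") as f:
        f.write(spark_python + "\n" + py4j + "\n")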
Re: PySpark on PyPi
I've helped to build conda-installable spark packages in the past. You can find an older recipe here: https://github.com/conda/conda-recipes/tree/master/spark And I've been updating packages here: https://anaconda.org/anaconda-cluster/spark

`conda install -c anaconda-cluster spark`

The above should work for OSX/Linux-64 and py27/py34.

--Ben
Re: PySpark on PyPi
Matt Goodman wrote:
I would tentatively suggest also conda packaging. http://conda.pydata.org/docs/

$ conda skeleton pypi pyspark
# update git_tag and git_uri
# add test commands (import pyspark; import pyspark.[...])

Docs for building conda packages for multiple operating systems and interpreters from PyPI packages:
* http://www.pydanny.com/building-conda-packages-for-multiple-operating-systems.html
* https://github.com/audreyr/cookiecutter/issues/232
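The "add test commands" step above can be as simple as an import smoke test run from the recipe's test section; the module list here is illustrative:

    # run inside the built conda environment to verify the package imports
    import pyspark
    import pyspark.rdd
    import pyspark.sql

    print(pyspark.__file__)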
Re: PySpark on PyPi
I think so, any contributions on this are welcome.

On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger elliso...@gmail.com wrote: Sorry, trying to follow the context here. Does it look like there is support for the idea of creating a setup.py file and pypi package for pyspark? Cheers, Brian
Re: PySpark on PyPi
I would tentatively suggest also conda packaging. http://conda.pydata.org/docs/

--Matthew Goodman
=
Check Out My Website: http://craneium.net
Find me on LinkedIn: http://tinyurl.com/d6wlch
Re: PySpark on PyPi
We could do that after 1.5 is released; it will have the same release cycle as Spark in the future.

On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: +1 (once again :) )
Re: PySpark on PyPi
// ping do we have any signoff from the pyspark devs to submit a PR to publish to PyPI?
Re: PySpark on PyPi
Hey all, great discussion, just wanted to +1 that I see a lot of value in steps that make it easier to use PySpark as an ordinary python library.

You might want to check out findspark (https://github.com/minrk/findspark), started by Jupyter project devs, which offers one way to facilitate this stuff. I've also cc'ed them here to join the conversation.

Also, @Jey, I can confirm that at least in some scenarios (I've done it in an EC2 cluster in standalone mode) it's possible to run PySpark jobs just using `from pyspark import SparkContext; sc = SparkContext(master="X")` so long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly on *both* workers and driver. That said, there's definitely additional configuration / functionality that would require going through the proper submit scripts.
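For reference, the findspark workflow Jeremy mentions looks roughly like this: findspark.init() resolves SPARK_HOME (or an explicit path) and prepends the distribution's python/ and py4j paths to sys.path before pyspark is imported. The example job below is illustrative:

    import findspark

    findspark.init()  # or findspark.init("/path/to/spark") if SPARK_HOME is not set

    from pyspark import SparkContext

    sc = SparkContext(master="local[4]", appName="findspark-demo")
    print(sc.parallelize(range(100)).sum())
    sc.stop()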
Re: PySpark on PyPi
// + *Davies* for his comments
// + Punya for SA

For development and CI, like Olivier mentioned, I think it would be hugely beneficial to publish pyspark (only the code in the python/ dir) on PyPI. If anyone wants to develop against PySpark APIs, they need to download the distribution and do a lot of PYTHONPATH munging for all the tools (pylint, pytest, IDE code completion). Right now that involves adding python/ and python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more dependencies, we would have to manually mirror all the PYTHONPATH munging in the ./pyspark script. With a proper pyspark setup.py which declares its dependencies, and a published distribution, depending on pyspark will just be adding pyspark to my setup.py dependencies.

Of course, if we actually want to run the parts of pyspark that are backed by Py4J calls, then we need the full spark distribution with either ./pyspark or ./spark-submit, but for things like linting and development, the PYTHONPATH munging is very annoying.

I don't think the version-mismatch issues are a compelling reason not to go ahead with PyPI publishing. At runtime, we should definitely enforce that the versions match exactly, which means there is no backcompat nightmare as suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267. This would mean that even if the user's pip-installed pyspark somehow got loaded before the pyspark provided by the spark distribution, the user would be alerted immediately. *Davies*, if you buy this, should I or someone on my team pick up https://issues.apache.org/jira/browse/SPARK-1267 and https://github.com/apache/spark/pull/464?

On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot o.girar...@lateral-thoughts.com wrote: Ok, I get it. Now what can we do to improve the current situation? Because right now if I want to set up a CI env for PySpark, I have to:
1- download a pre-built version of pyspark and unzip it somewhere on every agent
2- define the SPARK_HOME env variable
3- symlink this distribution's pyspark dir inside the python install's site-packages/ directory
and if I rely on additional packages (like databricks' Spark-CSV project), I have to (except if I'm mistaken)
4- compile/assemble spark-csv and deploy the jar in a specific directory on every agent
5- add this jar-filled directory to the Spark distribution's additional classpath using the conf/spark-defaults file
Then finally we can launch our unit/integration-tests. Some issues are related to spark-packages, some to the lack of python-based dependency declarations, and some to the way SparkContexts are launched when using pyspark. I think steps 1 and 2 are fair enough. Steps 4 and 5 may already have solutions, I didn't check, and considering spark-shell is downloading such dependencies automatically, I think if nothing's done yet it will be (I guess?). For step 3, maybe just adding a setup.py to the distribution would be enough; I'm not exactly advocating distributing a full 300 MB spark distribution on PyPI, maybe there's a better compromise? Regards, Olivier.

On Fri, Jun 5, 2015 at 22:12, Jey Kottalam j...@cs.berkeley.edu wrote: Couldn't we have a pip-installable pyspark package that just serves as a shim to an existing Spark installation? Or it could even download the latest Spark binary if SPARK_HOME isn't set during installation. Right now, Spark doesn't play very well with the usual Python ecosystem. For example, why do I need to use a strange incantation when booting up IPython if I want to use PySpark in a notebook with MASTER=local[4]? It would be much nicer to just type `from pyspark import SparkContext; sc = SparkContext("local[4]")` in my notebook. I did a test and it seems like PySpark's basic unit-tests do pass when SPARK_HOME is set and Py4J is on the PYTHONPATH:
PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py
-Jey

On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen rosenvi...@gmail.com wrote: This has been proposed before: https://issues.apache.org/jira/browse/SPARK-1267 There's currently tighter coupling between the Python and Java halves of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet we'd run into tons of issues when users try to run a newer version of the Python half of PySpark against an older set of Java components or vice versa.

On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi everyone, Considering the python API as just a front end needing SPARK_HOME defined anyway, I think it would be interesting to deploy the Python part of Spark on PyPI in order to handle the dependencies in a Python project needing PySpark via pip. For now I just symlink the python/pyspark dir into my python install's site-packages/ directory in order for PyCharm or other lint tools to work properly. I can do the setup.py work or anything. What do you think? Regards,
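Going back to Jey's shim idea quoted above, a rough sketch (not actual Spark code) of what such a pip-installable shim could do at import time; the download-if-SPARK_HOME-is-unset behaviour is omitted, and the module layout is purely illustrative:

    import glob
    import os
    import sys


    def _add_spark_to_path():
        spark_home = os.environ.get("SPARK_HOME")
        if not spark_home:
            raise ImportError("SPARK_HOME is not set; cannot locate a Spark installation")
        spark_python = os.path.join(spark_home, "python")
        py4j = glob.glob(os.path.join(spark_python, "lib", "py4j-*-src.zip"))
        # prepend so the distribution's pyspark wins over anything else on the path
        sys.path[:0] = [spark_python] + py4j


    _add_spark_to_path()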
Re: PySpark on PyPi
I agree with everything Justin just said. An additional advantage of publishing PySpark's Python code in a standards-compliant way is that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way that pip can use. Contrast this with the current situation, where df.toPandas() exists in the Spark API but doesn't actually work until you install Pandas.

Punya
Re: PySpark on PyPi
This has been proposed before: https://issues.apache.org/jira/browse/SPARK-1267

There's currently tighter coupling between the Python and Java halves of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet we'd run into tons of issues when users try to run a newer version of the Python half of PySpark against an older set of Java components, or vice versa.

On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:

Hi everyone, since the Python API is essentially a front end that needs SPARK_HOME defined anyway, I think it would be interesting to publish the Python part of Spark on PyPI so that a Python project needing PySpark can handle the dependency via pip. For now I just symlink python/pyspark into my Python installation's site-packages/ directory so that PyCharm and other lint tools work properly. I can do the setup.py work, or anything else. What do you think? Regards, Olivier.
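That coupling concern could at least be made to fail fast with the exact-version check Justin argues for above. A rough sketch of the idea follows; the helper function and the source of the Python-side version string are hypothetical, not existing PySpark code:

# Sketch of an exact-version guard; assumes the pip package would ship its own
# version string (e.g. in a pyspark/version.py) -- that module is hypothetical.
def check_version(sc, python_side_version):
    """Fail fast if the pip-installed Python code and the JVM side disagree."""
    jvm_side_version = sc.version  # version reported by the Java/Scala half
    if jvm_side_version != python_side_version:
        raise RuntimeError(
            "PySpark %s does not match the Spark JVM version %s; "
            "install the matching pyspark release or fix your PYTHONPATH."
            % (python_side_version, jvm_side_version))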
Re: PySpark on PyPi
Couldn't we have a pip-installable pyspark package that just serves as a shim to an existing Spark installation? Or it could even download the latest Spark binary if SPARK_HOME isn't set during installation. Right now, Spark doesn't play very well with the usual Python ecosystem. For example, why do I need to use a strange incantation when booting up IPython if I want to use PySpark in a notebook with MASTER=local[4]? It would be much nicer to just type `from pyspark import SparkContext; sc = SparkContext("local[4]")` in my notebook.

I did a test and it seems like PySpark's basic unit tests do pass when SPARK_HOME is set and Py4J is on the PYTHONPATH:

PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py

-Jey
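A rough sketch of what that shim could do at import time: find an existing Spark installation via SPARK_HOME and put its Python code plus the bundled Py4J zip on sys.path, so a plain `from pyspark import SparkContext` works afterwards. The function name and the glob pattern for the Py4J zip are assumptions for illustration:

# shim sketch -- locate an existing Spark install and expose its Python code.
# The function name is hypothetical, not an existing package.
import glob
import os
import sys

def init_spark(spark_home=None):
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set and no spark_home was given")
    python_dir = os.path.join(spark_home, "python")
    # the bundled Py4J zip name varies by Spark release, so glob for it
    py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip"))
    if not py4j_zips:
        raise RuntimeError("could not find the bundled py4j under %s" % python_dir)
    for path in [python_dir, py4j_zips[0]]:
        if path not in sys.path:
            sys.path.insert(0, path)

# usage sketch:
# init_spark()
# from pyspark import SparkContext
# sc = SparkContext("local[4]")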
Re: PySpark on PyPi
Ok, I get it. Now what can we do to improve the current situation? Right now, if I want to set up a CI environment for PySpark, I have to:

1- download a pre-built version of pyspark and unzip it somewhere on every agent
2- define the SPARK_HOME env variable
3- symlink this distribution's pyspark dir into the Python installation's site-packages/ directory

and if I rely on additional packages (like databricks' Spark-CSV project), I also have to (unless I'm mistaken):

4- compile/assemble spark-csv and deploy the jar to a specific directory on every agent
5- add this jar-filled directory to the Spark distribution's additional classpath using the conf/spark-defaults.conf file

Then, finally, we can launch our unit/integration tests. Some of these issues are related to spark-packages, some to the lack of a Python-declared dependency, and some to the way SparkContexts are launched when using pyspark. I think steps 1 and 2 are fair enough. Steps 4 and 5 may already have solutions; I didn't check, and considering that spark-shell downloads such dependencies automatically, I expect pyspark will too if it doesn't already. For step 3, maybe just adding a setup.py to the distribution would be enough. I'm not advocating distributing a full 300 MB Spark distribution on PyPI; maybe there's a better compromise?

Regards, Olivier.
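For the final test-launching step, one possible shape on the CI side (assuming steps 1-3 above are done and SPARK_HOME/PYTHONPATH are set on the agent) is a pytest fixture that builds a local SparkContext once per session, so the unit/integration tests themselves stay free of environment plumbing. The fixture and app names here are arbitrary sketch choices:

# conftest.py sketch for a CI agent -- assumes pyspark is already importable
# via the steps above; the fixture name and app name are illustrative.
import pytest
from pyspark import SparkContext

@pytest.fixture(scope="session")
def sc(request):
    context = SparkContext("local[4]", appName="ci-tests")
    request.addfinalizer(context.stop)  # shut the context down after the session
    return context

# a test can then just take the fixture as an argument:
# def test_word_count(sc):
#     assert sc.parallelize(["a", "b", "a"]).countByValue()["a"] == 2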