westurner wrote:
> Matt Goodman wrote:
>> I would tentatively suggest also conda packaging.
>>
>> http://conda.pydata.org/docs/
>
> $ conda skeleton pypi pyspark
> # update git_tag and git_uri
> # add test commands (import pyspark; import pyspark.[...])
>
> Docs for building conda packages for multiple operating systems and
> interpreters from PyPI packages:
>
> * http://www.pydanny.com/building-conda-packages-for-multiple-operating-systems.html
> * https://github.com/audreyr/cookiecutter/issues/232

* The conda meta.yaml can also specify test scripts (e.g. a test.sh) that
  should return 0.
  Docs: http://conda.pydata.org/docs/building/meta-yaml.html#test-section
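For concreteness, a test section along those lines might look roughly like
this (the module names are illustrative; see the meta-yaml docs linked
above for the full schema):

    test:
      # modules that must import cleanly
      imports:
        - pyspark
        - pyspark.sql
      # shell commands that must exit with status 0
      commands:
        - python -c "import pyspark"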
On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <davies@> wrote:
> I think so, any contributions on this are welcome.

On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <ellisonbg@> wrote:
> Sorry, trying to follow the context here. Does it look like there is
> support for the idea of creating a setup.py file and PyPI package for
> pyspark?
>
> Cheers,
>
> Brian

On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <davies@> wrote:
> We could do that after 1.5 is released; it will have the same release
> cycle as Spark in the future.

On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot <o.girardot@> wrote:
> +1 (once again :) )

2015-07-28 14:51 GMT+02:00 Justin Uang <justin.uang@>:
> // ping
>
> Do we have any signoff from the pyspark devs to submit a PR to publish
> to PyPI?

On Fri, Jul 24, 2015 at 10:50 PM, Jeremy Freeman <freeman.jeremy@> wrote:
> Hey all, great discussion, just wanted to +1 that I see a lot of value
> in steps that make it easier to use PySpark as an ordinary Python
> library.
>
> You might want to check out findspark
> (https://github.com/minrk/findspark), started by Jupyter project devs,
> which offers one way to facilitate this stuff. I've also cc'ed them
> here to join the conversation.
>
> Also, @Jey, I can confirm that at least in some scenarios (I've done it
> in an EC2 cluster in standalone mode) it's possible to run PySpark jobs
> just using `from pyspark import SparkContext; sc =
> SparkContext(master="X")`, so long as the environment variables
> (PYTHONPATH and PYSPARK_PYTHON) are set correctly on *both* workers and
> driver. That said, there's definitely additional configuration /
> functionality that would require going through the proper submit
> scripts.

On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <punya.biswal@> wrote:
> I agree with everything Justin just said. An additional advantage of
> publishing PySpark's Python code in a standards-compliant way is that
> we'll be able to declare transitive dependencies (Pandas, Py4J) in a
> way that pip can use. Contrast this with the current situation, where
> df.toPandas() exists in the Spark API but doesn't actually work until
> you install Pandas.
>
> Punya
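As a sketch of the standards-compliant packaging Punya describes (the
name, version, and pins below are assumptions for illustration, not an
actual published package):

    # setup.py -- illustrative sketch; name, version, and pins are assumptions
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.5.0",          # would have to match the Spark release exactly
        packages=find_packages(),
        install_requires=[
            "py4j==0.8.2.1",      # the bundled Java bridge, pinned
        ],
        extras_require={
            # optional extra so df.toPandas() works out of the box:
            #   pip install pyspark[pandas]
            "pandas": ["pandas"],
        },
    )

With something like this published, `pip install pyspark` would pull in
Py4J automatically, which is exactly the transitive-dependency handling
that pip cannot do today.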
On Wed, Jul 22, 2015 at 12:49 PM, Justin Uang <justin.uang@> wrote:
> // + Davies for his comments
> // + Punya for SA
>
> For development and CI, like Olivier mentioned, I think it would be
> hugely beneficial to publish pyspark (only the code in the python/ dir)
> on PyPI. If anyone wants to develop against PySpark APIs, they need to
> download the distribution and do a lot of PYTHONPATH munging for all
> the tools (pylint, pytest, IDE code completion). Right now that
> involves adding python/ and python/lib/py4j-0.8.2.1-src.zip. If pyspark
> ever wants to add more dependencies, we would have to manually mirror
> all the PYTHONPATH munging in the ./pyspark script. With a proper
> pyspark setup.py that declares its dependencies, and a published
> distribution, depending on pyspark would just be a matter of adding
> pyspark to my setup.py dependencies.
>
> Of course, if we actually want to run the parts of pyspark that are
> backed by Py4J calls, then we need the full Spark distribution with
> either ./pyspark or ./spark-submit, but for things like linting and
> development, the PYTHONPATH munging is very annoying.
>
> I don't think the version-mismatch issues are a compelling reason not
> to go ahead with PyPI publishing. At runtime, we should definitely
> enforce that the version has to be exact, which means there is no
> backcompat nightmare as suggested by Davies in
> https://issues.apache.org/jira/browse/SPARK-1267. This would mean that
> even if the user's pip-installed pyspark somehow got loaded before the
> pyspark provided by the Spark distribution, the user would be alerted
> immediately.
>
> Davies, if you buy this, should I or someone on my team pick up
> https://issues.apache.org/jira/browse/SPARK-1267 and
> https://github.com/apache/spark/pull/464?
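To make the enforcement Justin describes concrete, here is a sketch of
such a runtime guard (hypothetical, not an existing PySpark API; it
assumes the binary distribution's RELEASE file names the Spark version):

    # Hypothetical guard -- not an existing PySpark API.
    import os

    __version__ = "1.5.0"  # version of the pip-installed Python half

    def _check_spark_version():
        spark_home = os.environ.get("SPARK_HOME")
        if not spark_home:
            raise RuntimeError("SPARK_HOME must point to a Spark distribution")
        # binary Spark distributions ship a RELEASE file naming the version
        release_file = os.path.join(spark_home, "RELEASE")
        if os.path.isfile(release_file):
            with open(release_file) as f:
                if __version__ not in f.read():
                    raise ImportError(
                        "pip-installed pyspark %s does not match the Spark "
                        "distribution at %s" % (__version__, spark_home))

Called at import time, a check like this would fail fast on any mismatch
instead of producing subtle Py4J errors later.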
On Sat, Jun 6, 2015 at 12:48 AM, Olivier Girardot <o.girardot@> wrote:
> Ok, I get it. Now what can we do to improve the current situation?
> Right now, if I want to set up a CI env for PySpark, I have to:
>
> 1- download a pre-built version of pyspark and unzip it somewhere on
>    every agent
> 2- define the SPARK_HOME env variable
> 3- symlink this distribution's pyspark dir inside the Python install's
>    site-packages/ directory
>
> and if I rely on additional packages (like databricks' spark-csv
> project), I have to (unless I'm mistaken):
>
> 4- compile/assemble spark-csv and deploy the jar in a specific
>    directory on every agent
> 5- add this jar-filled directory to the Spark distribution's additional
>    classpath using the conf/spark-defaults.conf file
>
> Then finally we can launch our unit/integration tests. Some issues are
> related to spark-packages, some to the lack of Python-based dependency
> handling, and some to the way SparkContexts are launched when using
> pyspark. I think steps 1 and 2 are fair enough. Steps 4 and 5 may
> already have solutions; I didn't check, and considering spark-shell
> downloads such dependencies automatically, I expect they will be
> solved if they aren't already.
>
> For step 3, maybe just adding a setup.py to the distribution would be
> enough. I'm not exactly advocating distributing a full 300MB Spark
> distribution on PyPI -- maybe there's a better compromise?
>
> Regards,
>
> Olivier.

On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam <jey@.berkeley> wrote:
> Couldn't we have a pip-installable "pyspark" package that just serves
> as a shim to an existing Spark installation? Or it could even download
> the latest Spark binary if SPARK_HOME isn't set during installation.
> Right now, Spark doesn't play very well with the usual Python
> ecosystem. For example, why do I need to use a strange incantation
> when booting up IPython if I want to use PySpark in a notebook with
> MASTER="local[4]"? It would be much nicer to just type `from pyspark
> import SparkContext; sc = SparkContext("local[4]")` in my notebook.
>
> I did a test, and it seems like PySpark's basic unit tests do pass
> when SPARK_HOME is set and Py4J is on the PYTHONPATH:
>
>     PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH \
>         python $SPARK_HOME/python/pyspark/rdd.py
>
> -Jey
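A shim along the lines Jey describes is essentially what findspark
(linked above by Jeremy) does. A minimal sketch, assuming SPARK_HOME
points at a standard binary distribution (the function name is
illustrative):

    # Minimal findspark-style shim; init() is an illustrative name.
    import glob
    import os
    import sys

    def init(spark_home=None):
        """Make the pyspark bundled with a Spark distribution importable."""
        spark_home = spark_home or os.environ["SPARK_HOME"]
        # pyspark itself lives under python/
        sys.path.insert(0, os.path.join(spark_home, "python"))
        # the bundled Py4J zip, e.g. python/lib/py4j-0.8.2.1-src.zip
        for py4j_zip in glob.glob(
                os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
            sys.path.insert(0, py4j_zip)

After init(), `from pyspark import SparkContext; sc =
SparkContext("local[4]")` works from a plain interpreter or notebook,
without the PYTHONPATH incantation.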
On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenville@> wrote:
> This has been proposed before:
> https://issues.apache.org/jira/browse/SPARK-1267
>
> There's currently tighter coupling between the Python and Java halves
> of PySpark than just requiring SPARK_HOME to be set; if we did this, I
> bet we'd run into tons of issues when users try to run a newer version
> of the Python half of PySpark against an older set of Java components,
> or vice versa.

On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <o.girardot@> wrote:
> Hi everyone,
>
> Considering the Python API is just a front end needing SPARK_HOME
> defined anyway, I think it would be interesting to deploy the Python
> part of Spark on PyPI in order to handle the dependencies of a Python
> project needing PySpark via pip.
>
> For now I just symlink python/pyspark into my Python install's
> site-packages/ directory so that PyCharm and other lint tools work
> properly. I can do the setup.py work, or anything else.
>
> What do you think?
>
> Regards,
>
> Olivier.
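For reference, Olivier's symlink workaround as a one-off script (a
sketch; it assumes SPARK_HOME is set and site-packages is writable, and
it still leaves the bundled Py4J zip to be added to PYTHONPATH
separately):

    # One-off equivalent of symlinking python/pyspark into site-packages.
    import os
    from distutils.sysconfig import get_python_lib

    spark_home = os.environ["SPARK_HOME"]
    os.symlink(os.path.join(spark_home, "python", "pyspark"),
               os.path.join(get_python_lib(), "pyspark"))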