Matt Goodman wrote:
> I would tentatively suggest also conda packaging.
>
> http://conda.pydata.org/docs/
    $ conda skeleton pypi pyspark
    # update git_tag and git_uri
    # add test commands (import pyspark; import pyspark.[...])

Docs for building conda packages for multiple operating systems and
interpreters from PyPI packages:

* http://www.pydanny.com/building-conda-packages-for-multiple-operating-systems.html
* https://github.com/audreyr/cookiecutter/issues/232

Matt Goodman wrote:
> --Matthew Goodman
>
> =====================
> Check Out My Website: http://craneium.net
> Find me on LinkedIn: http://tinyurl.com/d6wlch

On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <davies@> wrote:
> I think so, any contributions on this are welcome.

On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <ellisonbg@> wrote:
> Sorry, trying to follow the context here. Does it look like there is
> support for the idea of creating a setup.py file and PyPI package for
> pyspark?
>
> Cheers,
>
> Brian
>
> --
> Brian E. Granger
> Cal Poly State University, San Luis Obispo
> @ellisonbg on Twitter and GitHub
> bgranger@ and ellisonbg@

On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <davies@> wrote:
> We could do that after 1.5 is released; it will have the same release
> cycle as Spark in the future.

On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot <o.girardot@> wrote:
> +1 (once again :) )

2015-07-28 14:51 GMT+02:00 Justin Uang <justin.uang@>:
> // ping
>
> Do we have any signoff from the pyspark devs to submit a PR to publish
> to PyPI?

On Fri, Jul 24, 2015 at 10:50 PM, Jeremy Freeman <freeman.jeremy@> wrote:
> Hey all, great discussion, just wanted to +1 that I see a lot of value in
> steps that make it easier to use PySpark as an ordinary Python library.
>
> You might want to check out findspark (https://github.com/minrk/findspark),
> started by Jupyter project devs, which offers one way to facilitate this
> stuff. I've also cc'ed them here to join the conversation.
>
> Also, @Jey, I can confirm that at least in some scenarios (I've done it in
> an EC2 cluster in standalone mode) it's possible to run PySpark jobs just
> using `from pyspark import SparkContext; sc = SparkContext(master="X")`,
> so long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are
> set correctly on *both* workers and driver. That said, there's definitely
> additional configuration / functionality that would require going through
> the proper submit scripts.
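For anyone who wants to try the findspark route Jeremy describes, a minimal
sketch could look like the following. It assumes findspark is pip-installed
and SPARK_HOME points at an unpacked Spark distribution; the master URL and
app name are just placeholders:

    # Sketch only: findspark locates SPARK_HOME and prepends pyspark and the
    # bundled py4j zip to sys.path, so a plain Python interpreter can import it.
    import findspark
    findspark.init()  # or findspark.init("/path/to/spark") if SPARK_HOME is unset

    from pyspark import SparkContext

    sc = SparkContext(master="local[4]", appName="findspark-sketch")
    print(sc.parallelize(range(100)).sum())
    sc.stop()

On a real cluster this still needs PYTHONPATH and PYSPARK_PYTHON set on the
workers, as Jeremy notes; findspark only fixes up the driver side.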
On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <punya.biswal@> wrote:
> I agree with everything Justin just said. An additional advantage of
> publishing PySpark's Python code in a standards-compliant way is that
> we'll be able to declare transitive dependencies (Pandas, Py4J) in a way
> that pip can use. Contrast this with the current situation, where
> df.toPandas() exists in the Spark API but doesn't actually work until you
> install Pandas.
>
> Punya

On Wed, Jul 22, 2015 at 12:49 PM, Justin Uang <justin.uang@> wrote:
> // + Davies for his comments
> // + Punya for SA
>
> For development and CI, like Olivier mentioned, I think it would be hugely
> beneficial to publish pyspark (only the code in the python/ dir) on PyPI.
> If anyone wants to develop against PySpark APIs, they need to download the
> distribution and do a lot of PYTHONPATH munging for all the tools (pylint,
> pytest, IDE code completion). Right now that involves adding python/ and
> python/lib/py4j-0.8.2.1-src.zip. If pyspark ever wants to add more
> dependencies, we would have to manually mirror all the PYTHONPATH munging
> in the ./pyspark script. With a proper pyspark setup.py that declares its
> dependencies, and a published distribution, depending on pyspark would
> just be a matter of adding pyspark to my setup.py dependencies.
>
> Of course, if we actually want to run the parts of pyspark that are backed
> by Py4J calls, then we need the full Spark distribution with either
> ./pyspark or ./spark-submit, but for things like linting and development,
> the PYTHONPATH munging is very annoying.
>
> I don't think the version-mismatch issues are a compelling reason not to
> go ahead with PyPI publishing. At runtime, we should definitely enforce
> that the versions match exactly, which means there is no backcompat
> nightmare as suggested by Davies in
> https://issues.apache.org/jira/browse/SPARK-1267. This would mean that
> even if the user's pip-installed pyspark somehow got loaded before the
> pyspark provided by the Spark distribution, the user would be alerted
> immediately.
>
> Davies, if you buy this, should I or someone on my team pick up
> https://issues.apache.org/jira/browse/SPARK-1267 and
> https://github.com/apache/spark/pull/464?

On Sat, Jun 6, 2015 at 12:48 AM, Olivier Girardot <o.girardot@> wrote:
> Ok, I get it. Now what can we do to improve the current situation?
> Because right now, if I want to set up a CI env for PySpark, I have to:
>
> 1- download a pre-built version of pyspark and unzip it somewhere on
>    every agent
> 2- define the SPARK_HOME env variable
> 3- symlink this distribution's pyspark dir into the Python install's
>    site-packages/ directory
>
> and if I rely on additional packages (like Databricks' spark-csv
> project), I have to (unless I'm mistaken):
>
> 4- compile/assemble spark-csv and deploy the jar to a specific directory
>    on every agent
> 5- add this jar-filled directory to the Spark distribution's additional
>    classpath using the conf/spark-defaults.conf file
>
> Then finally we can launch our unit/integration tests.
>
> Some issues are related to spark-packages, some to the lack of a
> Python-based dependency, and some to the way the SparkContext is launched
> when using pyspark. I think steps 1 and 2 are fair enough. Steps 4 and 5
> may already have solutions; I didn't check, and considering spark-shell
> downloads such dependencies automatically, I think if nothing's done yet,
> it will be (I guess?).
>
> For step 3, maybe just adding a setup.py to the distribution would be
> enough. I'm not exactly advocating distributing a full 300 MB Spark
> distribution on PyPI; maybe there's a better compromise?
>
> Regards,
>
> Olivier.
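To make the setup.py idea concrete, a rough sketch of a minimal setup.py for
the python/ directory might look like the following. Only the package name is
taken from the discussion above; the version, the py4j pin, and the pandas
extra are illustrative assumptions, not Spark's actual build configuration:

    # Hedged sketch only: the values below (version, py4j pin, extras) are
    # assumptions for illustration, not Spark's real packaging metadata.
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.5.0",  # would have to track the Spark release exactly
        packages=find_packages(exclude=["*.tests", "*.tests.*"]),
        install_requires=[
            "py4j==0.8.2.1",  # must match the py4j zip bundled with Spark
        ],
        extras_require={
            # optional, for df.toPandas() as Punya points out above
            "pandas": ["pandas"],
        },
    )

With something like this published, a downstream project's CI would only need
to list pyspark in its own dependencies and pip would pull in Py4J (and,
optionally, pandas) automatically; running against a real cluster would still
require SPARK_HOME and the matching Spark jars.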
On Fri, Jun 5, 2015 at 22:12, Jey Kottalam <jey@.berkeley> wrote:
> Couldn't we have a pip-installable "pyspark" package that just serves as
> a shim to an existing Spark installation? Or it could even download the
> latest Spark binary if SPARK_HOME isn't set during installation. Right
> now, Spark doesn't play very well with the usual Python ecosystem. For
> example, why do I need to use a strange incantation when booting up
> IPython if I want to use PySpark in a notebook with MASTER="local[4]"?
> It would be much nicer to just type `from pyspark import SparkContext;
> sc = SparkContext("local[4]")` in my notebook.
>
> I did a test and it seems like PySpark's basic unit tests do pass when
> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>
>     PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH \
>         python $SPARK_HOME/python/pyspark/rdd.py
>
> -Jey

On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenville@> wrote:
> This has been proposed before:
> https://issues.apache.org/jira/browse/SPARK-1267
>
> There's currently tighter coupling between the Python and Java halves of
> PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
> we'd run into tons of issues when users try to run a newer version of the
> Python half of PySpark against an older set of Java components, or
> vice versa.

On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <o.girardot@> wrote:
> Hi everyone,
>
> Considering the Python API is just a front end that needs SPARK_HOME
> defined anyway, I think it would be interesting to deploy the Python part
> of Spark on PyPI in order to handle the dependencies of a Python project
> that needs PySpark via pip.
>
> For now I just symlink python/pyspark into my Python install's
> site-packages/ directory in order for PyCharm or other lint tools to work
> properly. I can do the setup.py work or anything.
>
> What do you think?
>
> Regards,
>
> Olivier.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-on-PyPi-tp12626p13635.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org