I would also tentatively suggest conda packaging: http://conda.pydata.org/docs/
--Matthew Goodman
=====================
Check Out My Website: http://craneium.net
Find me on LinkedIn: http://tinyurl.com/d6wlch

On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <dav...@databricks.com> wrote:
> I think so, any contributions on this are welcome.
>
> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <elliso...@gmail.com> wrote:
> > Sorry, trying to follow the context here. Does it look like there is support for the idea of creating a setup.py file and PyPI package for pyspark?
> >
> > Cheers,
> >
> > Brian
> >
> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <dav...@databricks.com> wrote:
> >> We could do that after 1.5 is released; it will have the same release cycle as Spark in the future.
> >>
> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
> >>> +1 (once again :) )
> >>>
> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang <justin.u...@gmail.com>:
> >>>>
> >>>> // ping
> >>>>
> >>>> Do we have any signoff from the pyspark devs to submit a PR to publish to PyPI?
> >>>>
> >>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman <freeman.jer...@gmail.com> wrote:
> >>>>>
> >>>>> Hey all, great discussion, just wanted to +1 that I see a lot of value in steps that make it easier to use PySpark as an ordinary Python library.
> >>>>>
> >>>>> You might want to check out findspark (https://github.com/minrk/findspark), started by Jupyter project devs, which offers one way to facilitate this stuff. I've also cc'ed them here to join the conversation.
> >>>>>
> >>>>> Also, @Jey, I can confirm that at least in some scenarios (I've done it on an EC2 cluster in standalone mode) it's possible to run PySpark jobs using just `from pyspark import SparkContext; sc = SparkContext(master="X")`, so long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly on *both* workers and driver. That said, there's definitely additional configuration / functionality that would require going through the proper submit scripts.
> >>>>>
> >>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <punya.bis...@gmail.com> wrote:
> >>>>>
> >>>>> I agree with everything Justin just said. An additional advantage of publishing PySpark's Python code in a standards-compliant way is the fact that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way that pip can use. Contrast this with the current situation, where df.toPandas() exists in the Spark API but doesn't actually work until you install Pandas.
> >>>>>
> >>>>> Punya
> >>>>>
> >>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <justin.u...@gmail.com> wrote:
> >>>>>>
> >>>>>> // + Davies for his comments
> >>>>>> // + Punya for SA
> >>>>>>
> >>>>>> For development and CI, like Olivier mentioned, I think it would be hugely beneficial to publish pyspark (only the code in the python/ dir) on PyPI. If anyone wants to develop against PySpark APIs, they need to download the distribution and do a lot of PYTHONPATH munging for all the tools (pylint, pytest, IDE code completion). Right now that involves adding python/ and python/lib/py4j-0.8.2.1-src.zip. If pyspark ever wants to add more dependencies, we would have to manually mirror all the PYTHONPATH munging in the ./pyspark script.
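To make that munging concrete, here is a minimal sketch of what it takes today to point lint/test tooling at an unpacked distribution so that `import pyspark` resolves. The directory layout and py4j zip name are assumptions based on the current tree; the findspark package mentioned above automates roughly this:

    import glob
    import os
    import sys

    # Sketch only: assumes SPARK_HOME points at an unpacked Spark distribution or checkout.
    spark_home = os.environ["SPARK_HOME"]

    # Put the pure-Python package and the bundled Py4J zip on sys.path so that
    # pylint, pytest, and IDE completion can resolve `import pyspark` without spark-submit.
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))

    import pyspark  # now importable for linting and development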
> >>>>>> With a proper pyspark setup.py that declares its dependencies, and a published distribution, depending on pyspark would just be a matter of adding pyspark to my setup.py dependencies.
> >>>>>>
> >>>>>> Of course, if we actually want to run the parts of pyspark that are backed by Py4J calls, then we need the full Spark distribution with either ./pyspark or ./spark-submit, but for things like linting and development, the PYTHONPATH munging is very annoying.
> >>>>>>
> >>>>>> I don't think the version-mismatch issues are a compelling reason not to go ahead with PyPI publishing. At runtime, we should definitely enforce that the versions match exactly, which means there is no backcompat nightmare as suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267. This would mean that even if the user's pip-installed pyspark somehow got loaded before the pyspark provided by the Spark distribution, the user would be alerted immediately.
> >>>>>>
> >>>>>> Davies, if you buy this, should I or someone on my team pick up https://issues.apache.org/jira/browse/SPARK-1267 and https://github.com/apache/spark/pull/464?
> >>>>>>
> >>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
> >>>>>>>
> >>>>>>> Ok, I get it. Now, what can we do to improve the current situation? Because right now, if I want to set up a CI env for PySpark, I have to:
> >>>>>>> 1- download a pre-built version of pyspark and unzip it somewhere on every agent
> >>>>>>> 2- define the SPARK_HOME env variable
> >>>>>>> 3- symlink this distribution's pyspark dir into the Python install's site-packages/ directory
> >>>>>>> and if I rely on additional packages (like databricks' Spark-CSV project), I have to (unless I'm mistaken):
> >>>>>>> 4- compile/assemble spark-csv and deploy the jar into a specific directory on every agent
> >>>>>>> 5- add this jar-filled directory to the Spark distribution's additional classpath using the conf/spark-defaults file
> >>>>>>>
> >>>>>>> Then finally we can launch our unit/integration tests.
> >>>>>>> Some issues are related to spark-packages, some to the lack of Python-based dependency handling, and some to the way SparkContexts are launched when using pyspark.
> >>>>>>> I think steps 1 and 2 are fair enough.
> >>>>>>> Steps 4 and 5 may already have solutions; I didn't check, and considering spark-shell downloads such dependencies automatically, I think if nothing's done yet it will be (I guess?).
> >>>>>>>
> >>>>>>> For step 3, maybe just adding a setup.py to the distribution would be enough. I'm not exactly advocating distributing a full 300 MB Spark distribution on PyPI; maybe there's a better compromise?
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>>
> >>>>>>> Olivier.
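As a rough illustration of the step-3 suggestion above, a bare-bones setup.py for the python/ directory might look like the sketch below. This is not an agreed-on design from this thread: the package list, the version pin, and the optional Pandas extra are all assumptions.

    # Hypothetical python/setup.py for the pure-Python half of Spark -- a sketch only.
    from setuptools import setup

    setup(
        name="pyspark",
        version="1.5.0",  # would have to track the Spark release exactly
        packages=[
            "pyspark",
            "pyspark.mllib",
            "pyspark.ml",
            "pyspark.sql",
            "pyspark.streaming",
        ],
        install_requires=[
            "py4j==0.8.2.1",  # the Py4J version currently bundled under python/lib/
        ],
        extras_require={
            # lets df.toPandas() users opt in to Pandas at install time
            "pandas": ["pandas"],
        },
    )

With something like this published, a CI agent could `pip install pyspark` (or `pip install -e $SPARK_HOME/python`) instead of symlinking the distribution into site-packages/.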
> >>>>>>> On Fri, Jun 5, 2015 at 22:12, Jey Kottalam <j...@cs.berkeley.edu> wrote:
> >>>>>>>>
> >>>>>>>> Couldn't we have a pip-installable "pyspark" package that just serves as a shim to an existing Spark installation? Or it could even download the latest Spark binary if SPARK_HOME isn't set during installation. Right now, Spark doesn't play very well with the usual Python ecosystem. For example, why do I need to use a strange incantation when booting up IPython if I want to use PySpark in a notebook with MASTER="local[4]"? It would be much nicer to just type `from pyspark import SparkContext; sc = SparkContext("local[4]")` in my notebook.
> >>>>>>>>
> >>>>>>>> I did a test and it seems like PySpark's basic unit tests do pass when SPARK_HOME is set and Py4J is on the PYTHONPATH:
> >>>>>>>>
> >>>>>>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py
> >>>>>>>>
> >>>>>>>> -Jey
> >>>>>>>>
> >>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> This has been proposed before: https://issues.apache.org/jira/browse/SPARK-1267
> >>>>>>>>>
> >>>>>>>>> There's currently tighter coupling between the Python and Java halves of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet we'd run into tons of issues when users try to run a newer version of the Python half of PySpark against an older set of Java components, or vice versa.
> >>>>>>>>>
> >>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi everyone,
> >>>>>>>>>> Considering the Python API is just a front end that needs SPARK_HOME defined anyway, I think it would be interesting to deploy the Python part of Spark on PyPI in order to handle the dependencies of a Python project needing PySpark via pip.
> >>>>>>>>>>
> >>>>>>>>>> For now I just symlink python/pyspark into my Python install's site-packages/ directory so that PyCharm and other lint tools work properly.
> >>>>>>>>>> I can do the setup.py work or anything.
> >>>>>>>>>>
> >>>>>>>>>> What do you think?
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>>
> >>>>>>>>>> Olivier.
> >
> > --
> > Brian E. Granger
> > Cal Poly State University, San Luis Obispo
> > @ellisonbg on Twitter and GitHub
> > bgran...@calpoly.edu and elliso...@gmail.com
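Finally, a sketch of the exact-version enforcement Justin proposes above, which would also address Josh's concern about mixing a newer Python half of PySpark with older Java components. It assumes a published pyspark package would expose a `__version__` attribute; `sc.version` already reports the version of the running JVM side:

    # Sketch of a runtime guard against mixing a pip-installed pyspark
    # with a different Spark runtime. pyspark.__version__ is an assumption
    # about what a published package would expose.
    import pyspark
    from pyspark import SparkContext

    def ensure_matching_versions(sc):
        jvm_version = sc.version  # version of the Spark runtime on the JVM side
        py_version = getattr(pyspark, "__version__", None)
        if py_version is not None and py_version != jvm_version:
            raise RuntimeError(
                "pyspark %s does not match the Spark runtime %s; "
                "please install the matching version" % (py_version, jvm_version))

    sc = SparkContext("local[4]")
    ensure_matching_versions(sc)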