I think so, any contributions on this are welcome.

On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <elliso...@gmail.com> wrote:
> Sorry, trying to follow the context here. Does it look like there is support for the idea of creating a setup.py file and PyPI package for pyspark?
>
> Cheers,
>
> Brian
>
> On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <dav...@databricks.com> wrote:
>> We could do that after 1.5 is released; it will have the same release cycle as Spark in the future.
>>
>> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>> +1 (once again :) )
>>>
>>> 2015-07-28 14:51 GMT+02:00 Justin Uang <justin.u...@gmail.com>:
>>>>
>>>> // ping
>>>>
>>>> Do we have any signoff from the pyspark devs to submit a PR to publish to PyPI?
>>>>
>>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman <freeman.jer...@gmail.com> wrote:
>>>>>
>>>>> Hey all, great discussion. Just wanted to +1 that I see a lot of value in steps that make it easier to use PySpark as an ordinary Python library.
>>>>>
>>>>> You might want to check out this project (https://github.com/minrk/findspark), started by Jupyter project devs, which offers one way to facilitate this stuff. I've also cc'ed them here to join the conversation.
>>>>>
>>>>> Also, @Jey, I can confirm that at least in some scenarios (I've done it in an EC2 cluster in standalone mode) it's possible to run PySpark jobs just using `from pyspark import SparkContext; sc = SparkContext(master="X")`, so long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly on *both* workers and driver. That said, there's definitely additional configuration / functionality that would require going through the proper submit scripts.
>>>>>
>>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <punya.bis...@gmail.com> wrote:
>>>>>
>>>>> I agree with everything Justin just said. An additional advantage of publishing PySpark's Python code in a standards-compliant way is the fact that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way that pip can use. Contrast this with the current situation, where df.toPandas() exists in the Spark API but doesn't actually work until you install Pandas.
>>>>>
>>>>> Punya
>>>>>
>>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <justin.u...@gmail.com> wrote:
>>>>>>
>>>>>> // + Davies for his comments
>>>>>> // + Punya for SA
>>>>>>
>>>>>> For development and CI, like Olivier mentioned, I think it would be hugely beneficial to publish pyspark (only the code in the python/ dir) on PyPI. If anyone wants to develop against PySpark APIs, they need to download the distribution and do a lot of PYTHONPATH munging for all the tools (pylint, pytest, IDE code completion). Right now that involves adding python/ and python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more dependencies, we would have to manually mirror all the PYTHONPATH munging in the ./pyspark script. With a proper pyspark setup.py which declares its dependencies, and a published distribution, depending on pyspark will just be adding pyspark to my setup.py dependencies.
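>>>>>>
>>>>>> As a rough illustration (not a worked-out proposal; the version pin, dependency list, and packaging layout below are just placeholders), such a setup.py could look something like:
>>>>>>
>>>>>>     # hypothetical setup.py for the python/ dir; names and versions are placeholders
>>>>>>     from setuptools import setup, find_packages
>>>>>>
>>>>>>     setup(
>>>>>>         name="pyspark",
>>>>>>         version="1.5.0",           # would have to track the Spark release exactly
>>>>>>         packages=find_packages(),
>>>>>>         install_requires=[
>>>>>>             "py4j==0.8.2.1",       # instead of shipping the -src.zip on PYTHONPATH
>>>>>>         ],
>>>>>>         extras_require={
>>>>>>             "pandas": ["pandas"],  # e.g. so df.toPandas() works out of the box
>>>>>>         },
>>>>>>     )
>>>>>>
>>>>>> A downstream project would then just put "pyspark" in its own install_requires and get Py4J (and optionally Pandas) resolved by pip, with no PYTHONPATH munging for linting or IDE completion.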
>>>>>>
>>>>>> Of course, if we actually want to run the parts of pyspark that are backed by Py4J calls, then we need the full Spark distribution with either ./pyspark or ./spark-submit, but for things like linting and development, the PYTHONPATH munging is very annoying.
>>>>>>
>>>>>> I don't think the version-mismatch issues are a compelling reason not to go ahead with PyPI publishing. At runtime, we should definitely enforce that the version has to be exact, which means there is no backcompat nightmare as suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267. This would mean that even if the user's pip-installed pyspark somehow got loaded before the pyspark provided by the Spark distribution, the user would be alerted immediately.
>>>>>>
>>>>>> Davies, if you buy this, should I or someone on my team pick up https://issues.apache.org/jira/browse/SPARK-1267 and https://github.com/apache/spark/pull/464?
>>>>>>
>>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>>>>>>
>>>>>>> Ok, I get it. Now what can we do to improve the current situation? Because right now, if I want to set up a CI env for PySpark, I have to:
>>>>>>> 1- download a pre-built version of pyspark and unzip it somewhere on every agent
>>>>>>> 2- define the SPARK_HOME env variable
>>>>>>> 3- symlink this distribution's pyspark dir inside the Python install dir's site-packages/ directory
>>>>>>> and if I rely on additional packages (like Databricks' spark-csv project), I have to (unless I'm mistaken):
>>>>>>> 4- compile/assemble spark-csv and deploy the jar in a specific directory on every agent
>>>>>>> 5- add this jar-filled directory to the Spark distribution's additional classpath using the conf/spark-defaults.conf file
>>>>>>>
>>>>>>> Then finally we can launch our unit/integration tests. Some issues are related to spark-packages, some to the lack of Python-based dependency handling, and some to the way SparkContexts are launched when using pyspark. I think steps 1 and 2 are fair enough. Steps 4 and 5 may already have solutions; I didn't check, and considering spark-shell downloads such dependencies automatically, I think if nothing's done yet, it will be (I guess?).
>>>>>>>
>>>>>>> For step 3, maybe just adding a setup.py to the distribution would be enough. I'm not exactly advocating distributing a full 300MB Spark distribution on PyPI; maybe there's a better compromise?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Olivier.
>>>>>>>
>>>>>>> On Fri, Jun 5, 2015 at 22:12, Jey Kottalam <j...@cs.berkeley.edu> wrote:
>>>>>>>>
>>>>>>>> Couldn't we have a pip-installable "pyspark" package that just serves as a shim to an existing Spark installation? Or it could even download the latest Spark binary if SPARK_HOME isn't set during installation. Right now, Spark doesn't play very well with the usual Python ecosystem. For example, why do I need to use a strange incantation when booting up IPython if I want to use PySpark in a notebook with MASTER="local[4]"? It would be much nicer to just type `from pyspark import SparkContext; sc = SparkContext("local[4]")` in my notebook.
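>>>>>>>>
>>>>>>>> To make the shim idea concrete, here is a rough sketch of what the path-setup logic might look like (illustrative only; the helper name and the exact layout of the Spark install are assumptions):
>>>>>>>>
>>>>>>>>     # sketch of a small path-setup helper; not an actual published package
>>>>>>>>     import glob
>>>>>>>>     import os
>>>>>>>>     import sys
>>>>>>>>
>>>>>>>>     def init(spark_home=None):
>>>>>>>>         """Put an existing Spark install's Python sources on sys.path."""
>>>>>>>>         spark_home = spark_home or os.environ.get("SPARK_HOME")
>>>>>>>>         if not spark_home:
>>>>>>>>             raise ValueError("SPARK_HOME is not set; point it at a Spark installation")
>>>>>>>>         sys.path.insert(0, os.path.join(spark_home, "python"))
>>>>>>>>         # the bundled Py4J ships as a versioned -src.zip under python/lib
>>>>>>>>         for py4j_zip in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
>>>>>>>>             sys.path.insert(0, py4j_zip)
>>>>>>>>
>>>>>>>> After calling something like init(), `from pyspark import SparkContext; sc = SparkContext("local[4]")` should work in a plain IPython session, at least for local mode; cluster modes would still need the worker-side environment set up consistently.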
>>>>>>>>
>>>>>>>> I did a test, and it seems like PySpark's basic unit tests do pass when SPARK_HOME is set and Py4J is on the PYTHONPATH:
>>>>>>>>
>>>>>>>>     PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py
>>>>>>>>
>>>>>>>> -Jey
>>>>>>>>
>>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> This has been proposed before: https://issues.apache.org/jira/browse/SPARK-1267
>>>>>>>>>
>>>>>>>>> There's currently tighter coupling between the Python and Java halves of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet we'd run into tons of issues when users try to run a newer version of the Python half of PySpark against an older set of Java components, or vice versa.
>>>>>>>>>
>>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>> Considering that the Python API is just a front-end needing SPARK_HOME defined anyway, I think it would be interesting to deploy the Python part of Spark on PyPI in order to handle the dependencies in a Python project needing PySpark via pip.
>>>>>>>>>>
>>>>>>>>>> For now I just symlink python/pyspark into my Python install dir's site-packages/ directory in order for PyCharm or other lint tools to work properly. I can do the setup.py work or anything.
>>>>>>>>>>
>>>>>>>>>> What do you think?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Olivier.
>
> --
> Brian E. Granger
> Cal Poly State University, San Luis Obispo
> @ellisonbg on Twitter and GitHub
> bgran...@calpoly.edu and elliso...@gmail.com