I would also tentatively suggest conda packaging: http://conda.pydata.org/docs/
--Matthew Goodman
=====================
Check Out My Website: http://craneium.net
Find me on LinkedIn: http://tinyurl.com/d6wlch

On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <dav...@databricks.com> wrote:
> I think so, any contributions on this are welcome.
>
> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <elliso...@gmail.com> wrote:
> > Sorry, trying to follow the context here. Does it look like there is support for the idea of creating a setup.py file and PyPI package for pyspark?
> >
> > Cheers,
> >
> > Brian
> >
> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <dav...@databricks.com> wrote:
> >> We could do that after 1.5 is released; it will have the same release cycle as Spark in the future.
> >>
> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
> >>> +1 (once again :) )
> >>>
> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang <justin.u...@gmail.com>:
> >>>>
> >>>> // ping
> >>>>
> >>>> Do we have any signoff from the pyspark devs to submit a PR to publish to PyPI?
> >>>>
> >>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman <freeman.jer...@gmail.com> wrote:
> >>>>>
> >>>>> Hey all, great discussion, just wanted to +1 that I see a lot of value in steps that make it easier to use PySpark as an ordinary Python library.
> >>>>>
> >>>>> You might want to check out findspark (https://github.com/minrk/findspark), started by Jupyter project devs, which offers one way to facilitate this stuff. I've also cc'ed them here to join the conversation.
> >>>>>
> >>>>> Also, @Jey, I can confirm that at least in some scenarios (I've done it on an EC2 cluster in standalone mode) it's possible to run PySpark jobs using just `from pyspark import SparkContext; sc = SparkContext(master="X")`, so long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly on *both* workers and driver. That said, there's definitely additional configuration / functionality that would require going through the proper submit scripts.
> >>>>>
> >>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <punya.bis...@gmail.com> wrote:
> >>>>>
> >>>>> I agree with everything Justin just said. An additional advantage of publishing PySpark's Python code in a standards-compliant way is the fact that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way that pip can use. Contrast this with the current situation, where df.toPandas() exists in the Spark API but doesn't actually work until you install Pandas.
> >>>>>
> >>>>> Punya
> >>>>>
> >>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <justin.u...@gmail.com> wrote:
> >>>>>>
> >>>>>> // + Davies for his comments
> >>>>>> // + Punya for SA
> >>>>>>
> >>>>>> For development and CI, like Olivier mentioned, I think it would be hugely beneficial to publish pyspark (only the code in the python/ dir) on PyPI. If anyone wants to develop against PySpark APIs, they need to download the distribution and do a lot of PYTHONPATH munging for all the tools (pylint, pytest, IDE code completion). Right now that involves adding python/ and python/lib/py4j-0.8.2.1-src.zip. If pyspark ever wants to add more dependencies, we would have to manually mirror all the PYTHONPATH munging in the ./pyspark script.
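To make that munging concrete, here is a minimal sketch of what it takes today to point lint/test tooling at an unpacked distribution so that `import pyspark` resolves. The directory layout and py4j zip name are assumptions based on the current tree; the findspark package mentioned above automates roughly this:

    import glob
    import os
    import sys

    # Sketch only: assumes SPARK_HOME points at an unpacked Spark distribution or checkout.
    spark_home = os.environ["SPARK_HOME"]

    # Put the pure-Python package and the bundled Py4J zip on sys.path so that
    # pylint, pytest, and IDE completion can resolve `import pyspark` without spark-submit.
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))

    import pyspark  # now importable for linting and development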
> >>>>>> With a proper pyspark setup.py that declares its dependencies, and a published distribution, depending on pyspark would just be a matter of adding pyspark to my setup.py dependencies.
> >>>>>>
> >>>>>> Of course, if we actually want to run the parts of pyspark that are backed by Py4J calls, then we need the full Spark distribution with either ./pyspark or ./spark-submit, but for things like linting and development, the PYTHONPATH munging is very annoying.
> >>>>>>
> >>>>>> I don't think the version-mismatch issues are a compelling reason not to go ahead with PyPI publishing. At runtime, we should definitely enforce that the versions match exactly, which means there is no backcompat nightmare as suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267. This would mean that even if the user's pip-installed pyspark somehow got loaded before the pyspark provided by the Spark distribution, the user would be alerted immediately.
> >>>>>>
> >>>>>> Davies, if you buy this, should I or someone on my team pick up https://issues.apache.org/jira/browse/SPARK-1267 and https://github.com/apache/spark/pull/464?
> >>>>>>
> >>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
> >>>>>>>
> >>>>>>> Ok, I get it. Now, what can we do to improve the current situation? Because right now, if I want to set up a CI env for PySpark, I have to:
> >>>>>>> 1- download a pre-built version of pyspark and unzip it somewhere on every agent
> >>>>>>> 2- define the SPARK_HOME env variable
> >>>>>>> 3- symlink this distribution's pyspark dir into the Python install's site-packages/ directory
> >>>>>>> and if I rely on additional packages (like databricks' Spark-CSV project), I have to (unless I'm mistaken):
> >>>>>>> 4- compile/assemble spark-csv and deploy the jar into a specific directory on every agent
> >>>>>>> 5- add this jar-filled directory to the Spark distribution's additional classpath using the conf/spark-defaults file
> >>>>>>>
> >>>>>>> Then finally we can launch our unit/integration tests.
> >>>>>>> Some issues are related to spark-packages, some to the lack of Python-based dependency handling, and some to the way SparkContexts are launched when using pyspark.
> >>>>>>> I think steps 1 and 2 are fair enough.
> >>>>>>> Steps 4 and 5 may already have solutions; I didn't check, and considering spark-shell downloads such dependencies automatically, I think if nothing's done yet it will be (I guess?).
> >>>>>>>
> >>>>>>> For step 3, maybe just adding a setup.py to the distribution would be enough. I'm not exactly advocating distributing a full 300 MB Spark distribution on PyPI; maybe there's a better compromise?
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>>
> >>>>>>> Olivier.
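As a rough illustration of the step-3 suggestion above, a bare-bones setup.py for the python/ directory might look like the sketch below. This is not an agreed-on design from this thread: the package list, the version pin, and the optional Pandas extra are all assumptions.

    # Hypothetical python/setup.py for the pure-Python half of Spark -- a sketch only.
    from setuptools import setup

    setup(
        name="pyspark",
        version="1.5.0",  # would have to track the Spark release exactly
        packages=[
            "pyspark",
            "pyspark.mllib",
            "pyspark.ml",
            "pyspark.sql",
            "pyspark.streaming",
        ],
        install_requires=[
            "py4j==0.8.2.1",  # the Py4J version currently bundled under python/lib/
        ],
        extras_require={
            # lets df.toPandas() users opt in to Pandas at install time
            "pandas": ["pandas"],
        },
    )

With something like this published, a CI agent could `pip install pyspark` (or `pip install -e $SPARK_HOME/python`) instead of symlinking the distribution into site-packages/.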
> >>>>>>> On Fri, Jun 5, 2015 at 22:12, Jey Kottalam <j...@cs.berkeley.edu> wrote:
> >>>>>>>>
> >>>>>>>> Couldn't we have a pip-installable "pyspark" package that just serves as a shim to an existing Spark installation? Or it could even download the latest Spark binary if SPARK_HOME isn't set during installation. Right now, Spark doesn't play very well with the usual Python ecosystem. For example, why do I need to use a strange incantation when booting up IPython if I want to use PySpark in a notebook with MASTER="local[4]"? It would be much nicer to just type `from pyspark import SparkContext; sc = SparkContext("local[4]")` in my notebook.
> >>>>>>>>
> >>>>>>>> I did a test and it seems like PySpark's basic unit tests do pass when SPARK_HOME is set and Py4J is on the PYTHONPATH:
> >>>>>>>>
> >>>>>>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py
> >>>>>>>>
> >>>>>>>> -Jey
> >>>>>>>>
> >>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> This has been proposed before: https://issues.apache.org/jira/browse/SPARK-1267
> >>>>>>>>>
> >>>>>>>>> There's currently tighter coupling between the Python and Java halves of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet we'd run into tons of issues when users try to run a newer version of the Python half of PySpark against an older set of Java components, or vice versa.
> >>>>>>>>>
> >>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi everyone,
> >>>>>>>>>> Considering the Python API is just a front end that needs SPARK_HOME defined anyway, I think it would be interesting to deploy the Python part of Spark on PyPI in order to handle the dependencies of a Python project needing PySpark via pip.
> >>>>>>>>>>
> >>>>>>>>>> For now I just symlink python/pyspark into my Python install's site-packages/ directory so that PyCharm and other lint tools work properly.
> >>>>>>>>>> I can do the setup.py work or anything.
> >>>>>>>>>>
> >>>>>>>>>> What do you think?
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>>
> >>>>>>>>>> Olivier.
> >
> > --
> > Brian E. Granger
> > Cal Poly State University, San Luis Obispo
> > @ellisonbg on Twitter and GitHub
> > bgran...@calpoly.edu and elliso...@gmail.com
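Finally, a sketch of the exact-version enforcement Justin proposes above, which would also address Josh's concern about mixing a newer Python half of PySpark with older Java components. It assumes a published pyspark package would expose a `__version__` attribute; `sc.version` already reports the version of the running JVM side:

    # Sketch of a runtime guard against mixing a pip-installed pyspark
    # with a different Spark runtime. pyspark.__version__ is an assumption
    # about what a published package would expose.
    import pyspark
    from pyspark import SparkContext

    def ensure_matching_versions(sc):
        jvm_version = sc.version  # version of the Spark runtime on the JVM side
        py_version = getattr(pyspark, "__version__", None)
        if py_version is not None and py_version != jvm_version:
            raise RuntimeError(
                "pyspark %s does not match the Spark runtime %s; "
                "please install the matching version" % (py_version, jvm_version))

    sc = SparkContext("local[4]")
    ensure_matching_versions(sc)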