Matt Goodman wrote
> I would tentatively suggest also conda packaging.
> 
> http://conda.pydata.org/docs/

$ conda skeleton pypi pyspark
# in the generated meta.yaml: update git_tag and git_url
# add test commands (import pyspark; import pyspark.[...])
$ conda build pyspark
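
A run_test.py next to the generated meta.yaml is one way to express those
test commands; conda-build executes it after the build. A minimal sketch
(the exact submodule list is a guess, and py4j would need to be listed as a
run requirement of the recipe):

    # run_test.py -- executed by conda-build during the test phase
    import pyspark
    import pyspark.sql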

Docs for building conda packages for multiple operating systems and
interpreters from PyPI packages:

* http://www.pydanny.com/building-conda-packages-for-multiple-operating-systems.html
* https://github.com/audreyr/cookiecutter/issues/232


Matt Goodman wrote
> --Matthew Goodman
> 
> =====================
> Check Out My Website: http://craneium.net
> Find me on LinkedIn: http://tinyurl.com/d6wlch
> 
> On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <davies@> wrote:
> 
>> I think so, any contributions on this are welcome.
>>
>> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <ellisonbg@> wrote:
>> > Sorry, trying to follow the context here. Does it look like there is
>> > support for the idea of creating a setup.py file and pypi package for
>> > pyspark?
>> >
>> > Cheers,
>> >
>> > Brian
>> >
>> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <davies@> wrote:
>> >> We could do that after 1.5 is released; it will have the same release cycle
>> >> as Spark in the future.
>> >>
>> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot <o.girardot@> wrote:
>> >>> +1 (once again :) )
>> >>>
>> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang <justin.uang@>:
>> >>>>
>> >>>> // ping
>> >>>>
>> >>>> do we have any signoff from the pyspark devs to submit a PR to
>> >>>> publish to PyPI?
>> >>>>
>> >>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman <freeman.jeremy@> wrote:
>> >>>>>
>> >>>>> Hey all, great discussion, just wanted to +1 that I see a lot of value
>> >>>>> in steps that make it easier to use PySpark as an ordinary python library.
>> >>>>>
>> >>>>> You might want to check out this (https://github.com/minrk/findspark),
>> >>>>> started by Jupyter project devs, that offers one way to facilitate this
>> >>>>> stuff. I’ve also cced them here to join the conversation.
>> >>>>>
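
Re findspark: a minimal usage sketch, assuming SPARK_HOME points at an
unpacked Spark distribution (the master/app name here are just placeholders):

    import findspark
    findspark.init()  # reads SPARK_HOME and prepends pyspark + py4j to sys.path

    from pyspark import SparkContext
    sc = SparkContext(master="local[4]", appName="findspark-smoke-test")
    print(sc.parallelize(range(100)).sum())
    sc.stop()
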
>> >>>>> Also, @Jey, I can also confirm that at least in some scenarios (I’ve done
>> >>>>> it in an EC2 cluster in standalone mode) it’s possible to run PySpark jobs
>> >>>>> just using `from pyspark import SparkContext; sc = SparkContext(master=“X”)`
>> >>>>> so long as the environmental variables (PYTHONPATH and PYSPARK_PYTHON) are
>> >>>>> set correctly on *both* workers and driver. That said, there’s definitely
>> >>>>> additional configuration / functionality that would require going through
>> >>>>> the proper submit scripts.
>> >>>>>
>> >>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <punya.biswal@> wrote:
>> >>>>>
>> >>>>> I agree with everything Justin just said. An additional advantage of
>> >>>>> publishing PySpark's Python code in a standards-compliant way is the fact
>> >>>>> that we'll be able to declare transitive dependencies (Pandas, Py4J) in a
>> >>>>> way that pip can use. Contrast this with the current situation, where
>> >>>>> df.toPandas() exists in the Spark API but doesn't actually work until you
>> >>>>> install Pandas.
>> >>>>>
>> >>>>> Punya
>> >>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <justin.uang@> wrote:
>> >>>>>>
>> >>>>>> // + Davies for his comments
>> >>>>>> // + Punya for SA
>> >>>>>>
>> >>>>>> For development and CI, like Olivier mentioned, I think it would be
>> >>>>>> hugely beneficial to publish pyspark (only code in the python/ dir) on PyPI.
>> >>>>>> If anyone wants to develop against PySpark APIs, they need to download the
>> >>>>>> distribution and do a lot of PYTHONPATH munging for all the tools (pylint,
>> >>>>>> pytest, IDE code completion). Right now that involves adding python/ and
>> >>>>>> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more
>> >>>>>> dependencies, we would have to manually mirror all the PYTHONPATH munging in
>> >>>>>> the ./pyspark script. With a proper pyspark setup.py which declares its
>> >>>>>> dependencies, and a published distribution, depending on pyspark will just
>> >>>>>> be adding pyspark to my setup.py dependencies.
>> >>>>>>
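
Re "a proper pyspark setup.py which declares its dependencies": a rough
sketch of what that could look like (version numbers and extras below are
illustrative only, nothing here is decided):

    # setup.py -- illustrative sketch, not the actual packaging proposal
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.5.0",                     # would track the Spark release exactly
        packages=find_packages(),
        install_requires=["py4j==0.8.2.1"],  # hard runtime dependency
        extras_require={"sql": ["pandas"]},  # optional, e.g. for df.toPandas()
    )
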
>> >>>>>> Of course, if we actually want to run parts of pyspark that are backed by
>> >>>>>> Py4J calls, then we need the full spark distribution with either ./pyspark
>> >>>>>> or ./spark-submit, but for things like linting and development, the
>> >>>>>> PYTHONPATH munging is very annoying.
>> >>>>>>
>> >>>>>> I don't think the version-mismatch issues are a compelling reason to not
>> >>>>>> go ahead with PyPI publishing. At runtime, we should definitely enforce that
>> >>>>>> the version has to be exact, which means there is no backcompat nightmare as
>> >>>>>> suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267.
>> >>>>>> This would mean that even if the user got his pip installed pyspark to
>> >>>>>> somehow get loaded before the spark distribution provided pyspark, then the
>> >>>>>> user would be alerted immediately.
>> >>>>>>
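
Re enforcing the exact version at runtime: the check could be as simple as
comparing a version constant baked into the published package against what
the JVM side reports (a sketch only; the Python-side constant is hypothetical):

    # hypothetical guard run during SparkContext initialization
    PYSPARK_PACKAGE_VERSION = "1.5.0"  # stamped into the pip/conda artifact at release time

    def check_version(sc):
        jvm_version = sc.version  # version reported by the Java/Scala side
        if jvm_version != PYSPARK_PACKAGE_VERSION:
            raise RuntimeError("pyspark %s does not match Spark %s on the JVM side"
                               % (PYSPARK_PACKAGE_VERSION, jvm_version))
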
>> >>>>>> Davies, if you buy this, should me or someone on my team pick up
>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267 and
>> >>>>>> https://github.com/apache/spark/pull/464?
>> >>>>>>
>> >>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot <o.girardot@> wrote:
>> >>>>>>>
>> >>>>>>> Ok, I get it. Now what can we do to improve the current situation,
>> >>>>>>> because right now if I want to set up a CI env for PySpark, I have to:
>> >>>>>>> 1- download a pre-built version of pyspark and unzip it somewhere on
>> >>>>>>> every agent
>> >>>>>>> 2- define the SPARK_HOME env
>> >>>>>>> 3- symlink this distribution's pyspark dir inside the python install
>> >>>>>>> dir's site-packages/ directory
>> >>>>>>> and if I rely on additional packages (like databricks' Spark-CSV
>> >>>>>>> project), I have to (except if I'm mistaken)
>> >>>>>>> 4- compile/assemble spark-csv, deploy the jar in a specific directory
>> >>>>>>> on every agent
>> >>>>>>> 5- add this jar-filled directory to the Spark distribution's additional
>> >>>>>>> classpath using the conf/spark-defaults.conf file
>> >>>>>>>
>> >>>>>>> Then finally we can launch our unit/integration tests.
>> >>>>>>> Some issues are related to spark-packages, some to the lack of
>> >>>>>>> python-based dependencies, and some to the way SparkContexts are launched
>> >>>>>>> when using pyspark.
>> >>>>>>> I think steps 1 and 2 are fair enough.
>> >>>>>>> 4 and 5 may already have solutions, I didn't check, and considering
>> >>>>>>> spark-shell downloads such dependencies automatically, I think if
>> >>>>>>> nothing's done yet it will be (I guess?).
>> >>>>>>>
>> >>>>>>> For step 3, maybe just adding a setup.py to the distribution would be
>> >>>>>>> enough; I'm not exactly advocating distributing a full 300 MB spark
>> >>>>>>> distribution on PyPI, maybe there's a better compromise?
>> >>>>>>>
>> >>>>>>> Regards,
>> >>>>>>>
>> >>>>>>> Olivier.
>> >>>>>>>
>> >>>>>>> On Fri, Jun 5, 2015 at 22:12, Jey Kottalam <jey@.berkeley> wrote:
>> >>>>>>>>
>> >>>>>>>> Couldn't we have a pip installable "pyspark" package that just serves
>> >>>>>>>> as a shim to an existing Spark installation? Or it could even download the
>> >>>>>>>> latest Spark binary if SPARK_HOME isn't set during installation. Right now,
>> >>>>>>>> Spark doesn't play very well with the usual Python ecosystem. For example,
>> >>>>>>>> why do I need to use a strange incantation when booting up IPython if I want
>> >>>>>>>> to use PySpark in a notebook with MASTER="local[4]"? It would be much nicer
>> >>>>>>>> to just type `from pyspark import SparkContext; sc =
>> >>>>>>>> SparkContext("local[4]")` in my notebook.
>> >>>>>>>>
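
Re the shim: at import time it would amount to roughly this (a sketch,
essentially what findspark, mentioned above, automates; the py4j zip name
changes per release):

    # sketch of a pip-installable shim's __init__.py
    import os
    import sys

    spark_home = os.environ.get("SPARK_HOME")
    if not spark_home:
        raise ImportError("SPARK_HOME is not set; point it at a Spark distribution")

    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"))
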
>> >>>>>>>> I did a test and it seems like PySpark's basic unit-tests do pass when
>> >>>>>>>> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>> >>>>>>>>
>> >>>>>>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py
>> >>>>>>>>
>> >>>>>>>> -Jey
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenville@> wrote:
>> >>>>>>>>>
>> >>>>>>>>> This has been proposed before:
>> >>>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>> >>>>>>>>>
>> >>>>>>>>> There's currently tighter coupling between the Python and Java halves
>> >>>>>>>>> of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
>> >>>>>>>>> we'd run into tons of issues when users try to run a newer version of the
>> >>>>>>>>> Python half of PySpark against an older set of Java components or
>> >>>>>>>>> vice-versa.
>> >>>>>>>>>
>> >>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <o.girardot@> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi everyone,
>> >>>>>>>>>> Considering the python API as just a front needing the SPARK_HOME
>> >>>>>>>>>> defined anyway, I think it would be interesting to deploy the Python part of
>> >>>>>>>>>> Spark on PyPI in order to handle the dependencies in a Python project
>> >>>>>>>>>> needing PySpark via pip.
>> >>>>>>>>>>
>> >>>>>>>>>> For now I just symlink the python/pyspark in my python install dir
>> >>>>>>>>>> site-packages/ in order for PyCharm or other lint tools to work properly.
>> >>>>>>>>>> I can do the setup.py work or anything.
>> >>>>>>>>>>
>> >>>>>>>>>> What do you think?
>> >>>>>>>>>>
>> >>>>>>>>>> Regards,
>> >>>>>>>>>>
>> >>>>>>>>>> Olivier.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>
>> >>>
>> >
>> >
>> >
>> > --
>> > Brian E. Granger
>> > Cal Poly State University, San Luis Obispo
>> > @ellisonbg on Twitter and GitHub
>> > bgranger@ and ellisonbg@






---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
