I would start with just the plain python package without the JAR and
then see if it makes sense to add the JAR over time.

On Thu, Aug 20, 2015 at 12:27 PM, Auberon Lopez <auberon.lo...@gmail.com> wrote:
> Hi all,
>
> I wanted to bubble up a conversation from the PR to this discussion to see
> if there is support for the idea of including a Spark assembly JAR in a PyPI
> release of pyspark. @holdenk recommended this as she already does so in the
> Sparkling Pandas package. Is this something people are interested in
> pursuing?
>
> -Auberon
>
> On Thu, Aug 20, 2015 at 10:03 AM, Brian Granger <elliso...@gmail.com> wrote:
>>
>> Auberon, can you also post this to the Jupyter Google Group?
>>
>> On Wed, Aug 19, 2015 at 12:23 PM, Auberon Lopez <auberon.lo...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > I've created an updated PR for this based off of the previous work of
>> > @prabinb:
>> > https://github.com/apache/spark/pull/8318
>> >
>> > I am not very familiar with python packaging; feedback is appreciated.
>> >
>> > -Auberon
>> >
>> > On Mon, Aug 10, 2015 at 12:45 PM, MinRK <benjami...@gmail.com> wrote:
>> >>
>> >>
>> >> On Mon, Aug 10, 2015 at 12:28 PM, Matt Goodman <meawo...@gmail.com>
>> >> wrote:
>> >>>
>> >>> I would tentatively suggest also conda packaging.
>> >>
>> >>
>> >> A conda package has the advantage that it can be set up without
>> >> 'installing' the pyspark files, while the PyPI packaging is still being
>> >> worked out. It can just add a pyspark.pth file pointing to the pyspark
>> >> and py4j locations. But I think it's a really good idea to package with
>> >> conda.
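>> >>
>> >> A minimal sketch of what such a conda post-link step could do, assuming
>> >> SPARK_HOME is set and the bundled py4j version is 0.8.2.1 (paths and
>> >> names here are illustrative only, not an actual recipe):
>> >>
>> >>     # write a pyspark.pth into site-packages so pyspark and py4j become
>> >>     # importable without copying files out of the Spark distribution
>> >>     import os, site
>> >>
>> >>     spark_home = os.environ["SPARK_HOME"]
>> >>     pth = os.path.join(site.getsitepackages()[0], "pyspark.pth")
>> >>     with open(pth, "w") as f:
>> >>         f.write(os.path.join(spark_home, "python") + "\n")
>> >>         f.write(os.path.join(spark_home, "python", "lib",
>> >>                              "py4j-0.8.2.1-src.zip") + "\n")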
>> >>
>> >> -MinRK
>> >>
>> >>>
>> >>>
>> >>> http://conda.pydata.org/docs/
>> >>>
>> >>> --Matthew Goodman
>> >>>
>> >>> =====================
>> >>> Check Out My Website: http://craneium.net
>> >>> Find me on LinkedIn: http://tinyurl.com/d6wlch
>> >>>
>> >>> On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <dav...@databricks.com>
>> >>> wrote:
>> >>>>
>> >>>> I think so; any contributions on this are welcome.
>> >>>>
>> >>>> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <elliso...@gmail.com>
>> >>>> wrote:
>> >>>> > Sorry, trying to follow the context here. Does it look like there
>> >>>> > is
>> >>>> > support for the idea of creating a setup.py file and pypi package
>> >>>> > for
>> >>>> > pyspark?
>> >>>> >
>> >>>> > Cheers,
>> >>>> >
>> >>>> > Brian
>> >>>> >
>> >>>> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <dav...@databricks.com>
>> >>>> > wrote:
>> >>>> >> We could do that after 1.5 is released; it will have the same
>> >>>> >> release cycle as Spark in the future.
>> >>>> >>
>> >>>> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
>> >>>> >> <o.girar...@lateral-thoughts.com> wrote:
>> >>>> >>> +1 (once again :) )
>> >>>> >>>
>> >>>> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang <justin.u...@gmail.com>:
>> >>>> >>>>
>> >>>> >>>> // ping
>> >>>> >>>>
>> >>>> >>>> do we have any signoff from the pyspark devs to submit a PR to
>> >>>> >>>> publish to
>> >>>> >>>> PyPI?
>> >>>> >>>>
>> >>>> >>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman
>> >>>> >>>> <freeman.jer...@gmail.com>
>> >>>> >>>> wrote:
>> >>>> >>>>>
>> >>>> >>>>> Hey all, great discussion, just wanted to +1 that I see a lot
>> >>>> >>>>> of
>> >>>> >>>>> value in
>> >>>> >>>>> steps that make it easier to use PySpark as an ordinary python
>> >>>> >>>>> library.
>> >>>> >>>>>
>> >>>> >>>>> You might want to check out findspark
>> >>>> >>>>> (https://github.com/minrk/findspark), started by Jupyter project
>> >>>> >>>>> devs, which offers one way to facilitate this stuff. I've also
>> >>>> >>>>> cced them here to join the conversation.
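>> >>>> >>>>>
>> >>>> >>>>> For reference, basic findspark usage looks roughly like this (it
>> >>>> >>>>> locates the Spark install and puts pyspark and py4j on sys.path
>> >>>> >>>>> so a plain interpreter or notebook can import them):
>> >>>> >>>>>
>> >>>> >>>>>     import findspark
>> >>>> >>>>>     findspark.init()  # or findspark.init("/path/to/spark")
>> >>>> >>>>>
>> >>>> >>>>>     from pyspark import SparkContext
>> >>>> >>>>>     sc = SparkContext(master="local[4]", appName="findspark-demo")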
>> >>>> >>>>>
>> >>>> >>>>> Also, @Jey, I can also confirm that at least in some scenarios
>> >>>> >>>>> (I’ve done
>> >>>> >>>>> it in an EC2 cluster in standalone mode) it’s possible to run
>> >>>> >>>>> PySpark jobs
>> >>>> >>>>> just using `from pyspark import SparkContext; sc =
>> >>>> >>>>> SparkContext(master=“X”)`
>> >>>> >>>>> so long as the environment variables (PYTHONPATH and
>> >>>> >>>>> PYSPARK_PYTHON) are
>> >>>> >>>>> set correctly on *both* workers and driver. That said, there’s
>> >>>> >>>>> definitely
>> >>>> >>>>> additional configuration / functionality that would require
>> >>>> >>>>> going
>> >>>> >>>>> through
>> >>>> >>>>> the proper submit scripts.
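>> >>>> >>>>>
>> >>>> >>>>> Roughly, that setup looks like the following (the host and paths
>> >>>> >>>>> are hypothetical; the same env vars have to point at the same
>> >>>> >>>>> Spark build on the workers):
>> >>>> >>>>>
>> >>>> >>>>>     # export PYTHONPATH=/opt/spark/python:/opt/spark/python/lib/py4j-0.8.2.1-src.zip
>> >>>> >>>>>     # export PYSPARK_PYTHON=/usr/bin/python
>> >>>> >>>>>     from pyspark import SparkContext
>> >>>> >>>>>
>> >>>> >>>>>     sc = SparkContext(master="spark://master-host:7077")
>> >>>> >>>>>     print(sc.parallelize(range(1000)).sum())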
>> >>>> >>>>>
>> >>>> >>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal
>> >>>> >>>>> <punya.bis...@gmail.com>
>> >>>> >>>>> wrote:
>> >>>> >>>>>
>> >>>> >>>>> I agree with everything Justin just said. An additional
>> >>>> >>>>> advantage
>> >>>> >>>>> of
>> >>>> >>>>> publishing PySpark's Python code in a standards-compliant way
>> >>>> >>>>> is
>> >>>> >>>>> the fact
>> >>>> >>>>> that we'll be able to declare transitive dependencies (Pandas,
>> >>>> >>>>> Py4J) in a
>> >>>> >>>>> way that pip can use. Contrast this with the current situation,
>> >>>> >>>>> where
>> >>>> >>>>> df.toPandas() exists in the Spark API but doesn't actually work
>> >>>> >>>>> until you
>> >>>> >>>>> install Pandas.
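>> >>>> >>>>>
>> >>>> >>>>> For instance (illustrative only, not a settled scheme), the
>> >>>> >>>>> published package's setup() could declare
>> >>>> >>>>>
>> >>>> >>>>>     install_requires=["py4j==0.8.2.1"],
>> >>>> >>>>>     extras_require={"pandas": ["pandas"]},
>> >>>> >>>>>
>> >>>> >>>>> so that `pip install pyspark[pandas]` pulls in Pandas and
>> >>>> >>>>> df.toPandas() works out of the box.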
>> >>>> >>>>>
>> >>>> >>>>> Punya
>> >>>> >>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang
>> >>>> >>>>> <justin.u...@gmail.com>
>> >>>> >>>>> wrote:
>> >>>> >>>>>>
>> >>>> >>>>>> // + Davies for his comments
>> >>>> >>>>>> // + Punya for SA
>> >>>> >>>>>>
>> >>>> >>>>>> For development and CI, like Olivier mentioned, I think it
>> >>>> >>>>>> would
>> >>>> >>>>>> be
>> >>>> >>>>>> hugely beneficial to publish pyspark (only code in the python/
>> >>>> >>>>>> dir) on PyPI.
>> >>>> >>>>>> If anyone wants to develop against PySpark APIs, they need to
>> >>>> >>>>>> download the
>> >>>> >>>>>> distribution and do a lot of PYTHONPATH munging for all the
>> >>>> >>>>>> tools
>> >>>> >>>>>> (pylint,
>> >>>> >>>>>> pytest, IDE code completion). Right now that involves adding
>> >>>> >>>>>> python/ and
>> >>>> >>>>>> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to
>> >>>> >>>>>> add more
>> >>>> >>>>>> dependencies, we would have to manually mirror all the
>> >>>> >>>>>> PYTHONPATH
>> >>>> >>>>>> munging in
>> >>>> >>>>>> the ./pyspark script. With a proper pyspark setup.py which
>> >>>> >>>>>> declares its
>> >>>> >>>>>> dependencies, and a published distribution, depending on
>> >>>> >>>>>> pyspark
>> >>>> >>>>>> will just
>> >>>> >>>>>> be adding pyspark to my setup.py dependencies.
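>> >>>> >>>>>>
>> >>>> >>>>>> As a rough sketch (the version pin and py4j version here are
>> >>>> >>>>>> illustrative, not what Spark would actually publish), the
>> >>>> >>>>>> setup.py for the python/ directory could be as small as:
>> >>>> >>>>>>
>> >>>> >>>>>>     from setuptools import setup, find_packages
>> >>>> >>>>>>
>> >>>> >>>>>>     setup(
>> >>>> >>>>>>         name="pyspark",
>> >>>> >>>>>>         version="1.5.0",
>> >>>> >>>>>>         packages=find_packages(),
>> >>>> >>>>>>         install_requires=["py4j==0.8.2.1"],
>> >>>> >>>>>>     )
>> >>>> >>>>>>
>> >>>> >>>>>> and a downstream project would then just list "pyspark" in its
>> >>>> >>>>>> own install_requires instead of munging PYTHONPATH.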
>> >>>> >>>>>>
>> >>>> >>>>>> Of course, if we actually want to run parts of pyspark that are
>> >>>> >>>>>> backed by
>> >>>> >>>>>> Py4J calls, then we need the full spark distribution with
>> >>>> >>>>>> either
>> >>>> >>>>>> ./pyspark
>> >>>> >>>>>> or ./spark-submit, but for things like linting and
>> >>>> >>>>>> development,
>> >>>> >>>>>> the
>> >>>> >>>>>> PYTHONPATH munging is very annoying.
>> >>>> >>>>>>
>> >>>> >>>>>> I don't think the version-mismatch issues are a compelling
>> >>>> >>>>>> reason
>> >>>> >>>>>> to not
>> >>>> >>>>>> go ahead with PyPI publishing. At runtime, we should
>> >>>> >>>>>> definitely
>> >>>> >>>>>> enforce that
>> >>>> >>>>>> the version has to be exact, which means there is no
>> >>>> >>>>>> backcompat
>> >>>> >>>>>> nightmare as
>> >>>> >>>>>> suggested by Davies in
>> >>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267.
>> >>>> >>>>>> This would mean that even if the user's pip-installed pyspark
>> >>>> >>>>>> somehow got loaded before the pyspark provided by the Spark
>> >>>> >>>>>> distribution, the user would be alerted immediately.
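>> >>>> >>>>>>
>> >>>> >>>>>> Something like the following (hypothetical, and assuming the
>> >>>> >>>>>> published package exposes pyspark.__version__) would be enough
>> >>>> >>>>>> to fail fast on a mismatch once a SparkContext exists:
>> >>>> >>>>>>
>> >>>> >>>>>>     import pyspark
>> >>>> >>>>>>
>> >>>> >>>>>>     def check_versions(sc):
>> >>>> >>>>>>         # sc.version is reported by the JVM half of Spark
>> >>>> >>>>>>         if sc.version != pyspark.__version__:
>> >>>> >>>>>>             raise RuntimeError(
>> >>>> >>>>>>                 "pip-installed pyspark %s does not match Spark %s"
>> >>>> >>>>>>                 % (pyspark.__version__, sc.version))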
>> >>>> >>>>>>
>> >>>> >>>>>> Davies, if you buy this, should I or someone on my team pick up
>> >>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267 and
>> >>>> >>>>>> https://github.com/apache/spark/pull/464?
>> >>>> >>>>>>
>> >>>> >>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot
>> >>>> >>>>>> <o.girar...@lateral-thoughts.com> wrote:
>> >>>> >>>>>>>
>> >>>> >>>>>>> Ok, I get it. Now what can we do to improve the current
>> >>>> >>>>>>> situation? Because right now, if I want to set up a CI env for
>> >>>> >>>>>>> PySpark, I have to:
>> >>>> >>>>>>> 1- download a pre-built version of pyspark and unzip it
>> >>>> >>>>>>> somewhere on every agent
>> >>>> >>>>>>> 2- define the SPARK_HOME env variable
>> >>>> >>>>>>> 3- symlink this distribution's pyspark dir inside the Python
>> >>>> >>>>>>> install's site-packages/ directory
>> >>>> >>>>>>> and if I rely on additional packages (like databricks'
>> >>>> >>>>>>> Spark-CSV project), I have to (unless I'm mistaken)
>> >>>> >>>>>>> 4- compile/assemble spark-csv and deploy the jar in a specific
>> >>>> >>>>>>> directory on every agent
>> >>>> >>>>>>> 5- add this jar-filled directory to the Spark distribution's
>> >>>> >>>>>>> additional classpath using the conf/spark-defaults.conf file
>> >>>> >>>>>>>
>> >>>> >>>>>>> Then finally we can launch our unit/integration-tests.
>> >>>> >>>>>>> Some issues are related to spark-packages, some to the lack of
>> >>>> >>>>>>> a python-based dependency, and some to the way SparkContexts
>> >>>> >>>>>>> are launched when using pyspark.
>> >>>> >>>>>>> I think steps 1 and 2 are fair enough.
>> >>>> >>>>>>> Steps 4 and 5 may already have solutions; I didn't check, and
>> >>>> >>>>>>> considering spark-shell downloads such dependencies
>> >>>> >>>>>>> automatically, I guess it will be handled if it isn't already.
>> >>>> >>>>>>>
>> >>>> >>>>>>> For step 3, maybe just adding a setup.py to the distribution
>> >>>> >>>>>>> would be enough. I'm not exactly advocating distributing a full
>> >>>> >>>>>>> 300 MB Spark distribution on PyPI; maybe there's a better
>> >>>> >>>>>>> compromise?
>> >>>> >>>>>>>
>> >>>> >>>>>>> Regards,
>> >>>> >>>>>>>
>> >>>> >>>>>>> Olivier.
>> >>>> >>>>>>>
>> >>>> >>>>>>> On Fri, Jun 5, 2015 at 22:12, Jey Kottalam
>> >>>> >>>>>>> <j...@cs.berkeley.edu> wrote:
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> Couldn't we have a pip installable "pyspark" package that
>> >>>> >>>>>>>> just
>> >>>> >>>>>>>> serves
>> >>>> >>>>>>>> as a shim to an existing Spark installation? Or it could
>> >>>> >>>>>>>> even
>> >>>> >>>>>>>> download the
>> >>>> >>>>>>>> latest Spark binary if SPARK_HOME isn't set during
>> >>>> >>>>>>>> installation. Right now,
>> >>>> >>>>>>>> Spark doesn't play very well with the usual Python
>> >>>> >>>>>>>> ecosystem.
>> >>>> >>>>>>>> For example,
>> >>>> >>>>>>>> why do I need to use a strange incantation when booting up
>> >>>> >>>>>>>> IPython if I want
>> >>>> >>>>>>>> to use PySpark in a notebook with MASTER="local[4]"? It
>> >>>> >>>>>>>> would
>> >>>> >>>>>>>> be much nicer
>> >>>> >>>>>>>> to just type `from pyspark import SparkContext; sc =
>> >>>> >>>>>>>> SparkContext("local[4]")` in my notebook.
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> I did a test and it seems like PySpark's basic unit-tests do
>> >>>> >>>>>>>> pass when
>> >>>> >>>>>>>> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
>> >>>> >>>>>>>> python $SPARK_HOME/python/pyspark/rdd.py
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> -Jey
>> >>>> >>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen
>> >>>> >>>>>>>> <rosenvi...@gmail.com>
>> >>>> >>>>>>>> wrote:
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>> This has been proposed before:
>> >>>> >>>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>> There's currently tighter coupling between the Python and
>> >>>> >>>>>>>>> Java
>> >>>> >>>>>>>>> halves
>> >>>> >>>>>>>>> of PySpark than just requiring SPARK_HOME to be set; if we
>> >>>> >>>>>>>>> did
>> >>>> >>>>>>>>> this, I bet
>> >>>> >>>>>>>>> we'd run into tons of issues when users try to run a newer
>> >>>> >>>>>>>>> version of the
>> >>>> >>>>>>>>> Python half of PySpark against an older set of Java
>> >>>> >>>>>>>>> components
>> >>>> >>>>>>>>> or
>> >>>> >>>>>>>>> vice-versa.
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot
>> >>>> >>>>>>>>> <o.girar...@lateral-thoughts.com> wrote:
>> >>>> >>>>>>>>>>
>> >>>> >>>>>>>>>> Hi everyone,
>> >>>> >>>>>>>>>> Considering the Python API is just a frontend that needs
>> >>>> >>>>>>>>>> SPARK_HOME defined anyway, I think it would be interesting
>> >>>> >>>>>>>>>> to deploy the Python part of Spark on PyPI, so that a Python
>> >>>> >>>>>>>>>> project needing PySpark could handle the dependency via pip.
>> >>>> >>>>>>>>>>
>> >>>> >>>>>>>>>> For now I just symlink python/pyspark into my Python
>> >>>> >>>>>>>>>> install's site-packages/ directory so that PyCharm and other
>> >>>> >>>>>>>>>> lint tools work properly.
>> >>>> >>>>>>>>>> I can do the setup.py work or anything else that's needed.
>> >>>> >>>>>>>>>>
>> >>>> >>>>>>>>>> What do you think ?
>> >>>> >>>>>>>>>>
>> >>>> >>>>>>>>>> Regards,
>> >>>> >>>>>>>>>>
>> >>>> >>>>>>>>>> Olivier.
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>>
>> >>>> >>>>>>>>
>> >>>> >>>>>
>> >>>> >>>
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > --
>> >>>> > Brian E. Granger
>> >>>> > Cal Poly State University, San Luis Obispo
>> >>>> > @ellisonbg on Twitter and GitHub
>> >>>> > bgran...@calpoly.edu and elliso...@gmail.com
>> >>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>>
>>
>> --
>> Brian E. Granger
>> Associate Professor of Physics and Data Science
>> Cal Poly State University, San Luis Obispo
>> @ellisonbg on Twitter and GitHub
>> bgran...@calpoly.edu and elliso...@gmail.com
>
>



-- 
Brian E. Granger
Associate Professor of Physics and Data Science
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub
bgran...@calpoly.edu and elliso...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
