I think so, any contributions on this are welcome.

On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <elliso...@gmail.com> wrote:
> Sorry, trying to follow the context here. Does it look like there is support for the idea of creating a setup.py file and PyPI package for pyspark?
>
> Cheers,
>
> Brian
>
> On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <dav...@databricks.com> wrote:
>> We could do that after 1.5 is released; it will have the same release cycle as Spark in the future.
>>
>> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>> +1 (once again :) )
>>>
>>> 2015-07-28 14:51 GMT+02:00 Justin Uang <justin.u...@gmail.com>:
>>>>
>>>> // ping
>>>>
>>>> Do we have any signoff from the pyspark devs to submit a PR to publish to PyPI?
>>>>
>>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman <freeman.jer...@gmail.com> wrote:
>>>>>
>>>>> Hey all, great discussion. Just wanted to +1 that I see a lot of value in steps that make it easier to use PySpark as an ordinary Python library.
>>>>>
>>>>> You might want to check out this project (https://github.com/minrk/findspark), started by Jupyter project devs, which offers one way to facilitate this stuff. I've also cc'ed them here to join the conversation.
>>>>>
>>>>> Also, @Jey, I can confirm that at least in some scenarios (I've done it in an EC2 cluster in standalone mode) it's possible to run PySpark jobs just using `from pyspark import SparkContext; sc = SparkContext(master="X")`, so long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly on *both* workers and driver. That said, there's definitely additional configuration / functionality that would require going through the proper submit scripts.
>>>>>
>>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <punya.bis...@gmail.com> wrote:
>>>>>
>>>>> I agree with everything Justin just said. An additional advantage of publishing PySpark's Python code in a standards-compliant way is the fact that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way that pip can use. Contrast this with the current situation, where df.toPandas() exists in the Spark API but doesn't actually work until you install Pandas.
>>>>>
>>>>> Punya
>>>>>
>>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <justin.u...@gmail.com> wrote:
>>>>>>
>>>>>> // + Davies for his comments
>>>>>> // + Punya for SA
>>>>>>
>>>>>> For development and CI, like Olivier mentioned, I think it would be hugely beneficial to publish pyspark (only the code in the python/ dir) on PyPI. If anyone wants to develop against PySpark APIs, they need to download the distribution and do a lot of PYTHONPATH munging for all the tools (pylint, pytest, IDE code completion). Right now that involves adding python/ and python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more dependencies, we would have to manually mirror all the PYTHONPATH munging in the ./pyspark script. With a proper pyspark setup.py which declares its dependencies, and a published distribution, depending on pyspark will just be adding pyspark to my setup.py dependencies.
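>>>>>>
>>>>>> As a rough illustration (not a worked-out proposal; the version pin, dependency list, and packaging layout below are just placeholders), such a setup.py could look something like:
>>>>>>
>>>>>>     # hypothetical setup.py for the python/ dir; names and versions are placeholders
>>>>>>     from setuptools import setup, find_packages
>>>>>>
>>>>>>     setup(
>>>>>>         name="pyspark",
>>>>>>         version="1.5.0",           # would have to track the Spark release exactly
>>>>>>         packages=find_packages(),
>>>>>>         install_requires=[
>>>>>>             "py4j==0.8.2.1",       # instead of shipping the -src.zip on PYTHONPATH
>>>>>>         ],
>>>>>>         extras_require={
>>>>>>             "pandas": ["pandas"],  # e.g. so df.toPandas() works out of the box
>>>>>>         },
>>>>>>     )
>>>>>>
>>>>>> A downstream project would then just put "pyspark" in its own install_requires and get Py4J (and optionally Pandas) resolved by pip, with no PYTHONPATH munging for linting or IDE completion.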
>>>>>>
>>>>>> Of course, if we actually want to run the parts of pyspark that are backed by Py4J calls, then we need the full Spark distribution with either ./pyspark or ./spark-submit, but for things like linting and development, the PYTHONPATH munging is very annoying.
>>>>>>
>>>>>> I don't think the version-mismatch issues are a compelling reason not to go ahead with PyPI publishing. At runtime, we should definitely enforce that the version has to be exact, which means there is no backcompat nightmare as suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267. This would mean that even if the user's pip-installed pyspark somehow got loaded before the pyspark provided by the Spark distribution, the user would be alerted immediately.
>>>>>>
>>>>>> Davies, if you buy this, should I or someone on my team pick up https://issues.apache.org/jira/browse/SPARK-1267 and https://github.com/apache/spark/pull/464?
>>>>>>
>>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>>>>>>
>>>>>>> Ok, I get it. Now what can we do to improve the current situation? Because right now, if I want to set up a CI env for PySpark, I have to:
>>>>>>> 1- download a pre-built version of pyspark and unzip it somewhere on every agent
>>>>>>> 2- define the SPARK_HOME env variable
>>>>>>> 3- symlink this distribution's pyspark dir inside the Python install dir's site-packages/ directory
>>>>>>> and if I rely on additional packages (like Databricks' spark-csv project), I have to (unless I'm mistaken):
>>>>>>> 4- compile/assemble spark-csv and deploy the jar in a specific directory on every agent
>>>>>>> 5- add this jar-filled directory to the Spark distribution's additional classpath using the conf/spark-defaults.conf file
>>>>>>>
>>>>>>> Then finally we can launch our unit/integration tests. Some issues are related to spark-packages, some to the lack of Python-based dependency handling, and some to the way SparkContexts are launched when using pyspark. I think steps 1 and 2 are fair enough. Steps 4 and 5 may already have solutions; I didn't check, and considering spark-shell downloads such dependencies automatically, I think if nothing's done yet, it will be (I guess?).
>>>>>>>
>>>>>>> For step 3, maybe just adding a setup.py to the distribution would be enough. I'm not exactly advocating distributing a full 300MB Spark distribution on PyPI; maybe there's a better compromise?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Olivier.
>>>>>>>
>>>>>>> On Fri, Jun 5, 2015 at 22:12, Jey Kottalam <j...@cs.berkeley.edu> wrote:
>>>>>>>>
>>>>>>>> Couldn't we have a pip-installable "pyspark" package that just serves as a shim to an existing Spark installation? Or it could even download the latest Spark binary if SPARK_HOME isn't set during installation. Right now, Spark doesn't play very well with the usual Python ecosystem. For example, why do I need to use a strange incantation when booting up IPython if I want to use PySpark in a notebook with MASTER="local[4]"? It would be much nicer to just type `from pyspark import SparkContext; sc = SparkContext("local[4]")` in my notebook.
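>>>>>>>>
>>>>>>>> To make the shim idea concrete, here is a rough sketch of what the path-setup logic might look like (illustrative only; the helper name and the exact layout of the Spark install are assumptions):
>>>>>>>>
>>>>>>>>     # sketch of a small path-setup helper; not an actual published package
>>>>>>>>     import glob
>>>>>>>>     import os
>>>>>>>>     import sys
>>>>>>>>
>>>>>>>>     def init(spark_home=None):
>>>>>>>>         """Put an existing Spark install's Python sources on sys.path."""
>>>>>>>>         spark_home = spark_home or os.environ.get("SPARK_HOME")
>>>>>>>>         if not spark_home:
>>>>>>>>             raise ValueError("SPARK_HOME is not set; point it at a Spark installation")
>>>>>>>>         sys.path.insert(0, os.path.join(spark_home, "python"))
>>>>>>>>         # the bundled Py4J ships as a versioned -src.zip under python/lib
>>>>>>>>         for py4j_zip in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
>>>>>>>>             sys.path.insert(0, py4j_zip)
>>>>>>>>
>>>>>>>> After calling something like init(), `from pyspark import SparkContext; sc = SparkContext("local[4]")` should work in a plain IPython session, at least for local mode; cluster modes would still need the worker-side environment set up consistently.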
>>>>>>>>
>>>>>>>> I did a test, and it seems like PySpark's basic unit tests do pass when SPARK_HOME is set and Py4J is on the PYTHONPATH:
>>>>>>>>
>>>>>>>>     PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py
>>>>>>>>
>>>>>>>> -Jey
>>>>>>>>
>>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> This has been proposed before: https://issues.apache.org/jira/browse/SPARK-1267
>>>>>>>>>
>>>>>>>>> There's currently tighter coupling between the Python and Java halves of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet we'd run into tons of issues when users try to run a newer version of the Python half of PySpark against an older set of Java components, or vice versa.
>>>>>>>>>
>>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>> Considering that the Python API is just a front-end needing SPARK_HOME defined anyway, I think it would be interesting to deploy the Python part of Spark on PyPI in order to handle the dependencies in a Python project needing PySpark via pip.
>>>>>>>>>>
>>>>>>>>>> For now I just symlink python/pyspark into my Python install dir's site-packages/ directory in order for PyCharm or other lint tools to work properly. I can do the setup.py work or anything.
>>>>>>>>>>
>>>>>>>>>> What do you think?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Olivier.
>
> --
> Brian E. Granger
> Cal Poly State University, San Luis Obispo
> @ellisonbg on Twitter and GitHub
> bgran...@calpoly.edu and elliso...@gmail.com