Re: PySpark on PyPi

2015-08-20 Thread Brian Granger
Auberon, can you also post this to the Jupyter Google Group?

On Wed, Aug 19, 2015 at 12:23 PM, Auberon Lopez auberon.lo...@gmail.com wrote:
 Hi all,

 I've created an updated PR for this based off of the previous work of
 @prabinb:
 https://github.com/apache/spark/pull/8318

 I am not very familiar with python packaging; feedback is appreciated.

 -Auberon

 On Mon, Aug 10, 2015 at 12:45 PM, MinRK benjami...@gmail.com wrote:


 On Mon, Aug 10, 2015 at 12:28 PM, Matt Goodman meawo...@gmail.com wrote:

 I would tentatively suggest also conda packaging.


 A conda package has the advantage that it can be set up without
 'installing' the pyspark files, while the PyPI packaging is still being
 worked out. It can just add a pyspark.pth file pointing to pyspark, py4j
 locations. But I think it's a really good idea to package with conda.

 -MinRK
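
A conda recipe along these lines could, for example, write that .pth file at install time. The sketch below is purely illustrative (the py4j zip name and the paths depend on the Spark release, and the recipe hook that would run it is not specified here):

    # Illustrative only: point a pyspark.pth at an existing Spark install.
    import os
    import sysconfig

    spark_home = os.environ["SPARK_HOME"]
    entries = [
        os.path.join(spark_home, "python"),
        os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"),
    ]
    site_packages = sysconfig.get_paths()["purelib"]
    with open(os.path.join(site_packages, "pyspark.pth"), "w") as f:
        f.write("\n".join(entries) + "\n")  # each line is appended to sys.path at startup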



 http://conda.pydata.org/docs/


Re: PySpark on PyPi

2015-08-20 Thread westurner
On Aug 20, 2015 4:57 PM, Justin Uang wrote:

 One other question: Do we have consensus on publishing the
pip-installable source distribution to PyPI? If so, is that something that
the maintainers need to add to the process that they use to publish
releases?

A setup.py, .travis.yml, tox.ini (e.g. cookiecutter)?
https://github.com/audreyr/cookiecutter-pypackage

https://wrdrd.com/docs/tools/#python-packages

* scripts=[]
* package_data / MANIFEST.in
* entry_points
   * console_scripts
   * https://pythonhosted.org/setuptools/setuptools.html#eggsecutable-scripts


... https://wrdrd.com/docs/consulting/knowledge-engineering#spark
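
As a rough illustration of that checklist, a minimal setup.py for the python/ directory might look like the sketch below; the version, the py4j pin and the console_scripts entry point are placeholders, not anything decided in this thread:

    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.5.0",                      # placeholder; would track Spark releases
        description="Python bindings for Apache Spark",
        packages=find_packages(exclude=["*.tests", "*.tests.*"]),
        # MANIFEST.in / package_data would ship the bundled py4j zip, if it were
        # kept inside the package rather than in python/lib
        package_data={"pyspark": ["lib/*.zip"]},
        install_requires=["py4j==0.8.2.1"],   # matches the zip currently in python/lib
        entry_points={
            "console_scripts": [
                # hypothetical wrapper; it would still have to locate the
                # JVM-side distribution, e.g. via SPARK_HOME
                "pyspark-shell = pyspark._launcher:main",
            ],
        },
    )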



Re: PySpark on PyPi

2015-08-20 Thread Brian Granger
I would start with just the plain python package without the JAR and
then see if it makes sense to add the JAR over time.

On Thu, Aug 20, 2015 at 12:27 PM, Auberon Lopez auberon.lo...@gmail.com wrote:
 Hi all,

 I wanted to bubble up a conversation from the PR to this discussion to see
 if there is support for the idea of including a Spark assembly JAR in a PyPI
 release of pyspark. @holdenk recommended this as she already does so in the
 Sparkling Pandas package. Is this something people are interested in
 pursuing?

 -Auberon

  

Re: PySpark on PyPi

2015-08-20 Thread Justin Uang
I would prefer to just do it without the jar first as well. My hunch is
that to run spark the way it is intended, we need the wrapper scripts, like
spark-submit. Does anyone know authoritatively if that is the case?

On Thu, Aug 20, 2015 at 4:54 PM Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 +1
 But just to improve the error logging,
 would it be possible to add some warn logging in pyspark when the
 SPARK_HOME env variable is pointing to a Spark distribution with a
 different version from the pyspark package?

 Regards,

 Olivier.
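
A rough sketch of such a warning, assuming the binary distribution keeps shipping a RELEASE file and that the packaged pyspark knows its own version (both are assumptions here, not current behaviour):

    import os
    import re
    import warnings

    PYSPARK_VERSION = "1.5.0"  # stand-in for whatever the pip package would report

    def _distribution_version(spark_home):
        # Best-effort read of the version recorded in $SPARK_HOME/RELEASE.
        release = os.path.join(spark_home, "RELEASE")
        if not os.path.isfile(release):
            return None
        match = re.search(r"Spark\s+([0-9][^\s]*)", open(release).read())
        return match.group(1) if match else None

    spark_home = os.environ.get("SPARK_HOME")
    if spark_home:
        dist_version = _distribution_version(spark_home)
        if dist_version and dist_version != PYSPARK_VERSION:
            warnings.warn("SPARK_HOME=%s contains Spark %s, but the installed "
                          "pyspark package is %s"
                          % (spark_home, dist_version, PYSPARK_VERSION))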


Re: PySpark on PyPi

2015-08-20 Thread Justin Uang
One other question: Do we have consensus on publishing the pip-installable
source distribution to PyPI? If so, is that something that the maintainers
need to add to the process that they use to publish releases?


Re: PySpark on PyPi

2015-08-12 Thread quasiben
I've helped build conda-installable Spark packages in the past. You can find
an older recipe here:
https://github.com/conda/conda-recipes/tree/master/spark

And I've been updating packages here: 
https://anaconda.org/anaconda-cluster/spark

`conda install -c anaconda-cluster spark` 

The above should work for OSX/Linux-64 and py27/py34 
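
A quick sanity check after such an install (illustrative; the version attribute is not guaranteed to exist in every build):

    import pyspark
    print(pyspark.__file__)                            # which installation got picked up
    print(getattr(pyspark, "__version__", "unknown"))  # may not be present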

--Ben 







Re: PySpark on PyPi

2015-08-11 Thread westurner
Matt Goodman wrote
 I would tentatively suggest also conda packaging.
 
 http://conda.pydata.org/docs/

$ conda skeleton pypi pyspark
# update git_tag and git_uri
# add test commands (import pyspark; import pyspark.[...])

Docs for building conda packages for multiple operating systems and
interpreters from PyPi packages:

*
http://www.pydanny.com/building-conda-packages-for-multiple-operating-systems.html
* https://github.com/audreyr/cookiecutter/issues/232



Re: PySpark on PyPi

2015-08-10 Thread Davies Liu
I think so, any contributions on this are welcome.

On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger elliso...@gmail.com wrote:
 Sorry, trying to follow the context here. Does it look like there is
 support for the idea of creating a setup.py file and pypi package for
 pyspark?

 Cheers,

 Brian

 

Re: PySpark on PyPi

2015-08-10 Thread Matt Goodman
I would tentatively suggest also conda packaging.

http://conda.pydata.org/docs/

--Matthew Goodman

=
Check Out My Website: http://craneium.net
Find me on LinkedIn: http://tinyurl.com/d6wlch


Re: PySpark on PyPi

2015-08-06 Thread Davies Liu
We could do that after 1.5 is released; it will have the same release cycle
as Spark in the future.

On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
 +1 (once again :) )

 2015-07-28 14:51 GMT+02:00 Justin Uang justin.u...@gmail.com:

 // ping

 do we have any signoff from the pyspark devs to submit a PR to publish to
 PyPI?


Re: PySpark on PyPi

2015-07-28 Thread Justin Uang
// ping

do we have any signoff from the pyspark devs to submit a PR to publish to
PyPI?


Re: PySpark on PyPi

2015-07-24 Thread Jeremy Freeman
Hey all, great discussion, just wanted to +1 that I see a lot of value in steps 
that make it easier to use PySpark as an ordinary python library.

You might want to check out findspark (https://github.com/minrk/findspark),
started by Jupyter project devs, which offers one way to facilitate this stuff.
I’ve also cc'ed them here to join the conversation.
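
For readers unfamiliar with it, findspark usage is roughly the following (a sketch assuming SPARK_HOME is set or Spark sits somewhere findspark knows to probe):

    import findspark
    findspark.init()  # adds SPARK_HOME/python and the bundled py4j zip to sys.path

    from pyspark import SparkContext
    sc = SparkContext(master="local[4]", appName="findspark-demo")
    print(sc.parallelize(range(10)).sum())
    sc.stop()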

Also, @Jey, I can also confirm that at least in some scenarios (I’ve done it in 
an EC2 cluster in standalone mode) it’s possible to run PySpark jobs just using 
`from pyspark import SparkContext; sc = SparkContext(master=“X”)` so long as 
the environmental variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly 
on *both* workers and driver. That said, there’s definitely additional 
configuration / functionality that would require going through the proper 
submit scripts.

 On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal punya.bis...@gmail.com 
 wrote:
 
 I agree with everything Justin just said. An additional advantage of 
 publishing PySpark's Python code in a standards-compliant way is the fact 
 that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way 
 that pip can use. Contrast this with the current situation, where 
 df.toPandas() exists in the Spark API but doesn't actually work until you 
 install Pandas.
 
 Punya

Re: PySpark on PyPi

2015-07-22 Thread Justin Uang
// + *Davies* for his comments
// + Punya for SA

For development and CI, like Olivier mentioned, I think it would be hugely
beneficial to publish pyspark (only code in the python/ dir) on PyPI. If
anyone wants to develop against PySpark APIs, they need to download the
distribution and do a lot of PYTHONPATH munging for all the tools (pylint,
pytest, IDE code completion). Right now that involves adding python/ and
python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more
dependencies, we would have to manually mirror all the PYTHONPATH munging
in the ./pyspark script. With a proper pyspark setup.py which declares its
dependencies, and a published distribution, depending on pyspark will just
be adding pyspark to my setup.py dependencies.
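
To make that last point concrete, a downstream project's setup.py would then only need something like this (project name and version pin are hypothetical):

    from setuptools import setup

    setup(
        name="my-spark-job",        # hypothetical downstream project
        version="0.1.0",
        py_modules=["my_job"],
        install_requires=[
            "pyspark==1.5.0",       # pinned to match the cluster's Spark version
        ],
    )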

Of course, if we actually want to run the parts of pyspark that are backed by
Py4J calls, then we need the full spark distribution with either ./pyspark
or ./spark-submit, but for things like linting and development, the
PYTHONPATH munging is very annoying.

I don't think the version-mismatch issues are a compelling reason to not go
ahead with PyPI publishing. At runtime, we should definitely enforce that
the version has to be exact, which means there is no backcompat nightmare
as suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267.
This would mean that even if the user's pip-installed pyspark somehow got
loaded before the pyspark provided by the Spark distribution, the user would
be alerted immediately.
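
To make that concrete, here is a minimal sketch of the kind of runtime guard I
have in mind (assuming the packaged pyspark exposes a __version__ attribute,
which a proper setup.py-based release would provide; sc.version reports the
JVM-side Spark version):

import pyspark

def check_pyspark_version(sc):
    # Compare the pip-installed Python package against the running JVM side.
    jvm_version = sc.version
    py_version = getattr(pyspark, "__version__", None)
    if py_version is not None and py_version != jvm_version:
        raise RuntimeError(
            "pyspark %s does not match Spark %s on the JVM; make sure the "
            "pip-installed pyspark matches SPARK_HOME"
            % (py_version, jvm_version))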

Davies, if you buy this, should I or someone on my team pick up
https://issues.apache.org/jira/browse/SPARK-1267 and
https://github.com/apache/spark/pull/464?

On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 Ok, I get it. Now what can we do to improve the current situation? Because
 right now, if I want to set up a CI env for PySpark, I have to:
 1- download a pre-built version of pyspark and unzip it somewhere on every
 agent
 2- define the SPARK_HOME env variable
 3- symlink this distribution's pyspark dir into the Python install's
 site-packages/ directory
 and if I rely on additional packages (like databricks' Spark-CSV project),
 I also have to (unless I'm mistaken):
 4- compile/assemble spark-csv and deploy the jar in a specific directory on
 every agent
 5- add this jar-filled directory to the Spark distribution's additional
 classpath using the conf/spark-defaults.conf file

 Then finally we can launch our unit/integration tests.
 Some issues are related to spark-packages, some to the lack of Python-based
 dependency management, and some to the way SparkContexts are launched when
 using pyspark.
 I think steps 1 and 2 are fair enough.
 Steps 4 and 5 may already have solutions; I didn't check, but considering
 that spark-shell downloads such dependencies automatically, I suspect
 pyspark will too if it doesn't already.

 For step 3, maybe just adding a setup.py to the distribution would be
 enough; I'm not exactly advocating distributing a full 300 MB Spark
 distribution on PyPI, so maybe there's a better compromise?

 Regards,

 Olivier.

 On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam j...@cs.berkeley.edu wrote:

 Couldn't we have a pip installable pyspark package that just serves as
 a shim to an existing Spark installation? Or it could even download the
 latest Spark binary if SPARK_HOME isn't set during installation. Right now,
 Spark doesn't play very well with the usual Python ecosystem. For example,
 why do I need to use a strange incantation when booting up IPython if I
 want to use PySpark in a notebook with MASTER=local[4]? It would be much
 nicer to just type `from pyspark import SparkContext; sc =
 SparkContext("local[4]")` in my notebook.

 I did a test and it seems like PySpark's basic unit-tests do pass when
 SPARK_HOME is set and Py4J is on the PYTHONPATH:


 PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
 python $SPARK_HOME/python/pyspark/rdd.py

 -Jey


 On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen rosenvi...@gmail.com wrote:

 This has been proposed before:
 https://issues.apache.org/jira/browse/SPARK-1267

 There's currently tighter coupling between the Python and Java halves of
 PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
 we'd run into tons of issues when users try to run a newer version of the
 Python half of PySpark against an older set of Java components or
 vice-versa.

 On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi everyone,
 Considering that the Python API is just a front-end needing SPARK_HOME
 defined anyway, I think it would be interesting to deploy the Python part
 of Spark on PyPI in order to handle the dependencies of a Python project
 needing PySpark via pip.

 For now I just symlink python/pyspark into my Python install's
 site-packages/ directory so that PyCharm or other lint tools work properly.
 I can do the setup.py work or anything else needed.

 What do you think ?

 Regards,

 

Re: PySpark on PyPi

2015-07-22 Thread Punyashloka Biswal
I agree with everything Justin just said. An additional advantage of
publishing PySpark's Python code in a standards-compliant way is the fact
that we'll be able to declare transitive dependencies (Pandas, Py4J) in a
way that pip can use. Contrast this with the current situation, where
df.toPandas() exists in the Spark API but doesn't actually work until you
install Pandas.
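
As a sketch of what that could look like (the version numbers and extras layout
here are assumptions, not anything the Spark project has published), pyspark's
own setup.py could declare Py4J as a hard dependency and Pandas as an optional
extra:

from setuptools import setup, find_packages

# Sketch only: pins and extras are illustrative.
setup(
    name="pyspark",
    version="1.4.1",
    packages=find_packages(),
    install_requires=[
        "py4j==0.8.2.1",  # JVM bridge required by pyspark
    ],
    extras_require={
        "pandas": ["pandas>=0.13"],  # needed for df.toPandas()
    },
)

With something like that, `pip install pyspark[pandas]` would pull Pandas in
automatically instead of failing at df.toPandas() time.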

Punya
On Wed, Jul 22, 2015 at 12:49 PM Justin Uang justin.u...@gmail.com wrote:

 // + Davies for his comments
 // + Punya for SA

 For development and CI, like Olivier mentioned, I think it would be hugely
 beneficial to publish pyspark (only code in the python/ dir) on PyPI. If
 anyone wants to develop against PySpark APIs, they need to download the
 distribution and do a lot of PYTHONPATH munging for all the tools (pylint,
 pytest, IDE code completion). Right now that involves adding python/ and
 python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more
 dependencies, we would have to manually mirror all the PYTHONPATH munging
 in the ./pyspark script. With a proper pyspark setup.py which declares its
 dependencies, and a published distribution, depending on pyspark will just
 be adding pyspark to my setup.py dependencies.

 Of course, if we actually want to run the parts of pyspark that are backed by
 Py4J calls, then we need the full spark distribution with either ./pyspark
 or ./spark-submit, but for things like linting and development, the
 PYTHONPATH munging is very annoying.

 I don't think the version-mismatch issues are a compelling reason to not
 go ahead with PyPI publishing. At runtime, we should definitely enforce
 that the version has to be exact, which means there is no backcompat
 nightmare as suggested by Davies in
 https://issues.apache.org/jira/browse/SPARK-1267. This would mean that
 even if the user's pip-installed pyspark somehow got loaded before the
 pyspark provided by the Spark distribution, the user would be alerted
 immediately.

 Davies, if you buy this, should I or someone on my team pick up
 https://issues.apache.org/jira/browse/SPARK-1267 and
 https://github.com/apache/spark/pull/464?

 On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Ok, I get it. Now what can we do to improve the current situation? Because
 right now, if I want to set up a CI env for PySpark, I have to:
 1- download a pre-built version of pyspark and unzip it somewhere on
 every agent
 2- define the SPARK_HOME env variable
 3- symlink this distribution's pyspark dir into the Python install's
 site-packages/ directory
 and if I rely on additional packages (like databricks' Spark-CSV
 project), I also have to (unless I'm mistaken):
 4- compile/assemble spark-csv and deploy the jar in a specific directory on
 every agent
 5- add this jar-filled directory to the Spark distribution's additional
 classpath using the conf/spark-defaults.conf file

 Then finally we can launch our unit/integration tests.
 Some issues are related to spark-packages, some to the lack of
 Python-based dependency management, and some to the way SparkContexts are
 launched when using pyspark.
 I think steps 1 and 2 are fair enough.
 Steps 4 and 5 may already have solutions; I didn't check, but considering
 that spark-shell downloads such dependencies automatically, I suspect
 pyspark will too if it doesn't already.

 For step 3, maybe just adding a setup.py to the distribution would be
 enough; I'm not exactly advocating distributing a full 300 MB Spark
 distribution on PyPI, so maybe there's a better compromise?

 Regards,

 Olivier.

 On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam j...@cs.berkeley.edu wrote:

 Couldn't we have a pip installable pyspark package that just serves as
 a shim to an existing Spark installation? Or it could even download the
 latest Spark binary if SPARK_HOME isn't set during installation. Right now,
 Spark doesn't play very well with the usual Python ecosystem. For example,
 why do I need to use a strange incantation when booting up IPython if I
 want to use PySpark in a notebook with MASTER=local[4]? It would be much
 nicer to just type `from pyspark import SparkContext; sc =
 SparkContext("local[4]")` in my notebook.

 I did a test and it seems like PySpark's basic unit-tests do pass when
 SPARK_HOME is set and Py4J is on the PYTHONPATH:


 PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
 python $SPARK_HOME/python/pyspark/rdd.py

 -Jey


 On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen rosenvi...@gmail.com
 wrote:

 This has been proposed before:
 https://issues.apache.org/jira/browse/SPARK-1267

 There's currently tighter coupling between the Python and Java halves
 of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
 we'd run into tons of issues when users try to run a newer version of the
 Python half of PySpark against an older set of Java components or
 vice-versa.

 On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot 
 o.girar...@lateral-thoughts.com 

Re: PySpark on PyPi

2015-06-05 Thread Josh Rosen
This has been proposed before:
https://issues.apache.org/jira/browse/SPARK-1267

There's currently tighter coupling between the Python and Java halves of
PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
we'd run into tons of issues when users try to run a newer version of the
Python half of PySpark against an older set of Java components or
vice-versa.

On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 Hi everyone,
 Considering that the Python API is just a front-end needing SPARK_HOME
 defined anyway, I think it would be interesting to deploy the Python part
 of Spark on PyPI in order to handle the dependencies of a Python project
 needing PySpark via pip.

 For now I just symlink python/pyspark into my Python install's
 site-packages/ directory so that PyCharm or other lint tools work properly.
 I can do the setup.py work or anything else needed.

 What do you think ?

 Regards,

 Olivier.



Re: PySpark on PyPi

2015-06-05 Thread Jey Kottalam
Couldn't we have a pip installable pyspark package that just serves as a
shim to an existing Spark installation? Or it could even download the
latest Spark binary if SPARK_HOME isn't set during installation. Right now,
Spark doesn't play very well with the usual Python ecosystem. For example,
why do I need to use a strange incantation when booting up IPython if I
want to use PySpark in a notebook with MASTER=local[4]? It would be much
nicer to just type `from pyspark import SparkContext; sc =
SparkContext("local[4]")` in my notebook.
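
A minimal sketch of such a shim, assuming SPARK_HOME points at an existing
Spark installation and Py4J is taken from the zip bundled under python/lib/
(the helper name here is made up for illustration):

import glob
import os
import sys

def add_spark_to_path(spark_home=None):
    # Make an existing Spark installation's Python code importable.
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise EnvironmentError("SPARK_HOME is not set")
    sys.path.insert(0, os.path.join(spark_home, "python"))
    # The bundled Py4J zip is versioned, so glob for it instead of hard-coding.
    for py4j_zip in glob.glob(
            os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
        sys.path.insert(0, py4j_zip)

add_spark_to_path()
from pyspark import SparkContext
sc = SparkContext("local[4]")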

I did a test and it seems like PySpark's basic unit-tests do pass when
SPARK_HOME is set and Py4J is on the PYTHONPATH:


PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
python $SPARK_HOME/python/pyspark/rdd.py

-Jey


On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen rosenvi...@gmail.com wrote:

 This has been proposed before:
 https://issues.apache.org/jira/browse/SPARK-1267

 There's currently tighter coupling between the Python and Java halves of
 PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
 we'd run into tons of issues when users try to run a newer version of the
 Python half of PySpark against an older set of Java components or
 vice-versa.

 On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi everyone,
 Considering that the Python API is just a front-end needing SPARK_HOME
 defined anyway, I think it would be interesting to deploy the Python part
 of Spark on PyPI in order to handle the dependencies of a Python project
 needing PySpark via pip.

 For now I just symlink python/pyspark into my Python install's
 site-packages/ directory so that PyCharm or other lint tools work properly.
 I can do the setup.py work or anything else needed.

 What do you think ?

 Regards,

 Olivier.





Re: PySpark on PyPi

2015-06-05 Thread Olivier Girardot
Ok, I get it. Now what can we do to improve the current situation? Because
right now, if I want to set up a CI env for PySpark, I have to:
1- download a pre-built version of pyspark and unzip it somewhere on every
agent
2- define the SPARK_HOME env variable
3- symlink this distribution's pyspark dir into the Python install's
site-packages/ directory
and if I rely on additional packages (like databricks' Spark-CSV project),
I also have to (unless I'm mistaken):
4- compile/assemble spark-csv and deploy the jar in a specific directory on
every agent
5- add this jar-filled directory to the Spark distribution's additional
classpath using the conf/spark-defaults.conf file

Then finally we can launch our unit/integration tests.
Some issues are related to spark-packages, some to the lack of Python-based
dependency management, and some to the way SparkContexts are launched when
using pyspark.
I think steps 1 and 2 are fair enough.
Steps 4 and 5 may already have solutions; I didn't check, but considering
that spark-shell downloads such dependencies automatically, I suspect
pyspark will too if it doesn't already.

For step 3, maybe just adding a setup.py to the distribution would be
enough; I'm not exactly advocating distributing a full 300 MB Spark
distribution on PyPI, so maybe there's a better compromise?
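
One possible compromise for step 3, purely as an illustration (not something
this thread has settled on): instead of symlinking, drop a .pth file into
site-packages that points at the distribution's python/ dir and the bundled
Py4J zip; the site module adds those entries to sys.path at interpreter
startup.

import glob
import os
from distutils.sysconfig import get_python_lib

# Assumes SPARK_HOME points at the unpacked pre-built distribution and that
# we can write into this interpreter's site-packages (e.g. in a virtualenv).
spark_home = os.environ["SPARK_HOME"]
entries = [os.path.join(spark_home, "python")]
entries += glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))

with open(os.path.join(get_python_lib(), "spark.pth"), "w") as f:
    f.write("\n".join(entries) + "\n")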

Regards,

Olivier.

On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam j...@cs.berkeley.edu wrote:

 Couldn't we have a pip installable pyspark package that just serves as a
 shim to an existing Spark installation? Or it could even download the
 latest Spark binary if SPARK_HOME isn't set during installation. Right now,
 Spark doesn't play very well with the usual Python ecosystem. For example,
 why do I need to use a strange incantation when booting up IPython if I
 want to use PySpark in a notebook with MASTER=local[4]? It would be much
 nicer to just type `from pyspark import SparkContext; sc =
 SparkContext("local[4]")` in my notebook.

 I did a test and it seems like PySpark's basic unit-tests do pass when
 SPARK_HOME is set and Py4J is on the PYTHONPATH:


 PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
 python $SPARK_HOME/python/pyspark/rdd.py

 -Jey


 On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen rosenvi...@gmail.com wrote:

 This has been proposed before:
 https://issues.apache.org/jira/browse/SPARK-1267

 There's currently tighter coupling between the Python and Java halves of
 PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
 we'd run into tons of issues when users try to run a newer version of the
 Python half of PySpark against an older set of Java components or
 vice-versa.

 On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi everyone,
 Considering that the Python API is just a front-end needing SPARK_HOME
 defined anyway, I think it would be interesting to deploy the Python part
 of Spark on PyPI in order to handle the dependencies of a Python project
 needing PySpark via pip.

 For now I just symlink python/pyspark into my Python install's
 site-packages/ directory so that PyCharm or other lint tools work properly.
 I can do the setup.py work or anything else needed.

 What do you think ?

 Regards,

 Olivier.