Github user Stibbons commented on the issue:
https://github.com/apache/spark/pull/14963
Indeed, I had not taken Anaconda environments into account, at first because
this tool provides a quick and efficient way to get an almost-good working
environment for running jobs with numpy, pandas, and so on. There are so many
conda packages available to download that it could almost make pip irrelevant.
But actually it cannot. Anaconda and pip do not really play in the same
garden. Anaconda is great for a one-shot job or a numpy script that needs to
run in a fairly well-controlled environment, but that level of control is not
that strong. You can require numpy v1.1.0, but the user may have chosen a
"channel" other than the default one because someone else has compiled it for
their architecture. For example, you can use `conda install -c intel` to
download Intel-optimized libraries. This is great, but it leaves too much
latitude for errors to occur. Reproducibility is not as well guaranteed as
with pip.
At least on PyPI you get the default libraries and you are guaranteed to
have the original package, unpatched, and if the developer has provided
"wheels", you may even download precompiled binaries (see the
[numpy](https://pypi.python.org/pypi/numpy) package).
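To make the reproducibility point concrete (a minimal sketch; the version
number is only an example, not one required by this PR): with pip you pin an
exact release and always get the same PyPI package, while with conda the
configured channel decides which build you actually receive:

```bash
# pip: pinned version, always the same, unpatched package from PyPI
pip install numpy==1.11.1

# conda: the same request can resolve to different binaries
# depending on the channel the user has configured
conda install numpy=1.11.1            # build from the default channel
conda install -c intel numpy=1.11.1   # Intel-optimized build, a different binary
```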
And you want tight control over the deployment of your master
environment. Services such as Heroku and the like do not provide a conda
environment by default for deploying an application; if users want one, they
can of course install it themselves.
**TL;DR**: executing the Spark master inside a conda environment is a BAD
idea. Executing the Driver and Executors inside a conda environment, on the
other hand, makes total sense, and that is the proposal of the PR Jeff and I
are working on: #14180.
So, for this PR, we have several choices:
- trust the user: they know what they are doing. If we are inside a conda
environment, we can install packages with `conda install`. I must admit this is
not my favorite choice, since altering the conda environment might have
unwanted effects (if you don't use the right channel, you might replace some
dependencies without the user's acknowledgment). Using pip from within a conda
environment is not recommended at all.
- detect if we are inside a conda environment and unconditionally leave it
within the script. Of course this only impacts the `lint-python` script. I
prefer this solution, since we fall back to a nicely controlled virtualenv for
the lint script (a sketch follows below).
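A minimal sketch of that second option (the `CONDA_DEFAULT_ENV` variable is
real conda behavior, but the paths and linter names are assumptions for
illustration, not the actual `lint-python` code):

```bash
#!/usr/bin/env bash
# Hypothetical sketch: leave any active conda environment, then run
# the lint tools from a dedicated, controlled virtualenv.

if [ -n "$CONDA_DEFAULT_ENV" ]; then
    echo "Conda environment '$CONDA_DEFAULT_ENV' detected; leaving it for linting."
    source deactivate   # conda's deactivation syntax at the time of this PR
fi

LINT_VENV="./.lint-venv"            # assumed location, not the real script's
if [ ! -d "$LINT_VENV" ]; then
    virtualenv "$LINT_VENV"
fi
source "$LINT_VENV/bin/activate"
pip install pep8 pylint             # example linters only
```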
I use conda quite a lot nowadays, playing with scikit-learn and the like. I
must admit I am impressed by the quality of the features and precompiled
packages it provides, and I find it really useful for computation jobs, even
for Spark jobs.
As for the output, I'll make better use of the `--debug` argument.