Github user Stibbons commented on the issue:
https://github.com/apache/spark/pull/14963
Indeed, I had not taken Anaconda environments into account, at first because
this tool provides a quick and efficient way to get an almost-good working
environment for running jobs with numpy, pandas, and so on. There are so many
conda packages available to download that it could almost make pip irrelevant.
But actually it cannot. Anaconda and pip do not really play in the same
garden. Anaconda is great for a one-shot job or a numpy script that needs to
run in a fairly well-controlled environment, but that level of control is not
that strong. You can require numpy v1.1.0, but the user may have chosen a
"channel" other than the default one because someone else has compiled it for
their architecture. For example, you can use `conda install -c intel` to
download Intel-optimized libraries. This is great, but it leaves too much
latitude for errors to occur. Reproducibility is not as well guaranteed as
with pip.
At least on PyPI you get the default libraries and you are guaranteed to
have the original package, unpatched, and if the developer has provided
"wheels", you may even download precompiled binaries (see the
[numpy](https://pypi.python.org/pypi/numpy) package).
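To make the reproducibility point concrete (a minimal sketch; the version
number is only an example, not one required by this PR): with pip you pin an
exact release and always get the same PyPI package, while with conda the
configured channel decides which build you actually receive:

```bash
# pip: pinned version, always the same, unpatched package from PyPI
pip install numpy==1.11.1

# conda: the same request can resolve to different binaries
# depending on the channel the user has configured
conda install numpy=1.11.1            # build from the default channel
conda install -c intel numpy=1.11.1   # Intel-optimized build, a different binary
```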
And you want tight control over the deployment of your master
environment. Services such as Heroku and the like do not provide a conda
environment by default for deploying an application; if users want one, they
can of course install it themselves.
**TL;DR**: executing the Spark master inside a conda environment is a BAD
idea. Executing the Driver and Executors inside a conda environment, on the
other hand, makes total sense, and that is the proposal of the PR Jeff and I
are working on: #14180.
So, for this PR, we have several choices:
- trust the user: they know what they are doing. If we are inside a conda
environment, we can install packages with `conda install`. I must admit this is
not my favorite choice, since altering the conda environment might have
unwanted effects (if you don't use the right channel, you might replace some
dependencies without the user's acknowledgment). Using pip from within a conda
environment is not recommended at all.
- detect if we are inside a conda environment and unconditionally leave it
within the script. Of course this only impacts the `lint-python` script. I
prefer this solution, since we fall back to a nicely controlled virtualenv for
the lint script (a sketch follows below).
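A minimal sketch of that second option (the `CONDA_DEFAULT_ENV` variable is
real conda behavior, but the paths and linter names are assumptions for
illustration, not the actual `lint-python` code):

```bash
#!/usr/bin/env bash
# Hypothetical sketch: leave any active conda environment, then run
# the lint tools from a dedicated, controlled virtualenv.

if [ -n "$CONDA_DEFAULT_ENV" ]; then
    echo "Conda environment '$CONDA_DEFAULT_ENV' detected; leaving it for linting."
    source deactivate   # conda's deactivation syntax at the time of this PR
fi

LINT_VENV="./.lint-venv"            # assumed location, not the real script's
if [ ! -d "$LINT_VENV" ]; then
    virtualenv "$LINT_VENV"
fi
source "$LINT_VENV/bin/activate"
pip install pep8 pylint             # example linters only
```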
I use conda quite a lot nowadays, playing with scikit-learn and the like. I
must admit I am impressed by the quality of the features and precompiled
packages it provides, and I find it really useful for computation jobs, even
for Spark jobs.
As for the output, I'll make better use of the `--debug` argument.