Thank you, Guillaume, for your help.

When I start a Spark cluster on AWS, I add a bootstrap step that upgrades pip and installs scikit-learn, so that users no longer have to install scikit-learn from within their job with sc.install_pypi_package.
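As a rough illustration only (this is not the actual bootstrap script), such a step could be sketched in Python as below. The function names are hypothetical, and calling pip as `python3 -m pip` with `sudo` is an assumption made here so that it is unambiguous which pip gets invoked. The version thresholds are the ones Guillaume quotes further down in this thread (pip >= 19.0 for manylinux2010 wheels, pip >= 19.3 for manylinux2014 wheels).

```python
import re

# Minimum pip versions needed per wheel family, as quoted in this thread.
MIN_PIP_FOR_WHEEL = {
    "manylinux2010": (19, 0),
    "manylinux2014": (19, 3),
}


def parse_pip_version(version_output):
    """Extract (major, minor) from the output of `pip3 --version`."""
    match = re.match(r"pip (\d+)\.(\d+)", version_output.strip())
    if match is None:
        raise ValueError(f"unrecognised pip version output: {version_output!r}")
    return int(match.group(1)), int(match.group(2))


def supported_wheel_families(version_output):
    """Return the manylinux families this pip is recent enough to install."""
    version = parse_pip_version(version_output)
    return sorted(
        family for family, minimum in MIN_PIP_FOR_WHEEL.items() if version >= minimum
    )


def bootstrap_commands(python="python3"):
    """Hypothetical bootstrap step: upgrade pip, then install scikit-learn.

    Only builds the command lists; running them (for example with
    subprocess.run) is left to the caller.
    """
    return [
        ["sudo", python, "-m", "pip", "install", "--upgrade", "pip"],
        ["sudo", python, "-m", "pip", "install", "scikit-learn"],
    ]


# The two pip versions reported in this thread:
old_pip = "pip 9.0.3 from /usr/lib/python3.7/site-packages (python 3.7)"
new_pip = "pip 20.3.3 from /usr/local/lib/python3.7/site-packages/pip (python 3.7)"
print(supported_wheel_families(old_pip))  # []
print(supported_wheel_families(new_pip))  # ['manylinux2010', 'manylinux2014']
```

With pip 9.0.3 neither wheel family qualifies, which is consistent with pip falling back to the source tarball; after the upgrade to 20.3.3 the manylinux2010 wheel becomes installable.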

We use Spark with scikit-learn for hyper-parameter tuning: Spark runs many model configurations in parallel by broadcasting the pandas DataFrame and fitting an independent model in each Spark container. That is why scikit-learn must be installed on every worker node. This technique works very well provided that the pandas DataFrame fits in container memory, since each Spark container holds its own copy.

Thank you for your great work and help,

Cheers,

Bertrand

On Fri, 22 Jan 2021 at 09:06, Guillaume Lemaître <g.lemaitr...@gmail.com> wrote:

> OK, so the normal install is working. Now, to fix your issue we need to
> understand how `sc.install_pypi_package` works and, mainly, how it
> calls `pip`. We need to make sure that it calls the right pip (the system
> `pip3` in your case).
>
>
> On Fri, 22 Jan 2021 at 14:39, Bertrand B. <bertrand25...@gmail.com> wrote:
>
>> Thank you Guillaume for your help,
>>
>> I am using (running on AWS EMR-6.2):
>> pip3 --version
>> pip 9.0.3 from /usr/lib/python3.7/site-packages (python 3.7)
>>
>>
>> pip3 install scikit-learn
>>
>> Collecting scikit-learn
>> Using cached
>> https://files.pythonhosted.org/packages/f4/7b/d415b0c89babf23dcd8ee631015f043e2d76795edd9c7359d6e63257464b/scikit-learn-0.24.1.tar.gz
>> Requirement already satisfied: numpy>=1.13.3 in
>> /usr/local/lib64/python3.7/site-packages (from scikit-learn)
>> Collecting scipy>=0.19.1 (from scikit-learn)
>> Using cached
>> https://files.pythonhosted.org/packages/58/9d/8296d8211318d690119eba6d293b7a149c1c51c945342dd4c3816f79e1ba/scipy-1.6.0-cp37-cp37m-manylinux1_x86_64.whl
>> Requirement already satisfied: joblib>=0.11 in
>> /usr/local/lib64/python3.7/site-packages (from scikit-learn)
>> Requirement already satisfied: threadpoolctl>=2.0.0 in
>> /usr/local/lib/python3.7/site-packages (from scikit-learn)
>> Installing collected packages: scipy, scikit-learn
>> Running setup.py install for scikit-learn ... error
>> Complete output from command /usr/bin/python3 -u -c "import
>> setuptools,
>> tokenize;__file__='/mnt/tmp/pip-build-93pagltp/scikit-learn/setup.py';f=getattr(tokenize,
>> 'open', open)(__file__);code=f.read().replace('\r\n',
>> '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record
>> /tmp/pip-0ulalx36-record/install-record.txt
>> --single-version-externally-managed --compile:
>> Partial import of sklearn during the build process.
>> Traceback (most recent call last):
>> File
>> "/mnt/tmp/pip-build-93pagltp/scikit-learn/sklearn/_build_utils/__init__.py",
>> line 27, in _check_cython_version
>> import Cython
>> ModuleNotFoundError: No module named 'Cython'
>>
>>
>> Upgrading pip to 20.3.3:
>>
>> sudo pip3 install --upgrade pip
>> sudo ln -s /usr/local/bin/pip3 /usr/bin/pip3
>>
>> pip3 --version
>> pip 20.3.3 from /usr/local/lib/python3.7/site-packages/pip (python 3.7)
>>
>> lets me install from the whl file:
>> pip3 install scikit-learn
>> Collecting scikit-learn
>> Downloading scikit_learn-0.24.1-cp37-cp37m-manylinux2010_x86_64.whl
>> (22.3 MB)
>>
>> However, using the API sc.install_pypi_package("scikit-learn") still uses
>> the tar file instead of the whl file (even after the pip upgrade).
>>
>> Collecting scikit-learn
>> Using cached
>> https://files.pythonhosted.org/packages/f4/7b/d415b0c89babf23dcd8ee631015f043e2d76795edd9c7359d6e63257464b/scikit-learn-0.24.1.tar.gz
>>
>>
>> Thanks for your help,
>>
>> Cheers,
>>
>> Bertrand
>>
>> On Fri, 22 Jan 2021 at 04:13, Guillaume Lemaître <g.lemaitr...@gmail.com> wrote:
>>
>>> @Bertrand Could you tell us which version of `pip` you use (you need
>>> pip >= 19.0 for manylinux2010 and pip >= 19.3 for manylinux2014)?
>>>
>>> On Fri, 22 Jan 2021 at 09:49, Guillaume Lemaître <g.lemaitr...@gmail.com>
>>> wrote:
>>>
>>>> We might be experiencing an issue with PyPI not selecting the manylinux2010
>>>> wheel: https://github.com/scikit-learn/scikit-learn/issues/19233
>>>> We have to check, but we will probably upload manylinux1 wheels shortly,
>>>> which should resolve the issue.
>>>>
>>>> I am curious whether fetching the wheel by hand and installing via `pip`
>>>> would be a workaround (not practical for automated usage, though).
>>>>
>>>> On Thu, 21 Jan 2021 at 00:34, The Helmbolds via scikit-learn <
>>>> scikit-learn@python.org> wrote:
>>>>
>>>>> Use the Anaconda Python installation.
>>>>>
>>>>> "You won't find the right answers if you don't ask the right
>>>>> questions!" (Robert Helmbold, 2013)
>>>>>
>>>>>
>>>>> On Wednesday, January 20, 2021, 04:16:15 PM MST, Guillaume Lemaître <
>>>>> g.lemaitr...@gmail.com> wrote:
>>>>>
>>>>>
>>>>> Basically it gets the tarball with the source and recompiles instead of
>>>>> using the wheel. Could you force an install from PyPI without using the
>>>>> cached file?
>>>>>
>>>>> We pushed wheels yesterday for 0.24.1 as well so it should not get the
>>>>> 0.24.0 version.
>>>>>
>>>>> For 0.23.2, you can see that it used the wheel (.whl).
>>>>>
>>>>> Sent from my phone - sorry to be brief and potential misspell.
>>>>> *From:* bertrand25...@gmail.com
>>>>> *Sent:* 20 January 2021 23:21
>>>>> *To:* scikit-learn@python.org
>>>>> *Reply to:* scikit-learn@python.org
>>>>> *Subject:* [scikit-learn] scikit-learn 0.24 installation fails with
>>>>> ModuleNotFoundError: No module named 'scipy'
>>>>>
>>>>> To whom it may concern,
>>>>>
>>>>> I am trying to install scikit-learn in a PySpark job using the
>>>>> install_pypi_package PySpark API but the install fails with:
>>>>>
>>>>> sc.install_pypi_package("scikit-learn")
>>>>>
>>>>> Collecting scikit-learn
>>>>> Using cached
>>>>> https://files.pythonhosted.org/packages/db/e2/9c0bde5f81394b627f623557690536b12017b84988a4a1f98ec826edab9e/scikit-learn-0.24.0.tar.gz
>>>>> Requirement already satisfied: numpy>=1.13.3 in
>>>>> /usr/local/lib64/python3.7/site-packages (from scikit-learn)
>>>>> Collecting scipy>=0.19.1 (from scikit-learn)
>>>>> Using cached
>>>>> https://files.pythonhosted.org/packages/58/9d/8296d8211318d690119eba6d293b7a149c1c51c945342dd4c3816f79e1ba/scipy-1.6.0-cp37-cp37m-manylinux1_x86_64.whl
>>>>> Requirement already satisfied: joblib>=0.11 in
>>>>> /usr/local/lib64/python3.7/site-packages (from scikit-learn)
>>>>> Collecting threadpoolctl>=2.0.0 (from scikit-learn)
>>>>> Using cached
>>>>> https://files.pythonhosted.org/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl
>>>>> Building wheels for collected packages: scikit-learn
>>>>> Running setup.py bdist_wheel for scikit-learn: started
>>>>> Running setup.py bdist_wheel for scikit-learn: finished with status
>>>>> 'error'
>>>>> Complete output from command /tmp/1611000009300-0/bin/python -u -c
>>>>> "import setuptools,
>>>>> tokenize;__file__='/mnt/tmp/pip-build-phc6p6gl/scikit-learn/setup.py';f=getattr(tokenize,
>>>>> 'open', open)(__file__);code=f.read().replace('\r\n',
>>>>> '\n');f.close();exec(compile(code, __file__, 'exec'))"
>>>>> bdist_wheel -d /tmp/tmpry3gf9r0pip-wheel- --python-tag cp37:
>>>>> Partial import of sklearn during the build process.
>>>>> Traceback (most recent call last):
>>>>> File "/mnt/tmp/pip-build-phc6p6gl/scikit-learn/setup.py", line 201,
>>>>> in check_package_status
>>>>> module = importlib.import_module(package)
>>>>> File "/tmp/1611000009300-0/lib64/python3.7/importlib/__init__.py",
>>>>> line 127, in import_module
>>>>> return _bootstrap._gcd_import(name[level:], package, level)
>>>>> File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
>>>>> File "<frozen importlib._bootstrap>", line 983, in _find_and_load
>>>>> File "<frozen importlib._bootstrap>", line 965, in
>>>>> _find_and_load_unlocked
>>>>> ModuleNotFoundError: No module named 'scipy'
>>>>> Traceback (most recent call last):
>>>>> File "<string>", line 1, in <module>
>>>>> File "/mnt/tmp/pip-build-phc6p6gl/scikit-learn/setup.py", line 306,
>>>>> in <module>
>>>>> setup_package()
>>>>> File "/mnt/tmp/pip-build-phc6p6gl/scikit-learn/setup.py", line 294,
>>>>> in setup_package
>>>>> check_package_status('scipy', min_deps.SCIPY_MIN_VERSION)
>>>>> File "/mnt/tmp/pip-build-phc6p6gl/scikit-learn/setup.py", line 227,
>>>>> in check_package_status
>>>>> .format(package, req_str, instructions))
>>>>> ImportError: scipy is not installed.
>>>>> scikit-learn requires scipy >= 0.19.1.
>>>>>
>>>>> I do not encounter this error with scikit-learn 0.23.2:
>>>>>
>>>>> sc.install_pypi_package("scikit-learn==0.23.2")
>>>>>
>>>>> Collecting scikit-learn==0.23.2
>>>>> Using cached
>>>>> https://files.pythonhosted.org/packages/f4/cb/64623369f348e9bfb29ff898a57ac7c91ed4921f228e9726546614d63ccb/scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl
>>>>> Requirement already satisfied: scipy>=0.19.1 in
>>>>> /mnt/tmp/1611000009300-0/lib/python3.7/site-packages (from
>>>>> scikit-learn==0.23.2)
>>>>> Requirement already satisfied: numpy>=1.13.3 in
>>>>> /usr/local/lib64/python3.7/site-packages (from scikit-learn==0.23.2)
>>>>> Requirement already satisfied: joblib>=0.11 in
>>>>> /usr/local/lib64/python3.7/site-packages (from scikit-learn==0.23.2)
>>>>> Requirement already satisfied: threadpoolctl>=2.0.0 in
>>>>> /mnt/tmp/1611000009300-0/lib/python3.7/site-packages (from
>>>>> scikit-learn==0.23.2)
>>>>> Installing collected packages: scikit-learn
>>>>> Successfully installed scikit-learn-0.23.2
>>>>>
>>>>>
>>>>> Could you please help me understand why the scikit-learn 0.24
>>>>> installation fails?
>>>>>
>>>>> Thank you for your help,
>>>>>
>>>>> Bertrand
>>>>> _______________________________________________
>>>>> scikit-learn mailing list
>>>>> scikit-learn@python.org
>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>> --
>>>> Guillaume Lemaitre
>>>> Scikit-learn @ Inria Foundation
>>>> https://glemaitre.github.io/