Fokko commented on pull request #28957:
URL: https://github.com/apache/spark/pull/28957#issuecomment-652022449
My pleasure @holdenk
I ran a query against one of Google's public datasets, which contains all
the public PyPI downloads:
```sql
SELECT
  EXTRACT(YEAR FROM timestamp) AS year,
  EXTRACT(MONTH FROM timestamp) AS month,
  SAFE.SUBSTR(details.python, 0, 3) AS python_version,
  COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pyspark'
  AND SAFE.SUBSTR(details.python, 0, 3) IS NOT NULL
GROUP BY
  EXTRACT(YEAR FROM timestamp),
  EXTRACT(MONTH FROM timestamp),
  SAFE.SUBSTR(details.python, 0, 3)
```
This gives us the following per month:

We can see that the majority use 3.7 and 3.6. However, there is still a
share on 3.5 and 2.7.
If we look at the proportional share of people who are using a compatible
version:
```sql
SELECT
  EXTRACT(YEAR FROM timestamp) AS year,
  EXTRACT(MONTH FROM timestamp) AS month,
  IF(SAFE.SUBSTR(details.python, 0, 3) >= '3.6', 'ok', 'not_ok') AS OK,
  COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pyspark'
  AND SAFE.SUBSTR(details.python, 0, 3) IS NOT NULL
GROUP BY
  EXTRACT(YEAR FROM timestamp),
  EXTRACT(MONTH FROM timestamp),
  IF(SAFE.SUBSTR(details.python, 0, 3) >= '3.6', 'ok', 'not_ok')
```
Then the majority is ok:

The next question would be whether Python <3.6 users are on 3.x or on 2.x. My
guess would be the latter, so we're (mostly) safe deprecating the old versions
of Python.
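
One caveat on the `>= '3.6'` check in the second query: it compares
three-character strings lexicographically, so a version such as 3.10 would be
truncated to '3.1' and counted as not_ok. A minimal sketch in Python of a
numeric comparison instead; the helper names and the `MIN_SUPPORTED` constant
are hypothetical and only illustrate the idea, they are not part of the
queries above:

```python
import re

# Assumption: (3, 6) is the minimum compatible version, as in the query above.
MIN_SUPPORTED = (3, 6)


def parse_major_minor(python_version):
    """Return (major, minor) as ints, or None if no such prefix is present."""
    match = re.match(r"(\d+)\.(\d+)", python_version or "")
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_ok(python_version):
    """True if the version is at least MIN_SUPPORTED, compared numerically."""
    parsed = parse_major_minor(python_version)
    return parsed is not None and parsed >= MIN_SUPPORTED


# The string comparison on a 3-character prefix misclassifies 3.10:
print("3.10.0"[:3] >= "3.6")  # False
# The numeric comparison does not:
print(is_ok("3.10.0"))  # True
```

For the versions that show up in the data today (2.7, 3.5, 3.6, 3.7) both
approaches agree, so this only matters once two-digit minor versions appear.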