Hi Laurent, and thanks for asking.

I re-clustered the tables. My work log and notes are here:

- https://medium.com/@hoffa/python-pypi-stats-in-bigquery-reclustered-d80e583e1bfe

If you use my tables, a query that used to process 200.88GB now scans only
9.65GB when filtering for a particular package. That is a 95% reduction!
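
As a quick sanity check on that figure, here is a minimal Python sketch of
the arithmetic. The function name scan_reduction is only illustrative; the
two sizes are the ones quoted above.

```python
def scan_reduction(before_gb: float, after_gb: float) -> float:
    """Return the fraction of scanned data saved, as a value between 0 and 1."""
    if before_gb <= 0:
        raise ValueError("before_gb must be positive")
    return (before_gb - after_gb) / before_gb


# Sizes reported for the same query before and after re-clustering.
reduction = scan_reduction(200.88, 9.65)
print(f"{reduction:.1%} less data scanned")  # prints "95.2% less data scanned"
```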

For example:

SELECT TIMESTAMP_TRUNC(timestamp, WEEK) week
  , REGEXP_EXTRACT(details.python, r'^\d*\.\d*') python
  , COUNT(*) downloads
FROM `the-psf.pypi.downloads2017*`
WHERE file.project='pyspark'
GROUP BY week, python
HAVING python != '3.6' AND week<'2017-12-30'
ORDER BY week
--
Distutils-SIG mailing list -- distutils-sig@python.org
To unsubscribe send an email to distutils-sig-le...@python.org
https://mail.python.org/mm3/mailman3/lists/distutils-sig.python.org/
Message archived at 
https://mail.python.org/mm3/archives/list/distutils-sig@python.org/message/HXOKILZGO7EFOEH5TGXIKCNXM5QYZ56N/
