Hi Laurent, and thanks for asking. I re-clustered the tables - find my work-log and notes here:
- https://medium.com/@hoffa/python-pypi-stats-in-bigquery-reclustered-d80e583e1bfe

If you use my tables, a query that used to process 200.88GB now scans only 9.65GB when filtering for a particular package. That's a 95% reduction! For example:

  SELECT TIMESTAMP_TRUNC(timestamp, WEEK) week
    , REGEXP_EXTRACT(details.python, r'^\d*\.\d*') python
    , COUNT(*) downloads
  FROM `the-psf.pypi.downloads2017*`
  WHERE file.project='pyspark'
  GROUP BY week, python
  HAVING python != '3.6' AND week<'2017-12-30'
  ORDER BY week

--
Distutils-SIG mailing list -- distutils-sig@python.org
To unsubscribe send an email to distutils-sig-le...@python.org
https://mail.python.org/mm3/mailman3/lists/distutils-sig.python.org/
Message archived at https://mail.python.org/mm3/archives/list/distutils-sig@python.org/message/HXOKILZGO7EFOEH5TGXIKCNXM5QYZ56N/