- to query, say, a month's worth of data, what would need to be done? (a sketch follows below)
- "sharded by day" ... UTC?
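A minimal sketch of one way the month-long query could look, using legacy BigQuery SQL's TABLE_DATE_RANGE to union the day shards over a date range. The table prefix comes from Donald's message below; the `file.project` field is assumed from the linked schema, and the dates are illustrative:

    -- Top 20 most-downloaded projects for May 2016 (assumed schema field
    -- file.project). TABLE_DATE_RANGE unions every downloadsYYYYMMDD
    -- shard whose date suffix falls inside the given range.
    SELECT file.project, COUNT(*) AS downloads
    FROM (TABLE_DATE_RANGE([the-psf:pypi.downloads],
                           TIMESTAMP('2016-05-01'),
                           TIMESTAMP('2016-05-31')))
    GROUP BY file.project
    ORDER BY downloads DESC
    LIMIT 20

(Whether the shard boundaries are UTC isn't settled in this thread; the sketch only assumes the suffixes are calendar dates.)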
On Saturday, May 21, 2016, Donald Stufft <don...@stufft.io> wrote:

> Hey,
>
> One thing I’ve been working on as part of Warehouse is a subproject that
> I call “Linehaul”. This is essentially a little statistics daemon that
> will take specially formatted syslog messages coming off of Fastly and
> shove them into a BigQuery database. I’m happy to report that I’ve just
> finished the production deployment of this, and we’re now sending every
> download event that hits Fastly into BigQuery.
>
> First off, I’d like to thank Felipe Hoffa, Will Curran, and Preston
> Holmes over at Google for helping me get credits for this sorted out so
> that we can actually get this going! They’ve been a big help.
>
> So onto what this means.
>
> Basically, BigQuery gives us the ability to relatively quickly (typically
> < 60s) query very large datasets using something that is very similar to
> SQL. Unlike typical time series databases, we don’t have to know ahead of
> time what we want to query on; we can just insert data into rows in a
> table (and our tables are sharded by day), and then, using the SQL-like
> query language, you can do any sort of query you like.
>
> On top of all of this, BigQuery gives me the ability to share the dataset
> publicly with anyone who is logged into a Google account, which means
> that *anyone* can query this data and look for any sort of interesting
> information they can find in it. The cost of any queries you run will be
> associated with your own account (but the first 1TB of data a month that
> you query is free, I believe, and you aren’t charged for queries that
> error out or return cached results).
>
> Anyways, you can query these BigQuery tables whose names match the
> pattern `the-psf:pypi.downloadsYYYYMMDD` and you can see whatever data
> you want.
>
> We’ve only just started recording the data in this spot, so right now
> there isn’t a whole lot of data available (but it’s constantly streaming
> in). Once the 22nd rolls over, I’m going to delete what data we have
> available for the 21st, and then start backfilling historical data
> starting with the 21st and going backwards. You should be able to run
> queries in a Web UI by navigating to
> https://bigquery.cloud.google.com/dataset/the-psf:pypi (you might have to
> accept a Terms of Service).
>
> The table schema looks like: https://s.caremad.io/lPpTF6rxWZ/ but it
> should also be visible on the BigQuery page. Some example queries you
> might want to run are located at
> https://gist.github.com/alex/4f100a9592b05e9b4d63 (but note, that’s
> currently using the *old*, not publicly available, table name; you’ll
> need to replace [long-stack-762:pypi.downloads] with
> [the-psf:pypi.downloads]). If you want to write your own queries, you
> should be able to find the syntax here:
> https://cloud.google.com/bigquery/query-reference
>
> Anyways, new data should constantly be streaming in, and I should be able
> to backfill data all the way to Jan of 2014 or so. Hopefully this is
> useful to folks, and if you find any interesting queries or numbers,
> please share them!
>
> —
> Donald Stufft
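For anyone adapting the gist's queries: an individual day's shard can also be queried directly by its YYYYMMDD suffix. A trivial sanity check in legacy SQL, with an illustrative date:

    -- Counts the download events recorded in the May 21, 2016 shard.
    SELECT COUNT(*) AS total_downloads
    FROM [the-psf:pypi.downloads20160521]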
_______________________________________________
Distutils-SIG maillist - Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig