Justin, If you reduce the amount of data you are sampling queries will be faster, instead of sampling one month try an hour.
Thanks, Nuria On Fri, Apr 22, 2016 at 3:57 PM, Justin Clark <[email protected]> wrote: > Hi all, > > I'm a researcher at the Berkman Center for Internet & Society at Harvard > doing some work on anomaly detection against Wikipedia article request > volumes. > > I'd like to create time series of request volumes for as many > article-country pairs as is possible. I'm using a number of different data > sets, but the most useful for our purposes is the pageview_hourly table. I > understand that this is a tremendous amount of data, and we are in the > process of prioritizing the article-country pairs, but my question is: what > is the best/fastest way to query this data from hive? Writing a query that > gets at the data is not a problem, but I'm curious about possible > strategies that could speed up the process. > > Here is a reference query that shows the kind of data I'm looking for: > > SELECT view_count FROM pageview_hourly WHERE > year = 2015 AND > month = 1 AND > page_title = 'World_War_II' AND > country_code = 'CA' AND > agent_type = 'user' > ORDER BY day, hour; > > A couple options that come to mind: > * a year > 0 query vs many yearly, monthly, daily, or hourly queries > * batching articles with page_title IN (...) > * dropping country_code to get all countries at once (or batch like > above) > * ordering posthoc to avoid the map-reduce overhead > > Because there's so much data and a query like the above takes ~10 minutes, > experimenting with these is a long process. I was hoping someone more > familiar could share any magic that might speed things up (or tell me > there's no magic bullet and everything will take about as long as > everything else). If no one can say quickly off the top of their head, I > can just do that experimentation, but more options to try are totally > welcome. > > Thanks, > Justin > > _______________________________________________ > Analytics mailing list > [email protected] > https://lists.wikimedia.org/mailman/listinfo/analytics >
_______________________________________________ Analytics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/analytics
