Hi Justin,
Thanks for being understanding about the amount of data you are getting into :)
A few things to consider:

   - The hive table is partitioned by year / month / day / hour. For testing
   purposes, having those four values set makes the volume of data to scan
   much smaller (~350G per month, therefore ~11G per day, ~0.5G per hour).
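
   As a sketch, here is what the reference query from the quoted message
   might look like with all four partition columns constrained (the day and
   hour values are illustrative, not prescriptive):

   SELECT view_count
   FROM pageview_hourly
   WHERE year = 2015
     AND month = 1
     AND day = 15        -- partition predicate: limits the scan to one day
     AND hour = 12       -- partition predicate: limits the scan to one hour
     AND page_title = 'World_War_II'
     AND country_code = 'CA'
     AND agent_type = 'user';

   With all four partition predicates set, Hive prunes the scan down to a
   single hourly partition (~0.5G) instead of a full month (~350G).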


On Sat, Apr 23, 2016 at 12:57 AM, Justin Clark <[email protected]> wrote:

> Hi all,
>
> I'm a researcher at the Berkman Center for Internet & Society at Harvard
> doing some work on anomaly detection against Wikipedia article request
> volumes.
>
> I'd like to create time series of request volumes for as many
> article-country pairs as is possible. I'm using a number of different data
> sets, but the most useful for our purposes is the pageview_hourly table. I
> understand that this is a tremendous amount of data, and we are in the
> process of prioritizing the article-country pairs, but my question is: what
> is the best/fastest way to query this data from hive? Writing a query that
> gets at the data is not a problem, but I'm curious about possible
> strategies that could speed up the process.
>
> Here is a reference query that shows the kind of data I'm looking for:
>
> SELECT view_count FROM pageview_hourly WHERE
> year = 2015 AND
> month = 1 AND
> page_title = 'World_War_II' AND
> country_code = 'CA' AND
> agent_type = 'user'
> ORDER BY day, hour;
>
> A couple options that come to mind:
>   * a year > 0 query vs many yearly, monthly, daily, or hourly queries
>   * batching articles with page_title IN (...)
>   * dropping country_code to get all countries at once (or batch like
> above)
>   * ordering posthoc to avoid the map-reduce overhead
>
> Because there's so much data and a query like the above takes ~10 minutes,
> experimenting with these is a long process. I was hoping someone more
> familiar could share any magic that might speed things up (or tell me
> there's no magic bullet and everything will take about as long as
> everything else). If no one can say quickly off the top of their head, I
> can just do that experimentation, but more options to try are totally
> welcome.
>
> Thanks,
>   Justin
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>



-- 
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
