Hi Dan,

Making dumps much easier to use would definitely help. We Wikipedia researchers are kind of spoiled: we have easy public access to historical revision data for all projects, going back to 2001, through the API *and* public db endpoints like Quarry. It's only natural that we want the same thing with pageviews!!! :)
I can think of other use cases for keeping more than 18 months of data available through the API, but they're all research use cases. I don't think having lower-granularity historical data available beyond a certain point is helpful for those: if you're doing historical analysis, you want consistency. But an application that parsed dumps on the server side to yield historical data (ideally in a format and granularity that wasn't fundamentally different from that of the API, so you could join the streams) would definitely be useful, and would probably address most research needs I can think of, inside and outside the Foundation.

Thanks for asking,
Jonathan

On Fri, Jul 29, 2016 at 12:27 PM, Dan Andreescu <[email protected]> wrote:

> Amir and Jonathan - thanks for speaking up for the "more than 18 months"
> use cases. If dumps were *much* easier to use (via python clients that
> made it transparent whether you were hitting the API or not), would that be
> an acceptable solution? I feel like both of your use cases are not things
> that will be happening on a daily basis. If that's true, another solution
> would be an ad-hoc API that took in a filter and a date range, applied it
> server-side, and gave you a partial dump with only the interesting data.
> If this didn't happen very often, it would allow us to trade processing
> time and a bit of dev time in place of more expensive storage.
>
> Or, if we end up needing frequent access to old data, we should be able to
> justify spending more money on more servers. Just trying to save as much
> money as possible :)
>
> Thanks all so far, please feel free to keep chiming in if you have other
> use cases that haven't been covered, or if you'd like to add more weight
> behind the "more than 18 months" use cases.
>
> On Fri, Jul 29, 2016 at 3:18 PM, Leila Zia <[email protected]> wrote:
>
>> Dan, thanks for reaching out.
>>
>> 18 months is enough for my use cases as long as the dumps capture the
>> exact data structure.
>>
>> Best,
>> Leila
>>
>> --
>> Leila Zia
>> Senior Research Scientist
>> Wikimedia Foundation
>>
>> On Fri, Jul 29, 2016 at 11:51 AM, Amir E. Aharoni <
>> [email protected]> wrote:
>>
>>> I am now checking traffic data every day to see whether Compact Language
>>> Links affect it. It makes sense to compare it not only to the previous
>>> week, but also to the same month in the previous year. So one year is
>>> hardly enough. 18 months is better, and three years is much better,
>>> because I'll also be able to check the same month in earlier years.
>>>
>>> I imagine that this may be useful to all product managers who work on
>>> features that can affect traffic.
>>>
>>> On 29 July 2016 at 15:41, "Dan Andreescu" <[email protected]>
>>> wrote:
>>>
>>>> Dear Pageview API consumers,
>>>>
>>>> We would like to plan storage capacity for our pageview API cluster.
>>>> Right now, with a reliable RAID setup, we can keep *18 months* of
>>>> data. If you'd like to query further back than that, you can download dump
>>>> files (which we'll make easier to use with python utilities).
>>>>
>>>> What do you think? Will you need more than 18 months of data? If so,
>>>> we need to add more nodes when we get to that point, and that costs money,
>>>> so we want to check if there is a real need for it.
>>>>
>>>> Another option is to start degrading the resolution for older data
>>>> (for example, only keeping weekly or monthly data for anything older than
>>>> 1 year). If you need more than 18 months, we'd love to hear your use case
>>>> and something in the form of:
>>>>
>>>> need daily resolution for 1 year
>>>> need weekly resolution for 2 years
>>>> need monthly resolution for 3 years
>>>>
>>>> Thank you!
>>>>
>>>> Dan

--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
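[For readers sketching what Dan's ad-hoc endpoint might do: a minimal, hypothetical sketch of the server-side filter he describes — take a filter and a date range, apply them to dump rows, and return only the interesting data. It assumes the pagecounts dump line format "project page_title view_count response_bytes"; the function and parameter names are illustrative, not Wikimedia code.]

```python
from datetime import date

# Hypothetical sketch of a server-side partial-dump filter: given dump rows
# keyed by day, in the pagecounts line format
# "project page_title view_count response_bytes", yield only rows that match
# a project/title filter inside the requested date range.
def filter_dump(rows, project, title_prefix, start, end):
    """rows: iterable of (day, line) pairs; yields (day, title, views)."""
    for day, line in rows:
        if not (start <= day <= end):
            continue
        proj, title, views, _resp_bytes = line.split(" ")
        if proj == project and title.startswith(title_prefix):
            yield day, title, int(views)

# Tiny in-memory stand-in for real dump files:
sample = [
    (date(2016, 7, 1), "en Main_Page 1000 154000"),
    (date(2016, 7, 1), "de Main_Page 50 7700"),
    (date(2016, 7, 2), "en Main_Page 1200 184800"),
    (date(2015, 7, 1), "en Main_Page 900 138600"),
]
partial_dump = list(
    filter_dump(sample, "en", "Main", date(2016, 1, 1), date(2016, 12, 31))
)
# partial_dump keeps only the two 2016 "en" rows
```

Run server-side over the full dumps, this trades CPU time for storage: the client downloads only the filtered slice instead of the whole archive.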
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
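[Aside on Amir's use case: comparing a month to the same month a year earlier can be done against the public Pageview API's per-article REST endpoint. The sketch below only builds the two monthly query URLs — no request is made — and the project, article, and timestamp-window conventions are illustrative assumptions, not a definitive client.]

```python
# Sketch of a "same month, previous year" comparison using the Pageview
# API's per-article endpoint. We construct two monthly URLs to query; the
# caller would fetch each and compare view counts.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def monthly_url(project, article, year, month):
    # Timestamps here are YYYYMMDDHH; first of the month through the first
    # of the next month is assumed to cover one calendar month at monthly
    # granularity.
    start = f"{year}{month:02d}0100"
    nxt_y, nxt_m = (year + 1, 1) if month == 12 else (year, month + 1)
    end = f"{nxt_y}{nxt_m:02d}0100"
    return f"{BASE}/{project}/all-access/all-agents/{article}/monthly/{start}/{end}"

july_2016 = monthly_url("en.wikipedia", "Main_Page", 2016, 7)
july_2015 = monthly_url("en.wikipedia", "Main_Page", 2015, 7)
```

With only 18 months retained, the `july_2015` query is exactly the kind of request that would fall off the API and have to be answered from dumps instead.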
