Just curious -- how much would it cost to make all of the data available at a daily granularity for a year?
On Fri, Jul 29, 2016 at 4:30 PM, Jonathan Morgan <[email protected]> wrote:

> Hi Dan,
>
> Making dumps much easier to use would definitely help. We Wikipedia
> researchers are kind of spoiled: we have easy public access to historical
> revision data for all projects, going back to 2001, through the API *and*
> public db endpoints like Quarry. It's only natural that we want the same
> thing with pageviews!!! :)
>
> I can think of other use cases for keeping more than 18 months of data
> available through the API, but they're all research use cases. I don't
> think having lower-granularity historical data available beyond a certain
> point is helpful for those -- if you're doing historical analysis, you
> want consistency. But an application that parsed dumps on the server side
> to yield historical data (ideally in a format and granularity that wasn't
> fundamentally different from that of the API, so you could join the
> streams) would definitely be useful, and would probably address most
> research needs I can think of, inside and outside the Foundation.
>
> Thanks for asking,
> Jonathan
>
> On Fri, Jul 29, 2016 at 12:27 PM, Dan Andreescu <[email protected]> wrote:
>
>> Amir and Jonathan - thanks for speaking up for the "more than 18 months"
>> use cases. If dumps were *much* easier to use (via python clients that
>> made it transparent whether you were hitting the API or not), would that
>> be an acceptable solution? I feel like both of your use cases are not
>> things that will be happening on a daily basis. If that's true, another
>> solution would be an ad-hoc API that took in a filter and a date range,
>> applied it server-side, and gave you a partial dump with only the
>> interesting data. If this didn't happen very often, it would allow us to
>> spend processing time and a bit of dev time instead of paying for more
>> expensive storage.
>>
>> Or, if we end up needing frequent access to old data, we should be able
>> to justify spending more money on more servers. Just trying to save as
>> much money as possible :)
>>
>> Thanks to all so far; please feel free to keep chiming in if you have
>> other use cases that haven't been covered, or if you'd like to add more
>> weight behind the "more than 18 months" use cases.
>>
>> On Fri, Jul 29, 2016 at 3:18 PM, Leila Zia <[email protected]> wrote:
>>
>>> Dan, thanks for reaching out.
>>>
>>> 18 months is enough for my use cases as long as the dumps capture the
>>> exact data structure.
>>>
>>> Best,
>>> Leila
>>>
>>> --
>>> Leila Zia
>>> Senior Research Scientist
>>> Wikimedia Foundation
>>>
>>> On Fri, Jul 29, 2016 at 11:51 AM, Amir E. Aharoni
>>> <[email protected]> wrote:
>>>
>>>> I am now checking traffic data every day to see whether Compact
>>>> Language Links affect it. It makes sense to compare it not only to the
>>>> previous week, but also to the same month of the previous year, so one
>>>> year is hardly enough. 18 months is better, and three years is much
>>>> better, because I'll also be able to check the same month in earlier
>>>> years.
>>>>
>>>> I imagine that this may be useful to all product managers who work on
>>>> features that can affect traffic.
>>>>
>>>> On Fri, Jul 29, 2016 at 3:41 PM, Dan Andreescu <[email protected]> wrote:
>>>>
>>>>> Dear Pageview API consumers,
>>>>>
>>>>> We would like to plan storage capacity for our pageview API cluster.
>>>>> Right now, with a reliable RAID setup, we can keep *18 months* of
>>>>> data. If you'd like to query further back than that, you can download
>>>>> dump files (which we'll make easier to use with python utilities).
>>>>>
>>>>> What do you think? Will you need more than 18 months of data? If so,
>>>>> we need to add more nodes when we get to that point, and that costs
>>>>> money, so we want to check whether there is a real need for it.
>>>>>
>>>>> Another option is to start degrading the resolution for older data
>>>>> (for example, only keep weekly or monthly data for anything older
>>>>> than one year). If you need more than 18 months, we'd love to hear
>>>>> your use case, along with something in the form of:
>>>>>
>>>>> need daily resolution for 1 year
>>>>> need weekly resolution for 2 years
>>>>> need monthly resolution for 3 years
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Dan
>
> --
> Jonathan T. Morgan
> Senior Design Researcher
> Wikimedia Foundation
> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
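To make the year-over-year comparison described above concrete, here is a
minimal sketch in Python against the Pageview API's project-level aggregate
endpoint, using the requests library. The endpoint path and response shape
are assumed to match the public REST API (/metrics/pageviews/aggregate/...),
and the project, months, and helper name are only illustrative; the point is
that the older request only succeeds if daily data is retained well past
twelve months.

    import requests

    # Public Pageview API, project-level daily totals (assumed endpoint path).
    BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"
    HEADERS = {"User-Agent": "pageview-retention-sketch/0.1 (example only)"}

    def monthly_total(project, start, end):
        """Sum daily views between start and end (timestamps like YYYYMMDD00)."""
        url = f"{BASE}/{project}/all-access/user/daily/{start}/{end}"
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        return sum(item["views"] for item in resp.json()["items"])

    # Compare a month with the same month one year earlier (illustrative dates).
    # If the earlier month has aged out of the API's retention window, the
    # second call fails -- which is exactly the concern raised in this thread.
    recent = monthly_total("en.wikipedia", "2017060100", "2017063000")
    year_ago = monthly_total("en.wikipedia", "2016060100", "2016063000")
    print(f"June 2017: {recent:,} views; June 2016: {year_ago:,} views; "
          f"change: {(recent - year_ago) / year_ago:+.1%}")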
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
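For the "degrading resolution" option Dan floats above, the roll-up itself is
simple: collapse the daily rows into weekly or monthly totals before the
dailies are dropped. A minimal sketch in Python, assuming records shaped like
the API's items (a YYYYMMDD00-style timestamp plus an integer view count);
the function name and the sample values are made up for illustration.

    from collections import defaultdict

    def degrade_to_monthly(daily_items):
        """Collapse daily pageview records into monthly totals.

        daily_items: iterable of dicts with a "timestamp" like "2016063000"
        (YYYYMMDDHH) and an integer "views" count. The result is what would
        remain for older data if only monthly resolution were kept.
        """
        monthly = defaultdict(int)
        for item in daily_items:
            monthly[item["timestamp"][:6]] += item["views"]  # YYYYMMDDHH -> YYYYMM
        return dict(monthly)

    # Dummy values, just to show the shape of the roll-up.
    sample = [
        {"timestamp": "2016063000", "views": 100},
        {"timestamp": "2016070100", "views": 200},
        {"timestamp": "2016070200", "views": 300},
    ]
    print(degrade_to_monthly(sample))  # {'201606': 100, '201607': 500}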
