Hi Dan,

Making dumps much easier to use would definitely help. We Wikipedia
researchers are kind of spoiled: we have easy public access to historical
revision data for all projects, going back to 2001, through the API *and*
public db endpoints like Quarry. It's only natural that we want the same
thing with pageviews!!! :)

I can think of other use-cases for keeping more than 18 months of data
available through the API, but they're all research use cases. I don't
think having lower-granularity historical data available beyond a certain
point is helpful for those: if you're doing historical analysis, you want
consistency. But an application that parsed dumps on the server side to
yield historical data (ideally in a format and granularity that wasn't
fundamentally different from that of the API, so you could join the
streams) would definitely be useful, and would probably address most
research needs I can think of, inside and outside the Foundation.
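
To make that concrete, here is a hypothetical sketch (not an existing tool; all
names and data are illustrative) of what "not fundamentally different" could
mean: normalize records from both sources, recent data from the Pageview API
and older data from parsed dumps, into one schema, so the two streams can be
concatenated and queried uniformly:

```python
from datetime import date

def normalize_api_item(item):
    """Map a Pageview-API-style item to a common (date, views) record."""
    # API timestamps look like 'YYYYMMDDHH'; keep daily granularity.
    ts = item["timestamp"]
    return (date(int(ts[:4]), int(ts[4:6]), int(ts[6:8])), item["views"])

def normalize_dump_row(row):
    """Map a parsed dump row ('YYYY-MM-DD<TAB>count') to the same record."""
    day, count = row.split("\t")
    y, m, d = map(int, day.split("-"))
    return (date(y, m, d), int(count))

def merged_series(api_items, dump_rows):
    """Join both streams into one date-sorted series."""
    records = [normalize_api_item(i) for i in api_items]
    records += [normalize_dump_row(r) for r in dump_rows]
    return sorted(records)

# Toy data standing in for a real API response and a parsed dump line.
api_items = [{"timestamp": "2016072900", "views": 120}]
dump_rows = ["2015-07-29\t95"]
print(merged_series(api_items, dump_rows))
```

With something like this in place, whether a given day came from the API or
from a dump would be invisible to the analysis code downstream.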

Thanks for asking,
Jonathan

On Fri, Jul 29, 2016 at 12:27 PM, Dan Andreescu <[email protected]>
wrote:

> Amir and Jonathan - thanks for speaking up for the "more than 18 months"
> use cases.  If dumps were *much* easier to use (via python clients that
> made it transparent whether you were hitting the API or not), would that be
> an acceptable solution?  I feel like both of your use cases are not things
> that will be happening on a daily basis.  If that's true, another solution
> would be an ad-hoc API that took in a filter and a date range, applied it
> server-side, and gave you a partial dump with only the interesting data.
> If this didn't happen very often, it would allow us to trade processing
> time and a bit of dev time for more expensive storage.
>
> Or, if we end up needing frequent access to old data, we should be able to
> justify spending more money on more servers.  Just trying to save as much
> money as possible :)
>
> Thanks all so far, please feel free to keep chiming in if you have other
> use cases that haven't been covered, or if you'd like to add more weight
> behind the "more than 18 months" use cases.
>
> On Fri, Jul 29, 2016 at 3:18 PM, Leila Zia <[email protected]> wrote:
>
>> Dan, Thanks for reaching out.
>>
>> 18 months is enough for my use cases as long as the dumps capture the
>> exact data structure.
>>
>> Best,
>> Leila
>>
>> --
>> Leila Zia
>> Senior Research Scientist
>> Wikimedia Foundation
>>
>> On Fri, Jul 29, 2016 at 11:51 AM, Amir E. Aharoni <
>> [email protected]> wrote:
>>
>>> I am now checking traffic data every day to see whether Compact Language
>>> Links affect it. It makes sense to compare the numbers not only to the
>>> previous week, but also to the same month in the previous year. So one year
>>> is hardly enough. 18 months is better, and three years is much better,
>>> because I'll be able to check the same month in earlier years as well.
>>>
>>> I imagine that this may be useful to all product managers who work on
>>> features that can affect traffic.
>>>
>>> On 29 July 2016 at 15:41, "Dan Andreescu" <[email protected]>
>>> wrote:
>>>
>>>> Dear Pageview API consumers,
>>>>
>>>> We would like to plan storage capacity for our pageview API cluster.
>>>> Right now, with a reliable RAID setup, we can keep *18 months* of
>>>> data.  If you'd like to query further back than that, you can download dump
>>>> files (which we'll make easier to use with python utilities).
>>>>
>>>> What do you think?  Will you need more than 18 months of data?  If so,
>>>> we need to add more nodes when we get to that point, and that costs money,
>>>> so we want to check if there is a real need for it.
>>>>
>>>> Another option is to start degrading the resolution for older data (for
>>>> example, keeping only weekly or monthly resolution for data older than 1
>>>> year).  If
>>>> you need more than 18 months, we'd love to hear your use case and something
>>>> in the form of:
>>>>
>>>> need daily resolution for 1 year
>>>> need weekly resolution for 2 years
>>>> need monthly resolution for 3 years
>>>>
>>>> Thank you!
>>>>
>>>> Dan
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>
>


-- 
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>