Just curious -- how much would it cost to make all of the data available at
a daily granularity for a year?

On Fri, Jul 29, 2016 at 4:30 PM, Jonathan Morgan <[email protected]>
wrote:

> Hi Dan,
>
> Making dumps much easier to use would definitely help. We Wikipedia
> researchers are kind of spoiled: we have easy public access to historical
> revision data for all projects, going back to 2001, through the API *and*
> public db endpoints like Quarry. It's only natural that we want the same
> thing with pageviews!!! :)
>
> I can think of other use-cases for keeping more than 18 months of data
> available through the API, but they're all research use cases. I don't
> think having lower-granularity historical data available beyond a certain
> point is helpful for those--if you're doing historical analysis, you want
> consistency. But an application that parsed dumps on the server side to
> yield historical data (ideally in a format and granularity that wasn't
> fundamentally different from that of the API, so you could join the
> streams) would definitely be useful, and would probably address most
> research needs I can think of, inside and outside the Foundation.
>
> Thanks for asking,
> Jonathan
>
> On Fri, Jul 29, 2016 at 12:27 PM, Dan Andreescu <[email protected]>
> wrote:
>
>> Amir and Jonathan - thanks for speaking up for the "more than 18 months"
>> use cases.  If dumps were *much* easier to use (via python clients that
>> made it transparent whether you were hitting the API or not), would that be
>> an acceptable solution?  I feel like both of your use cases are not things
>> that will be happening on a daily basis.  If that's true, another solution
>> would be an ad-hoc API that took in a filter and a date range, applied it
>> server-side, and gave you a partial dump with only the interesting data.
>> If this didn't happen very often, it would let us spend processing
>> time and a bit of dev time instead of paying for more expensive storage.
>>
>> Or, if we end up needing frequent access to old data, we should be able
>> to justify spending more money on more servers.  Just trying to save as
>> much money as possible :)
>>
>> Thanks all so far, please feel free to keep chiming in if you have other
>> use cases that haven't been covered, or if you'd like to add more weight
>> behind the "more than 18 months" use cases.
>>
>> On Fri, Jul 29, 2016 at 3:18 PM, Leila Zia <[email protected]> wrote:
>>
>>> Dan, Thanks for reaching out.
>>>
>>> 18 months is enough for my use cases as long as the dumps capture the
>>> exact data structure.
>>>
>>> Best,
>>> Leila
>>>
>>> --
>>> Leila Zia
>>> Senior Research Scientist
>>> Wikimedia Foundation
>>>
>>> On Fri, Jul 29, 2016 at 11:51 AM, Amir E. Aharoni <
>>> [email protected]> wrote:
>>>
>>>> I am now checking traffic data every day to see whether Compact
>>>> Language Links affect it. It makes sense to compare it not only to the
>>>> previous week, but also to the same month of the previous year. So one
>>>> year is hardly enough. 18 months is better, and three years is much better
>>>> because I'll also be able to check the same month in earlier years.
>>>>
>>>> I imagine that this may be useful to all product managers that work on
>>>> features that can affect traffic.
>>>>
>>>> On 29 July 2016 at 15:41, "Dan Andreescu" <[email protected]> wrote:
>>>>
>>>>> Dear Pageview API consumers,
>>>>>
>>>>> We would like to plan storage capacity for our pageview API cluster.
>>>>> Right now, with a reliable RAID setup, we can keep *18 months* of
>>>>> data.  If you'd like to query further back than that, you can download
>>>>> dump files (which we'll make easier to use with python utilities).
>>>>>
>>>>> What do you think?  Will you need more than 18 months of data?  If so,
>>>>> we need to add more nodes when we get to that point, and that costs money,
>>>>> so we want to check if there is a real need for it.
>>>>>
>>>>> Another option is to start degrading the resolution for older data
>>>>> (only keep weekly or monthly for data older than 1 year for example).  If
>>>>> you need more than 18 months, we'd love to hear your use case and
>>>>> something in the form of:
>>>>>
>>>>> need daily resolution for 1 year
>>>>> need weekly resolution for 2 years
>>>>> need monthly resolution for 3 years
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Dan
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected]
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>>>
>>>
>>
>
>
> --
> Jonathan T. Morgan
> Senior Design Researcher
> Wikimedia Foundation
> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>
>
