[Analytics] Re: Wikimedia AQS Pageview API Data Integrity

2023-08-24 Thread Marcel Ruiz Forns
Hi Duncan,

Thank you for reaching out! And sorry for the late reply.

We don't have a periodic scheduled process that corrects or backfills the
Analytics Query Service API.
However, sometimes there are unexpected issues either in the underlying
data or in the systems that compute the API data.
When this happens, we do indeed correct the data as soon as possible.
Sometimes we manage to correct it before it reaches the API; otherwise, as
you imagined, we have to reload the corrected data to the API after the
fact.
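
In case it is useful, here is a minimal sketch (Python with the requests
library; the project, article and dates are placeholders, not tied to your
case) of re-fetching a day's counts later and spotting values that changed
after publication:

    import requests

    # Per-article daily pageviews from the Analytics Query Service (AQS).
    # Project, article and dates below are placeholders for illustration.
    URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "en.wikipedia/all-access/all-agents/Albert_Einstein/daily/20230801/20230814")
    HEADERS = {"User-Agent": "pageview-consistency-check/0.1 (example script)"}

    def fetch_daily_views(url):
        """Return a {timestamp: views} mapping for the requested range."""
        response = requests.get(url, headers=HEADERS, timeout=30)
        response.raise_for_status()
        return {item["timestamp"]: item["views"] for item in response.json()["items"]}

    first = fetch_daily_views(URL)   # store this snapshot somewhere
    # ... re-run days later ...
    second = fetch_daily_views(URL)
    changed = {ts: (first[ts], second.get(ts))
               for ts in first if first[ts] != second.get(ts)}
    print("Days whose counts changed after publication:", changed or "none")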

Cheers!

On Tue, Aug 15, 2023 at 10:28 PM Duncan Grubbs  wrote:

> Hello,
>
> After experiencing some strange behavior re-fetching pageview data, I am
> wondering if it is possible that the daily pageview count for an article
> could change *after* the data is originally published to the API.
>
> For example, if I fetch the daily pageviews on an article for the date
> 14-08-23, and then re-fetch the daily pageviews for the same article in the
> future, is it expected that the value for 14-08-23 could be different?
>
> Is there a backfill or correction process that can update daily pageview
> counts for days that are already available via the API?
>
> Any information is appreciated!
>
> Thanks,
> Duncan
>
>
> --
>
>
> Duncan Grubbs
>
> Software Engineer
>
> he/him
>
> E: dun...@predata.com 
>
> Time Zone: ET (UTC-5/-4)
>
> predata.com <https://www.predata.com/>
>
> ___
> Analytics mailing list -- analytics@lists.wikimedia.org
> To unsubscribe send an email to analytics-le...@lists.wikimedia.org
>


-- 
*Marcel Ruiz Forns** (he/him)*
Senior Software Engineer
___
Analytics mailing list -- analytics@lists.wikimedia.org
To unsubscribe send an email to analytics-le...@lists.wikimedia.org


[Analytics] Re: Wikimedia AQS Pageviews API - 2023-06-19

2023-06-20 Thread Marcel Ruiz Forns
Hi Ben,

We've had an issue with the processing of a particular hour (2023-06-19T17)
of the webrequest dataset, which is the source of the AQS Pageviews API as
well as of other derived datasets. This delayed all the dependent jobs.
The issue was fixed a couple of hours ago, though, which means the
dependent jobs should run immediately if they have not run already.
Indeed, I can already see data in AQS, for example:
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2023061900/2023062000
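
If you want to verify programmatically when a delayed day has landed,
something like the following works (a rough Python sketch; the assumption
that AQS answers 404 until the data is loaded and 200 afterwards is mine):

    import requests

    URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2023061900/2023062000")

    resp = requests.get(URL, headers={"User-Agent": "aqs-availability-check/0.1"}, timeout=30)
    if resp.status_code == 200:
        print("data available:", [item["views"] for item in resp.json()["items"]])
    else:
        # Assumption: a 404 here usually means the day has not been loaded yet.
        print("not loaded yet (HTTP %d)" % resp.status_code)
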
Sorry for the inconvenience!

Cheers



On Tue, Jun 20, 2023 at 7:37 PM Ben Smith  wrote:

> Hi all,
>
> It seems like the Wikimedia AQS Pageviews API isn't returning data for
> yesterday (2023-06-19). Is there any update on when that data will
> be available?
>
> Thanks,
> Ben
> ___
> Analytics mailing list -- analytics@lists.wikimedia.org
> To unsubscribe send an email to analytics-le...@lists.wikimedia.org
>


-- 
*Marcel Ruiz Forns** (he/him)*
Senior Software Engineer
___
Analytics mailing list -- analytics@lists.wikimedia.org
To unsubscribe send an email to analytics-le...@lists.wikimedia.org


[Analytics] Re: API Outages

2023-03-03 Thread Marcel Ruiz Forns
Hi Joshua,

Today's delay is unrelated to the problem mentioned by Andrew; it was caused
by a silent issue with our scheduling system, which we are still investigating.
In the meantime, the data is now available.
Sorry for the inconvenience!

On Fri, Mar 3, 2023 at 12:38 PM Joshua Haecker  wrote:

> Thanks for this Andrew! Sadly it appears to be down again, did we try the
> patch again? Thanks in advance for any details you can provide.
>
> On Fri, Feb 24, 2023 at 2:50 PM Andrew Otto  wrote:
>
>> Hello!
>>
>> Here is what I know.
>> - 2023-02-22T14:33 UTC  - This patch
>> <https://gerrit.wikimedia.org/r/c/operations/puppet/+/890843/3#message-dcf8af5c4e17a81ad79d16fa6cd7aad504432f87>
>> was merged
>> - Over the next several hours systemd timers (cron jobs) across a lot of
>> the fleet stopped running.
>> - 2023-02-22T17:34 UTC - This patch
>> <https://gerrit.wikimedia.org/r/c/operations/puppet/+/891340/> reverted
>> the change
>> After this jobs began to run again.  However, because the webrequest
>> dataset is so huge, it took hours for ingestion of it to catch up.
>> Downstream jobs that use webrequest as input (including pageviews
>> computation) began to timeout while waiting for input.
>> - We have been slowly restarting and recovering jobs now that webrequest
>> ingestion has caught up again.
>>
>> I don't know exactly how long data is delayed or when it will be fully
>> available, but I'd guess: soon / today?
>>
>> -Andrew Otto
>>  WMF
>>
>>
>> On Fri, Feb 24, 2023 at 7:10 AM Joshua Haecker  wrote:
>>
>>> Another day another delay, is there a known broader issue? Just trying
>>> to provide reasonable expectations of when data might be available. Thanks
>>>
>>> On Thu, Feb 23, 2023 at 9:55 AM Joshua Haecker  wrote:
>>>
>>>> Hi all,
>>>>
>>>> Just curious if there is a known cause for the multiple long delays
>>>> we've had on the AQS API data being available this week? I know periodic
>>>> delays are not uncommon but these seem beyond normal levels.
>>>>
>>>> Thanks!
>>>>
>>>> ~Josh
>>>>
>>>> --
>>> Joshua Haecker
>>> CEO, Co-Founder
>>> Predata, Inc.
>>>
>>> j...@predata.com
>>> 609-865-2181
>>> 1201 Pennsylvania Ave NW
>>> Washington, D.C.
>>>
>>> *This message and its contents are confidential. If you received this
>>> message in error, do not use or rely upon it. Instead, please inform the
>>> sender and then delete it. Thank you.*
>>>
>> ___
>>> Analytics mailing list -- analytics@lists.wikimedia.org
>>> To unsubscribe send an email to analytics-le...@lists.wikimedia.org
>>>
>> ___
>> Analytics mailing list -- analytics@lists.wikimedia.org
>> To unsubscribe send an email to analytics-le...@lists.wikimedia.org
>>
> --
> Joshua Haecker
> CEO, Co-Founder
> Predata, Inc.
>
> j...@predata.com
> 609-865-2181
> 1201 Pennsylvania Ave NW
> Washington, D.C.
>
> *This message and its contents are confidential. If you received this
> message in error, do not use or rely upon it. Instead, please inform the
> sender and then delete it. Thank you.*
> ___
> Analytics mailing list -- analytics@lists.wikimedia.org
> To unsubscribe send an email to analytics-le...@lists.wikimedia.org
>


-- 
*Marcel Ruiz Forns** (he/him)*
Senior Software Engineer
___
Analytics mailing list -- analytics@lists.wikimedia.org
To unsubscribe send an email to analytics-le...@lists.wikimedia.org


[Analytics] Re: Access Wikipedia Metadata - API/Dumps/Query Replicas?

2021-09-17 Thread Marcel Ruiz Forns
Hi Cristina!

Regarding the question:

Last thing, in the pageview archive there are three types of file:
> automated, spider and user.   Am I right in understanding that "user"
> relates to pageviews operated by real persons, while "automated" and
> "spider" by programs (not sure about the difference between the two)?


Yes, "user" relates to pageviews operated by real people. "Spider"
pageviews are those generated by self-declared bots, the ones that are
identified as such in their UserAgent header (for instance web crawlers).
"Automated" pageviews are those generated by bots that are not identified
as such. They are labelled separately because we use different methods for
labelling them: the spider pageviews are identified by parsing the
UserAgent string, and the automated ones are identified with request
pattern heuristics.
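
As a toy illustration only (this is not the actual Wikimedia classification
code, and the pattern below is a deliberately simplified assumption), a
UserAgent-based "spider" check could look like this in Python:

    import re

    # Simplified stand-in for detecting self-declared bots via the UserAgent header.
    # The real pageview definition uses a much richer set of patterns, and the
    # "automated" label additionally relies on request-pattern heuristics that
    # are not reproduced here.
    SPIDER_UA_PATTERN = re.compile(r"bot|crawler|spider|https?://", re.IGNORECASE)

    def classify_agent(user_agent):
        """Very rough agent label: 'spider' if the UA self-identifies, else 'user'."""
        return "spider" if SPIDER_UA_PATTERN.search(user_agent or "") else "user"

    print(classify_agent("Mozilla/5.0 (compatible; Googlebot/2.1)"))        # spider
    print(classify_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."))  # user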

Hope this helps!


On Fri, Sep 17, 2021 at 5:47 PM Cristina Gava via Analytics <
analytics@lists.wikimedia.org> wrote:

> Hi Dan,
>
> Thanks a lot. I think I bumped into that link at some point and then I
> wasn't able to come across it again.
> There is a point that is not entirely clear to me
>
> "Thus, note that incremental downloads of these dumps may generate
> inconsistent data. Consider using EventStreams for real time updates on
> MediaWiki changes (API docs)."
>
> I am planning to retrieve updated versions of the metadata regularly. So I
> guess I have to use EventStream to access the recent changes? AFAIU there
> recent changes come from the RecentChanges table [1]. So what would be a
> proper stream of actions? For example:
>
> 1. Dowload the mediawiki_history dump once and parse it
> 2. For every new update of my data pool, access recent changes through
> event stream as per [2]
>
> Did understand this correctly?
>
> Last thing, in the pageview archive there are three types of file:
> automated, spider and user.   Am I right in understanding that "user"
> relates to pageviews operated by real persons, while "automated" and
> "spider" by programs (not sure about the difference between the two)?
>
> Cristina
>
> [1] https://www.mediawiki.org/wiki/Manual:Recentchanges_table
> [2] https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams
> ___
> Analytics mailing list -- analytics@lists.wikimedia.org
> To unsubscribe send an email to analytics-le...@lists.wikimedia.org
>


-- 
*Marcel Ruiz Forns** (he/him)*
Senior Software Engineer
___
Analytics mailing list -- analytics@lists.wikimedia.org
To unsubscribe send an email to analytics-le...@lists.wikimedia.org


[Analytics] [Data Release] Editors by Country in AQS

2020-09-22 Thread Marcel Ruiz Forns
Hi everyone!

We're announcing the *release of an API endpoint* that has been requested
for a long time: *Editors by country*.
This data set was already made public in the form of dumps last November
(see email
<https://lists.wikimedia.org/pipermail/analytics/2019-November/006702.html>).
And now it's available via the Analytics Query Service API. Check it out:

   - Get *last month*'s breakdown of *active editors* (5-99 edits) for
*Portuguese
   Wikipedia*:

   
https://wikimedia.org/api/rest_v1/metrics/editors/by-country/pt.wikipedia/5..99-edits/2020/08

   - Get *last January*'s breakdown of *very active editors* (100+ edits)
   for *English Wikipedia*:

   
https://wikimedia.org/api/rest_v1/metrics/editors/by-country/en.wikipedia/100..-edits/2020/01

Two-letter ISO country codes
<https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes> are used for
breakdowns.
For more detailed information, refer to the full documentation of the API
endpoint
<https://wikitech.wikimedia.org/wiki/Analytics/AQS/Editors_by_country>, or
the documentation of the underlying data
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors/Public>.
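
If you'd rather script it, here is a minimal Python sketch using the first
example above (the exact field names inside the response are best checked
against the endpoint documentation; I'm only assuming the usual AQS "items"
wrapper):

    import requests

    # Active editors (5-99 edits) for Portuguese Wikipedia, August 2020 (example above).
    URL = ("https://wikimedia.org/api/rest_v1/metrics/editors/by-country/"
           "pt.wikipedia/5..99-edits/2020/08")

    resp = requests.get(URL, headers={"User-Agent": "editors-by-country-example/0.1"}, timeout=30)
    resp.raise_for_status()
    for item in resp.json()["items"]:
        # Each country entry pairs a two-letter ISO code with a bucketed editor count.
        for entry in item.get("countries", []):
            print(entry)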

As a next step, we'll add the corresponding visualization to Wikistats2
<http://stats.wikimedia.org>.

Cheers!

On behalf of the Analytics team,
-- 
*Marcel Ruiz Forns** (he/him)*
Senior Software Engineer
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Computed Edit Counts vs Wikistats Edit Counts

2020-09-11 Thread Marcel Ruiz Forns
Hi Thorsten!

Did you just filter out the editors marked as bots via a userGroup?

We also filter out some editors by username, because some bots are not
marked as such via a userGroup. The regular expression we use is this one
(IIRC):
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/user/UserEventBuilder.scala#L24
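
For reference, here is a hedged Python sketch of the same idea (an
illustrative pattern only, not a copy of the regular expression in
UserEventBuilder.scala linked above):

    import re

    # Illustrative only: catch usernames that merely *look* like bots, in addition
    # to editors carrying the "bot" user group. The production pattern lives in
    # UserEventBuilder.scala and is more elaborate than this simplification.
    BOT_NAME_PATTERN = re.compile(r"bot\b", re.IGNORECASE)

    def looks_like_bot(username, user_groups):
        return "bot" in user_groups or bool(BOT_NAME_PATTERN.search(username))

    edits = [("ExampleBot", set()), ("Alice", set()), ("Maintenance script", {"bot"})]
    print([name for name, groups in edits if not looks_like_bot(name, groups)])  # ['Alice']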

Not sure that's the only source of discrepancy, but it could be! Please let
us know.

thanks!


On Fri, Sep 11, 2020 at 4:22 PM Thorsten Ruprechter 
wrote:

> Hello,
>
> I have a question about the "User edits" metric presented on Wikistats,
> and would be very grateful for advice regarding an issue we encountered.
>
> We are currently computing some edit metrics for multiple Wikipedia
> language versions. However, we realized there is some discrepancy between
> our edit count results and the ones reported on Wikistats. It seems that
> total edit counts are higher for our data, while trends for daily edits are
> also different. As an example, the French Wikipedia:
>
> Wikistats:
>
> https://stats.wikimedia.org/#/fr.wikipedia.org/contributing/user-edits/normal|line|2020-01-01~2020-05-16|page_type~content|daily
>
> Our results (see attachment):
>
>
>
> We removed all users marked as bots in the database, and excluded edits to
> talk pages, as it is done with the Wikistats edit count metric. I just now
> found this note [1]: "The original Wikistats did not count edits if the
> page they were made on was deleted. We are doing the same thing in
> Wikistats 2 for now, which means you may see metric totals shifting over
> time (as pages are deleted)."
>
> Could this be what is causing this rift, or are there other processing
> details which we have to consider to reproduce the Wikistats numbers as
> closely as possible? On a separate note - are the daily edit counts for all
> pages (including deleted articles) accessible somewhere?
>
> thanks, thorsten
>
> [1] https://meta.wikimedia.org/wiki/Research:Wikistats_metrics/Edits
>
> --
> Thorsten Ruprechter
>
> Institute of Interactive Systems and Data Science (ISDS)
> Graz University of Technology, Austria
>
> _______
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


-- 
*Marcel Ruiz Forns** (he/him)*
Senior Software Engineer
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Research-Internal] Tutorials on disk space usage for notebook/stat boxes

2020-02-18 Thread Marcel Ruiz Forns
Looks great Luca!
Handy commands...

On Tue, Feb 18, 2020 at 8:53 AM Luca Toscano  wrote:

> Hi everybody!
>
> I created the following doc:
> https://wikitech.wikimedia.org/wiki/Analytics/Tutorials/Analytics_Client_Nodes
>
> It contains two FAQ:
> - How do I ensure that there is enough space on disk before storing big
> datasets/files ?
> - How do I check the space used by my files/data on stat/notebook hosts ?
>
> Please read them and let me know if anything is not clear or missing. We
> have plenty of space on stat100X hosts, but we tend to cluster on single
> machines like stat1007 for some reason, ending up in fighting for resources.
>
> On a related note, we are going to work on unifying stat/notebook puppet
> configs in https://phabricator.wikimedia.org/T243934, so eventually all
> Analytics clients will be exactly the same.
>
> Thanks!
>
> Luca (on behalf of the Analytics team)
>
>
> ___
> Research-Internal mailing list
> research-inter...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/research-internal
>


-- 
*Marcel Ruiz Forns** (he/him)*
Analytics Developer @ Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Pageviews API missing data for some pages and dates?

2020-01-02 Thread Marcel Ruiz Forns
Hi Vipul!
Thanks for letting us know about this.
This is indeed a problem. And I think it's related to the + special
character in the title of the page.
I checked general traffic for English Wikipedia, and it looks OK to me.
But then I checked other pages with the same + character in them, and they
show the same pattern.
Their counts stop somewhere in the middle of April 24th and come back in the
middle of June 6th.
I created a task for this; we'll be prioritizing it soon.
See: https://phabricator.wikimedia.org/T241734
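
Unrelated to the fix itself, here is a small Python sketch of building
per-article URLs for titles with special characters (the helper name and its
defaults are just illustrative); getting the encoding wrong produces very
similar "not found" errors:

    from urllib.parse import quote

    def per_article_url(project, title, start, end,
                        access="all-access", agent="user", granularity="daily"):
        # Percent-encode the title so characters like "+" or "/" survive in the path;
        # safe="" also encodes "/", which would otherwise break the URL structure.
        encoded = quote(title.replace(" ", "_"), safe="")
        return ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
                "{}/{}/{}/{}/{}/{}/{}".format(project, access, agent, encoded,
                                              granularity, start, end))

    print(per_article_url("en.wikipedia", "Travel + Leisure", "20190401", "20190430"))
    # .../Travel_%2B_Leisure/daily/20190401/20190430
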
Thanks a lot!

On Wed, Jan 1, 2020 at 6:39 PM Vipul Naik  wrote:

> I was trying to get pageviews data for the Travel + Leisure Wikipedia page
> https://en.wikipedia.org/wiki/Travel_%2B_Leisure
>
> It seems like the data is missing for the month of May on desktop. In
> particular, this link returns a Not found error:
>
>
> https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/desktop/user/Travel_%2B_Leisure/daily/20190501/20190531?purge1328419450
>
> The corresponding links for April and June return data, but the last few
> days of April and the first few days of June are missing:
>
>
> https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/desktop/user/Travel_%2B_Leisure/daily/20190601/20190630?purge1595833545
>  (data
> is missing for June 1 to 5 but present June 6 onward)
>
>
> https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/desktop/user/Travel_%2B_Leisure/daily/20190401/20190430?purge1328419450
>  (data
> is missing for April 25 onward)
>
> The same is true on mobile-web.
>
> I thought it's possible the article was deleted and then reinstated, but
> the revision history doesn't suggest any changes during the time period,
> and there is no update on the talk page and nothing in the deletion log.
>
> Any ideas?
>
> I've also noticed the pageviews API occasionally omitting data for a few
> days for other queries, though a re-query usually works to fill in the
> missing data. For instance,
> https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/desktop/user/Alcohol_and_cancer/daily/20191101/20191130
>  originally
> returned no results for me but on a re-query I was able to get results.
> I'll share more information on this in a separate email if I'm able to
> reproduce.
>
> Thank you,
>
> Vipul
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


-- 
*Marcel Ruiz Forns** (he/him)*
Analytics Developer @ Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Hive and Oozie unavailable due to maintenance on Tue Jul 30th 10am CEST

2019-07-29 Thread Marcel Ruiz Forns
Hi all!

The Analytics team is planning to upgrade OpenJDK in our Hadoop cluster (
https://phabricator.wikimedia.org/T229003) tomorrow Tuesday 30th of July at
10am CEST.
Hive and Oozie will be unavailable for 10 to 15 minutes, and any ongoing
Oozie jobs or Hive (beeline) queries will be interrupted (we'll let the
outstanding ones finish, if possible).

If this would break an important job that you have running, please let us
know in the Phabricator task above or via IRC (#wikimedia-analytics).

Cheers!

Marcel (on behalf of the Analytics team)

-- 
*Marcel Ruiz Forns** (he/him)*
Analytics Developer @ Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] WMF API update

2019-05-06 Thread Marcel Ruiz Forns
Hi Celeste,
Thanks for pinging us about that!

I noticed the edits endpoint has been updated to limit the date range to
> about 367 days’ worth of data per request.

The limit on the time-range length was put in place deliberately, to avoid
very large requests.
A request for several years' worth of data can consume too much of the API
servers' capacity and block other requests.

will I just have to request sequences of the shorter date ranges?

Yes, please, sorry for the inconvenience!
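
For what it's worth, here is a rough Python sketch of splitting a long range
into sub-year chunks before calling the edits endpoint (the 367-day figure
comes from your message; the helper names are illustrative, and with monthly
granularity you may want to align chunk boundaries to month starts):

    from datetime import date, timedelta
    import requests

    MAX_DAYS = 367  # per-request limit mentioned above

    def chunk_ranges(start, end, max_days=MAX_DAYS):
        """Yield (chunk_start, chunk_end) date pairs covering [start, end]."""
        current = start
        while current <= end:
            chunk_end = min(current + timedelta(days=max_days - 1), end)
            yield current, chunk_end
            current = chunk_end + timedelta(days=1)

    def fetch_edits(project, page, start, end):
        items = []
        for s, e in chunk_ranges(start, end):
            url = ("https://wikimedia.org/api/rest_v1/metrics/edits/per-page/"
                   "{}/{}/all-editor-types/monthly/{:%Y%m%d}/{:%Y%m%d}".format(project, page, s, e))
            resp = requests.get(url, headers={"User-Agent": "edits-chunking-example/0.1"}, timeout=30)
            resp.raise_for_status()
            items.extend(resp.json()["items"])
        return items

    print(len(fetch_edits("en.wikipedia", "Europe", date(2017, 5, 2), date(2019, 5, 2))))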

Cheers!

On Sat, May 4, 2019 at 7:01 PM Celeste A Manughian-Peter <
celeste.manughian-pe...@aero.org> wrote:

> Hello!
>
>
>
> I had set up some endpoints from the wikimedia API a while back and it has
> been running smoothly in my project since then. I noticed the edits
> endpoint has been updated to limit the date range to about 367 days’ worth
> of data per request.
> For example:
>
> https://wikimedia.org/api/rest_v1/metrics/edits/per-page/en.wikipedia/Europe/all-editor-types/monthly/20170502/20190502
>
> vs.
>
>
> https://wikimedia.org/api/rest_v1/metrics/edits/per-page/en.wikipedia/Europe/all-editor-types/monthly/20180502/20190502
> <https://wikimedia.org/api/rest_v1/metrics/edits/per-page/en.wikipedia/Europe/all-editor-types/monthly/20170502/20190502>
>
>
>
> Is there still a way to get longer periods of data through this endpoint
> or will I just have to request sequences of the shorter date ranges?
>
>
>
> Thanks a bunch!
>
> Celeste
>
>
>
>
>
> Celeste Manughian-Peter
> Data Science and Artificial Intelligence Department
> The Aerospace Corporation
> 310.336.6928
>
> *celeste.manughian-pe...@aero.org *
>
>
> _______
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


-- 
*Marcel Ruiz Forns** (he/him)*
Analytics Developer @ Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Please add Chinese Wikiversity into the WikiStats database

2019-01-11 Thread Marcel Ruiz Forns
>
> Thanks for your replies! However, I have few more questions to ask for.
> Sorry!

No problem! :]


> You have mentioned that the analytics team is about to abandon the
> original WikiStats website. Is that means even if you successfully update
> the database (or pipelines), the data will still not be shown in WikiStats
> 1?

Correct. From now on we are only giving critical maintenance to WikiStats1,
so even if we update the Analytics pipeline, no new data will be available
in WikiStats1.

Secondly, what I (and the community) need is just basic statistics (of
> course, EVERY categories included in WikiStat 1 will be much better).

Understood. I would suggest that, in mid-February, you check WikiStats2 (and
the Analytics Query Service) for Chinese Wikiversity and determine which stats
are missing for you. Then you could let us know, and we would take that into
account when developing new features for WikiStats2.

Finally, if you finish the improvements, will the data dated before the
> improvements (ex. 2018-08) also be visible?

Yes, we should be able to calculate editing metrics from the beginning of
wiki-time.

Cheers!



On Fri, Jan 11, 2019 at 5:36 AM Eric Liu  wrote:

> Thanks for your replies! However, I have few more questions to ask for.
> Sorry!
>
> You have mentioned that the analytics team is about to abandon the
> original WikiStats website. Is that means even if you successfully update
> the database (or pipelines), the data will still not be shown in WikiStats
> 1?
>
> Secondly, what I (and the community) need is just basic statistics (of
> course, EVERY categories included in WikiStat 1 will be much better).
>
> Finally, if you finish the improvements, will the data dated before the
> improvements (ex. 2018-08) also be visible?
>
> Thanks for your help!
>
> Marcel Ruiz Forns wrote on Fri, Jan 11, 2019 at 02:36:
>
>> If the analytics team add the data of Chinese Wikiversity into the
>>> database (base source), will WikiStats and WikiStats 2 both get updated? If
>>> not, then what can I do to fix it?
>>
>>
>> The data is already present in the initial wiki databases, but was not
>> being pulled by the Analytics pipeline that generates stats for WikiStats2.
>> Thanks to your heads-up, we already fixed that, see:
>> https://phabricator.wikimedia.org/T213290. However, it will only reflect
>> in WikiStats2 after the next round of data loading, which will take place
>> between the 5th and 10th of next month (Feb 2019).
>>
>> (Although WikiStats 2 has more advanced interface, it’s still in
>>> development (not stable enough), and the original WikiStats website has a
>>> much simpler interface to navigate and collect raw data. Yet, I prefer to
>>> use WikiStats 1 than WikiStats 2 as the reference for statistics.)
>>
>>
>> Yes, WikiStats2 is under development and will be for a while. We're
>> adding WikiStats1 functionalities to it as time allows. Unfortunately,
>> we're not actively working on WikiStats1 new features any more, only on
>> fixing critical errors. Now, if you're looking for raw data (as opposed of
>> data visualization), the Analytics API that I mentioned in my first reply
>> (Analytics Query Service) might have what you want (next month). Also, if
>> you want to tell us exactly what data are you looking for, we might be able
>> to help you get it; or in case we don't have it available yet, it will aid
>> us in determining which features should we add next to WikiStats2 in the
>> upcoming months.
>>
>> Cheers!
>>
>> On Thu, Jan 10, 2019 at 4:28 PM Eric Liu  wrote:
>>
>>> Thanks for your initial answer. One more question please. (Orz)
>>>
>>> If the analytics team add the data of Chinese Wikiversity into the
>>> database (base source), will WikiStats and WikiStats 2 both get updated? If
>>> not, then what can I do to fix it?
>>>
>>> (Although WikiStats 2 has more advanced interface, it’s still in
>>> development (not stable enough), and the original WikiStats website has a
>>> much simpler interface to navigate and collect raw data. Yet, I prefer to
>>> use WikiStats 1 than WikiStats 2 as the reference for statistics.)
>>>
>>> Again, thanks for your precious answer! It’s really helpful for both me
>>> and the Chinese Wikiversity community.
>>>
>>> Marcel Ruiz Forns wrote on Thu, Jan 10, 2019 at 23:09:
>>>
>>>> [adding back analytics list to recipients]
>>>>
>>>> Hi Eric!
>>>>
>>>> Are WikiStats 1 and WikiStats 2’s database the same?
>>>>
>>>>
>>>> Although the initial source of data i

Re: [Analytics] Please add Chinese Wikiversity into the WikiStats database

2019-01-10 Thread Marcel Ruiz Forns
>
> If the analytics team add the data of Chinese Wikiversity into the
> database (base source), will WikiStats and WikiStats 2 both get updated? If
> not, then what can I do to fix it?


The data is already present in the initial wiki databases, but was not
being pulled by the Analytics pipeline that generates stats for WikiStats2.
Thanks to your heads-up, we have already fixed that; see:
https://phabricator.wikimedia.org/T213290. However, it will only be reflected
in WikiStats2 after the next round of data loading, which will take place
between the 5th and the 10th of next month (Feb 2019).

(Although WikiStats 2 has more advanced interface, it’s still in
> development (not stable enough), and the original WikiStats website has a
> much simpler interface to navigate and collect raw data. Yet, I prefer to
> use WikiStats 1 than WikiStats 2 as the reference for statistics.)


Yes, WikiStats2 is under development and will be for a while. We're adding
WikiStats1 functionalities to it as time allows. Unfortunately, we're no
longer actively working on new WikiStats1 features, only on fixing
critical errors. Now, if you're looking for raw data (as opposed to data
visualization), the Analytics API that I mentioned in my first reply
(Analytics Query Service) might have what you want (next month). Also, if
you want to tell us exactly what data you are looking for, we might be able
to help you get it; or, in case we don't have it available yet, it will aid
us in determining which features we should add next to WikiStats2 in the
upcoming months.

Cheers!

On Thu, Jan 10, 2019 at 4:28 PM Eric Liu  wrote:

> Thanks for your initial answer. One more question please. (Orz)
>
> If the analytics team add the data of Chinese Wikiversity into the
> database (base source), will WikiStats and WikiStats 2 both get updated? If
> not, then what can I do to fix it?
>
> (Although WikiStats 2 has more advanced interface, it’s still in
> development (not stable enough), and the original WikiStats website has a
> much simpler interface to navigate and collect raw data. Yet, I prefer to
> use WikiStats 1 than WikiStats 2 as the reference for statistics.)
>
> Again, thanks for your precious answer! It’s really helpful for both me
> and the Chinese Wikiversity community.
>
> Marcel Ruiz Forns wrote on Thu, Jan 10, 2019 at 23:09:
>
>> [adding back analytics list to recipients]
>>
>> Hi Eric!
>>
>> Are WikiStats 1 and WikiStats 2’s database the same?
>>
>>
>> Although the initial source of data is the same for both WikiStats1 and
>> Wikistats2 (the wiki databases), WikiStats1 and WikiStats2 pull data from
>> different pipelines. WikiStats1 independently computes metrics monthly and
>> stores them in static html files, which then are served as
>> stats.wikimedia.org. WikiStats2 is a serving layer on top of the
>> Analytics data pipeline. It pulls data from Analytics Query Service[1], the
>> stats API maintained by us (Analytics team). It's a public service, so you
>> can query it freely. See manuals[2][3][4]. Note that in most cases data
>> from WikiStats1 matches data from WikiStats2, but some metrics can slightly
>> differ for technical reasons.
>>
>> And, is WikiScan a part of WikiStats?
>>
>>
>> No, I think WikiScan is a completely separate tool, though it probably
>> shares the same initial source of data than the WikiStats siblings.
>>
>> [1] https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS
>> [2] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
>> [3] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
>> [4] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Unique_Devices
>>
>> On Thu, Jan 10, 2019 at 10:34 AM Eric Liu  wrote:
>>
>>> Are WikiStats 1 and WikiStats 2’s database the same? And, is WikiScan a
>>> part of WikiStats?
>>>
>>> Thanks for your help!
>>>
>>> Marcel Ruiz Forns wrote on Wed, Jan 9, 2019 at 23:34:
>>>
>>>> [Adding Eric Liu to the recipient list, because he is not yet
>>>> subscribed to the list]
>>>>
>>>> Hi Eric!
>>>>
>>>> Thank you for the heads up. We will work on fixing that.
>>>> You can follow the progress of this task here:
>>>> https://phabricator.wikimedia.org/T213290
>>>>
>>>> BTW, please subscribe to the list here, so that you messages do not get
>>>> blocked for moderation.
>>>> Also, you will be able to receive all replies to your message. Thanks!
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Jan 9, 2019 at 4:26 PM Eric Liu  wrote:
>>>>
>>>>> The Chinese Wikiversity project had been lau

Re: [Analytics] Please add Chinese Wikiversity into the WikiStats database

2019-01-10 Thread Marcel Ruiz Forns
[adding back analytics list to recipients]

Hi Eric!

Are WikiStats 1 and WikiStats 2’s database the same?


Although the initial source of data is the same for both WikiStats1 and
WikiStats2 (the wiki databases), WikiStats1 and WikiStats2 pull data from
different pipelines. WikiStats1 independently computes metrics monthly and
stores them in static HTML files, which are then served as
stats.wikimedia.org. WikiStats2 is a serving layer on top of the Analytics
data pipeline. It pulls data from the Analytics Query Service[1], the stats
API maintained by us (the Analytics team). It's a public service, so you can
query it freely. See the manuals[2][3][4]. Note that in most cases data from
WikiStats1 matches data from WikiStats2, but some metrics can differ slightly
for technical reasons.

And, is WikiScan a part of WikiStats?


No, I think WikiScan is a completely separate tool, though it probably
shares the same initial source of data as the WikiStats siblings.

[1] https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS
[2] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
[3] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
[4] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Unique_Devices

On Thu, Jan 10, 2019 at 10:34 AM Eric Liu  wrote:

> Are WikiStats 1 and WikiStats 2’s database the same? And, is WikiScan a
> part of WikiStats?
>
> Thanks for your help!
>
> Marcel Ruiz Forns wrote on Thu, Jan 10, 2019 at 23:09:
>
>> [Adding Eric Liu to the recipient list, because he is not yet subscribed
>> to the list]
>>
>> Hi Eric!
>>
>> Thank you for the heads up. We will work on fixing that.
>> You can follow the progress of this task here:
>> https://phabricator.wikimedia.org/T213290
>>
>> BTW, please subscribe to the list here, so that you messages do not get
>> blocked for moderation.
>> Also, you will be able to receive all replies to your message. Thanks!
>>
>> Cheers
>>
>> On Wed, Jan 9, 2019 at 4:26 PM Eric Liu  wrote:
>>
>>> The Chinese Wikiversity project had been launched for several months,
>>> and it already has over 700 learning resources, surpassing Swedish
>>> Wikiversity and Korean Wikiversity, which shows that the project has a
>>> stable community.
>>>
>>> However, the WikiStats website hasn’t been updated yet, which makes the
>>> community difficult to track the data.
>>>
>>> Please add Chinese Wikiversity into the WikiStats database as soon as
>>> possible. We need, and will appreciate your help.
>>>
>>>   Sincerely,
>>>Eric Liu (User:Ericliu1912) from Chinese Wikiversity
>>>
>> --
>>> 劉洺辰 敬上
>>> Sincerely, Eric Liu
>>> _______
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>
>>
>> --
>> *Marcel Ruiz Forns** (he/him)*
>> Analytics Developer @ Wikimedia Foundation
>>
> --
> 劉洺辰 敬上
> Sincerely, Eric Liu
>


-- 
*Marcel Ruiz Forns** (he/him)*
Analytics Developer @ Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Please add Chinese Wikiversity into the WikiStats database

2019-01-09 Thread Marcel Ruiz Forns
[Adding Eric Liu to the recipient list, because he is not yet subscribed to
the list]

Hi Eric!

Thank you for the heads up. We will work on fixing that.
You can follow the progress of this task here:
https://phabricator.wikimedia.org/T213290

BTW, please subscribe to the list here, so that your messages do not get
blocked for moderation.
Also, you will be able to receive all replies to your message. Thanks!

Cheers

On Wed, Jan 9, 2019 at 4:26 PM Eric Liu  wrote:

> The Chinese Wikiversity project had been launched for several months, and
> it already has over 700 learning resources, surpassing Swedish Wikiversity
> and Korean Wikiversity, which shows that the project has a stable
> community.
>
> However, the WikiStats website hasn’t been updated yet, which makes the
> community difficult to track the data.
>
> Please add Chinese Wikiversity into the WikiStats database as soon as
> possible. We need, and will appreciate your help.
>
>   Sincerely,
>Eric Liu (User:Ericliu1912) from Chinese Wikiversity
> --
> 劉洺辰 敬上
> Sincerely, Eric Liu
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


-- 
*Marcel Ruiz Forns** (he/him)*
Analytics Developer @ Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] EventLogging Hive Refine currently stalled for some Schemas

2018-11-15 Thread Marcel Ruiz Forns
>>>>>>> EchoInteraction
>>>>>>> EchoMail
>>>>>>> EditAttemptStep
>>>>>>> EditConflict
>>>>>>> EditorActivation
>>>>>>> EUCCVisit
>>>>>>> EventError
>>>>>>> FlowReplies
>>>>>>> GettingStartedRedirectImpression
>>>>>>> GuidedTourButtonClick
>>>>>>> GuidedTourExited
>>>>>>> GuidedTourExternalLinkActivation
>>>>>>> GuidedTourGuiderHidden
>>>>>>> GuidedTourGuiderImpression
>>>>>>> LandingPageImpression
>>>>>>> MediaViewer
>>>>>>> MediaWikiPingback
>>>>>>> MobileAppCategorizationAttempts
>>>>>>> MobileAppUploadAttempts
>>>>>>> MobileWebMainMenuClickTracking
>>>>>>> MobileWebSearch
>>>>>>> MobileWebUIClickTracking
>>>>>>> MobileWikiAppAppearanceSettings
>>>>>>> MobileWikiAppArticleSuggestions
>>>>>>> MobileWikiAppCreateAccount
>>>>>>> MobileWikiAppDailyStats
>>>>>>> MobileWikiAppEdit
>>>>>>> MobileWikiAppFeed
>>>>>>> MobileWikiAppFeedConfigure
>>>>>>> MobileWikiAppFindInPage
>>>>>>> MobileWikiAppInstallReferrer
>>>>>>> MobileWikiAppIntents
>>>>>>> MobileWikiAppiOSFeed
>>>>>>> MobileWikiAppiOSLoginAction
>>>>>>> MobileWikiAppiOSReadingLists
>>>>>>> MobileWikiAppiOSSearch
>>>>>>> MobileWikiAppiOSSessions
>>>>>>> MobileWikiAppiOSSettingAction
>>>>>>> MobileWikiAppiOSUserHistory
>>>>>>> MobileWikiAppLangSelect
>>>>>>> MobileWikiAppLanguageSearching
>>>>>>> MobileWikiAppLanguageSettings
>>>>>>> MobileWikiAppLinkPreview
>>>>>>> MobileWikiAppLogin
>>>>>>> MobileWikiAppMediaGallery
>>>>>>> MobileWikiAppNavMenu
>>>>>>> MobileWikiAppNotificationInteraction
>>>>>>> MobileWikiAppNotificationPreferences
>>>>>>> MobileWikiAppOfflineLibrary
>>>>>>> MobileWikiAppOnboarding
>>>>>>> MobileWikiAppOnThisDay
>>>>>>> MobileWikiAppPageScroll
>>>>>>> MobileWikiAppProtectedEditAttempt
>>>>>>> MobileWikiAppRandomizer
>>>>>>> MobileWikiAppReadingLists
>>>>>>> MobileWikiAppSavedPages
>>>>>>> MobileWikiAppSearch
>>>>>>> MobileWikiAppSessions
>>>>>>> MobileWikiAppShareAFact
>>>>>>> MobileWikiAppStuffHappens
>>>>>>> MobileWikiAppTabs
>>>>>>> MobileWikiAppToCInteraction
>>>>>>> MobileWikiAppWidgets
>>>>>>> MobileWikiAppWiktionaryPopup
>>>>>>> MultimediaViewerAttribution
>>>>>>> MultimediaViewerDimensions
>>>>>>> MultimediaViewerDuration
>>>>>>> MultimediaViewerNetworkPerformance
>>>>>>> NavigationTiming
>>>>>>> PageIssues
>>>>>>> Popups
>>>>>>> PrefUpdate
>>>>>>> Print
>>>>>>> QuickSurveyInitiation
>>>>>>> QuickSurveysResponses
>>>>>>> ReadingDepth
>>>>>>> ResourceTiming
>>>>>>> SaveTiming
>>>>>>> SearchSatisfaction
>>>>>>> SearchSatisfactionErrors
>>>>>>> ServerSideAccountCreation
>>>>>>> TemplateWizard
>>>>>>> TestSearchSatisfaction2
>>>>>>> TranslationRecommendationAPIRequests
>>>>>>> TranslationRecommendationUIRequests
>>>>>>> TranslationRecommendationUserAction
>>>>>>> TwoColConflictConflict
>>>>>>> UniversalLanguageSelector
>>>>>>> UploadWizardErrorFlowEvent
>>>>>>> UploadWizardExceptionFlowEvent
>>>>>>> UploadWizardFlowEvent
>>>>>>> UploadWizardStep
>>>>>>> UploadWizardTutorialActions
>>>>>>> UploadWizardUploadFlowEvent
>>>>>>> VirtualPageView
>>>>>>> WikidataCompletionSearchClicks
>>>>>>> WikimediaBlogVisit
>>>>>>> WikipediaPortal
>>>>>>> WikipediaZeroUsage
>>>>>>> WMDEBannerEvents
>>>>>>> WMDEBannerSizeIssue
>>>>>>> ___
>>>>>>> Analytics mailing list
>>>>>>> Analytics@lists.wikimedia.org
>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>
>>>>>> ___
>>>>>> Analytics mailing list
>>>>>> Analytics@lists.wikimedia.org
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Tilman Bayer
>>>> Senior Analyst
>>>> Wikimedia Foundation
>>>> IRC (Freenode): HaeB
>>>>
>>>
>>>
>>> --
>>> Tilman Bayer
>>> Senior Analyst
>>> Wikimedia Foundation
>>> IRC (Freenode): HaeB
>>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Question about the "Page Views" tool

2018-03-13 Thread Marcel Ruiz Forns
>
> Ok, so there isn't really a way to just install and use the WMF tool?

I don't think so, not without the work suggested in previous emails in this
thread.
Sorry

On Wed, Mar 7, 2018 at 4:57 PM, Reception123 . <
utilizator.receptie...@gmail.com> wrote:

> Ok, so there isn't really a way to just install and use the WMF tool?
>
> Reception123
> System Administrator (Operations),
> Miraheze
>
> On 7 March 2018 at 16:56, Federico Leva (Nemo) <nemow...@gmail.com> wrote:
>
>> Reception123 ., 06/03/2018 08:25:
>>
>>> I was wondering how one could install and use the "Page Views" tool that
>>> Wikimedia uses, on a non-WMF wiki.
>>>
>>
>> I guess you could rebuild the entire cache and analytics clusters from
>> puppet (supposedly documented somewhere around <
>> https://wikitech.wikimedia.org/wiki/Analytics>), or write something from
>> scratch that would expose data with the same API format.
>>
>> Federico
>>
>
>
> ___________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Question about the "Page Views" tool

2018-03-07 Thread Marcel Ruiz Forns
Hi Reception123,

I assume you're talking about this tool: https://tools.wmflabs.org/pageviews.
It's an open-source project hosted at
https://github.com/MusikAnimal/pageviews.
However, it uses the Analytics Query Service[1] as its data source, which, as
Nemo indicates, is populated by a pipeline of complex systems.

I'm not sure there's a way to make it work for a non-WMF wiki,
unless you create your own statistics data source and make the Pageviews
tool consume it.

Maybe someone has an idea? But I'm pessimistic :/

[1] https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS

On Tue, Mar 6, 2018 at 7:25 AM, Reception123 . <
utilizator.receptie...@gmail.com> wrote:

> Hello,
>
> I was wondering how one could install and use the "Page Views" tool that
> Wikimedia uses, on a non-WMF wiki.
>
> Reception123
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] PageView

2018-03-02 Thread Marcel Ruiz Forns
Sorry, forwarding to Analytics...

Hi Angelina,

I don't think there's any (legal) way of tracking Wikipedia traffic.
All Wikipedia traffic data is protected by WMF's privacy policy[1]
and handled accordingly.

We do, however, provide public, sanitized, high-level statistics on page
views for Wikipedia in various ways (not to specific companies or
organizations, but rather to the world at large). What "Next Big Sound"
is probably doing is consuming one of those public sources, but we
don't know which one.

These are two of the main sources this company might be grabbing stats from:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
https://dumps.wikimedia.org/

Cheers!

[1] https://wikimediafoundation.org/wiki/Privacy_policy



On Fri, Mar 2, 2018 at 5:16 PM, Marcel Ruiz Forns <mfo...@wikimedia.org>
wrote:

> Hi Angelina,
>
> I don't think there's any (legal) way of tracking Wikipedia traffic.
> All Wikipedia traffic data is protected by WMF's privacy policy[1]
> and handled accordingly.
>
> We do, however, provide public sanitized high-level statistics on page
> views for Wikipedia in various ways (not to specific companies or
> organizations, but rather to the world at large). What "Next Big Sound"
> is probably doing, is consuming one of those public sources, but we
> don't know which one.
>
> These are 2 of the main sources this company might be grabbing stats from:
> https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
> https://dumps.wikimedia.org/
>
> Cheers!
>
> [1] https://wikimediafoundation.org/wiki/Privacy_policy
>
>
> On Fri, Mar 2, 2018 at 4:19 PM, Marcel Ruiz Forns <mfo...@wikimedia.org>
> wrote:
>
>> Oh, forgot the subscribe link, here:
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>> Cheers!
>>
>> On Fri, Mar 2, 2018 at 4:18 PM, Marcel Ruiz Forns <mfo...@wikimedia.org>
>> wrote:
>>
>>> Hi Angelina,
>>>
>>> I'm the administrator of this mailing-list. Just to let you know that
>>> your email was automatically filtered out by the mailing-list bot because
>>> your address is not subscribed to it. I just unblocked it, so you will
>>> receive a response shortly. However, please subscribe to send further
>>> emails to the list.
>>>
>>> Thanks!
>>>
>>>
>>> On Wed, Feb 28, 2018 at 5:04 PM, BTShasSTOLENmyHEART <
>>> zangeli...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I recently spoke with "Next Big Sound" which is a company that tracks
>>>> Wikipedia page views on certain artists. They informed me that they got
>>>> details of the views directly from Wikipedia (because I had emailed them
>>>> that the View counts mentioned on Wikipedia and Next Big Sound show a major
>>>> discrepancy). There are rumors flying about saying that the information
>>>> only gathered is from Desktop Views, in which the counts are extremely
>>>> similar. Is there any way you can confirm this as true? Or is there another
>>>> method you also count that is gathered for other companies that collect
>>>> views? I know you have no idea of what Next Big Sound is presenting to the
>>>> world wide audience, but I wanted to know if you can explain what
>>>> information is given to Next Big Sound in terms of data. Thank you
>>>>
>>>>
>>>> Sincerely,
>>>>
>>>> Angelina Zamora
>>>>
>>>> ___
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>
>>>
>>> --
>>> *Marcel Ruiz Forns*
>>> Analytics Developer
>>> Wikimedia Foundation
>>>
>>
>>
>>
>> --
>> *Marcel Ruiz Forns*
>> Analytics Developer
>> Wikimedia Foundation
>>
>
>
>
> --
> *Marcel Ruiz Forns*
> Analytics Developer
> Wikimedia Foundation
>



-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Pageview dumps lagging behind

2018-02-20 Thread Marcel Ruiz Forns
Hi John!

We send all our scheduled maintenance notifications to this mailing list.
You can subscribe here:
https://lists.wikimedia.org/mailman/listinfo/analytics

Cheers!

On Fri, Feb 16, 2018 at 5:48 PM, John Urbanik <jurba...@predata.com> wrote:

> Our team would greatly appreciate scheduled maintenance notifications for
> maintenance that would impact analytics services. Perhaps an additional
> list can be set up?
>
> On Fri, Feb 16, 2018 at 11:45 AM, Dan Andreescu <dandree...@wikimedia.org>
> wrote:
>
>> Oh, my fault, this message is from a while back.  We had to pause the
>> cluster for a few days to do a big upgrade, now everything is operational
>> and you should be seeing data and dumps usually within 24 hours of when
>> you'd expect them.  If that's not the case, either we're performing some
>> scheduled maintenance or something could be wrong (but that rarely happens
>> and we usually announce it here if it does).  Going forward, would people
>> on this list like to be notified of scheduled maintenance?  It might be
>> spam for most people so we usually don't post a message about it.
>>
>> On Fri, Feb 16, 2018 at 11:43 AM, Dan Andreescu <dandree...@wikimedia.org
>> > wrote:
>>
>>> Hi, how are you deducing that, I show files up to 2018-02-16 14:00:00
>>> (UTC) which is very up to date, only a few hours ago.
>>>
>>> On Sun, Feb 11, 2018 at 4:57 AM, Spinner Cat <pogf...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Noticed that we're not getting any new pageview dumps on
>>>> https://dumps.wikimedia.org/other/pageviews/2018/2018-02/ since Feb
>>>> 9th 17:08 UTC. Is this a known issue and when might we expect it to be
>>>> resolved and files to catch up again?
>>>>
>>>> Thanks!
>>>>
>>>> ___
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>
>>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
>
> --
>
> *JOHN URBANIK*
> Lead Data Engineer
>
> jurba...@predata.com
> 860.878.1010 <(860)%20878-1010>
> 379 West Broadway
> New York, NY 10012
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Undocumented project code in pagecounts-ez

2017-11-14 Thread Marcel Ruiz Forns
Hi Michael!

Yes, the ".m" code can stand for either being a *.mediawiki.org project or
for being a mobile wiki (you can separate both cases).
See the docs here:
https://wikitech.wikimedia.org/wiki/Analytics/Archive/Data/Pagecounts-all-sites#Disambiguating_abbreviations_ending_in_.E2.80.9C.m.E2.80.9D
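
In case it helps other readers of those files, here is a best-effort Python
sketch of splitting a wiki code into language, project family and mobile
flag, using the project letters listed in the message below; note (per the
docs linked above) that ".m" is ambiguous between mobile sites and
*.mediawiki.org, so treat the mapping as an assumption:

    # Project letter codes described below: b=wikibooks, k=wiktionary, n=wikinews,
    # o=wikivoyage, q=wikiquote, s=wikisource, v=wikiversity, z=wikipedia.
    PROJECTS = {"b": "wikibooks", "k": "wiktionary", "n": "wikinews", "o": "wikivoyage",
                "q": "wikiquote", "s": "wikisource", "v": "wikiversity", "z": "wikipedia"}

    def parse_wiki_code(code):
        """Best-effort split of a pagecounts-ez wiki code such as 'en', 'en.m' or 'fr.b'."""
        parts = code.split(".")
        language = parts[0]
        # Assumption: a ".m" component marks a mobile site; per the documentation it
        # can also denote *.mediawiki.org, which this sketch does not disambiguate.
        mobile = "m" in parts[1:]
        letters = [p for p in parts[1:] if p != "m"]
        project = PROJECTS.get(letters[0], "wikipedia") if letters else "wikipedia"
        return {"language": language, "project": project, "mobile": mobile}

    print(parse_wiki_code("en.m"))  # {'language': 'en', 'project': 'wikipedia', 'mobile': True}
    print(parse_wiki_code("fr.b"))  # {'language': 'fr', 'project': 'wikibooks', 'mobile': False}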

I created a task to add some more documentation to the page you linked:
https://phabricator.wikimedia.org/T180452

Thanks a lot!

On Tue, Nov 14, 2017 at 3:43 AM, Michael Baldwin <mjbaldwi...@gmail.com>
wrote:

> Hi,
>
> I've been using the very helpful pagecount dumps described at:
>
> https://dumps.wikimedia.org/other/pagecounts-ez/
>
> And it describes:
>
> Line format:
>
> wiki code (subproject.project)
> article title
> monthly total (with interpolation when data is missing)
> hourly counts
>
> In the wiki code field, the subproject is the language code (fr, el,
> ja, etc) or meta, commons etc.
>
> The project is one of b (wikibooks), k (wiktionary), n (wikinews), o
> (wikivoyage), q (wikiquote), s (wikisource), v (wikiversity), z (wikipedia).
>
> However, I've been coming across a large number of wiki codes "en.m". The
> "m" code is undocumented. It appears to be the mobile version of Wikipedia,
> but can anyone confirm that? Should the page be updated with this
> information?
>
> Thanks,
> Michael
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Anybody know about stats.grok.se going down?

2017-08-21 Thread Marcel Ruiz Forns
>>> (available at https://wikimedia.org/api/rest_v1/)
>>>>>
>>>>> and:
>>>>>
>>>>> https://dumps.wikimedia.org/other/pagecounts-ez/
>>>>>
>>>>> On Mon, Aug 7, 2017 at 4:21 AM, Dan Garry <dga...@wikimedia.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Vipul,
>>>>>>
>>>>>> stats.grok.se is pretty much deprecated now. You ran in to one of
>>>>>> the reasons why: it's not very reliable. You should use the Pageviews
>>>>>> Analysis <https://tools.wmflabs.org/pageviews/> tool instead, which
>>>>>> was put together by MusikAnimal and Community Tech. This tool was 
>>>>>> intended
>>>>>> to replace stats.grok.se. There is documentation
>>>>>> <https://meta.wikimedia.org/wiki/Community_Tech/Pageview_stats_tool> 
>>>>>> about
>>>>>> the tool that you may wish to read.
>>>>>>
>>>>>> Thanks,
>>>>>> Dan
>>>>>>
>>>>>> On 7 August 2017 at 06:34, Vipul Naik <vipulna...@gmail.com> wrote:
>>>>>>
>>>>>>> stats.grok.se (a source of pageview stats for the time before the
>>>>>>> Wikimedia API became available) has been down for about a week. I tried
>>>>>>> emailing Henrik Abelsson, whom I've previously contacted when the site 
>>>>>>> had
>>>>>>> issues, but haven't received a response this time.
>>>>>>>
>>>>>>> Any ideas on why it's down and whom to reach out to to help resolve
>>>>>>> the issue?
>>>>>>>
>>>>>>> Vipul
>>>>>>>
>>>>>>> ___
>>>>>>> Analytics mailing list
>>>>>>> Analytics@lists.wikimedia.org
>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dan Garry
>>>>>> Senior Product Manager, Editing
>>>>>> Wikimedia Foundation
>>>>>>
>>>>>> ___
>>>>>> Analytics mailing list
>>>>>> Analytics@lists.wikimedia.org
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>
>>>>>>
>>>>>
>>>>> ___
>>>>> Analytics mailing list
>>>>> Analytics@lists.wikimedia.org
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>>>
>>>> ___
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Request for analytics data

2017-03-06 Thread Marcel Ruiz Forns
Hi Jörg, :]

Do you mean top 250K most viewed *articles* in de.wikipedia.org?

If so, you can indeed get that from the dumps. You can find 2016
hourly pageview stats by article for all wikis here:
https://dumps.wikimedia.org/other/pageviews/2016/

Note that the wiki codes (first column) you're interested in are: *de*,
*de.m* and *de.zero*.
The third column holds the number of pageviews you're after.
Also, this data set does not include bot traffic as recognized by the pageview
definition <https://meta.wikimedia.org/wiki/Research:Page_view>.
As files are hourly and contain data for all wikis, you'll need some
aggregation and filtering.

Cheers!

On Mon, Mar 6, 2017 at 2:59 AM, Jörg Jung <joerg.j...@retevastum.de> wrote:

> Ladies, gents,
>
> for a project i plan i'd need the following data:
>
> Top 250K sites for 2016 in project de.wikipedia.org, user-access.
>
> I only need the name of the site and the corrsponding number of
> user-accesses (all channels) for 2016 (sum over the year).
>
> As far as i can see i can't get that data via REST or by aggegating dumps.
>
> So i'd like to ask here, if someone likes to helpout.
>
> Thanx, cheers, JJ
>
> --
> Jörg Jung, Dipl. Inf. (FH)
> Hasendriesch 2
> D-53639 Königswinter
> E-Mail: joerg.j...@retevastum.de
> Web:www.retevastum.de
> www.datengraphie.de
> www.digitaletat.de
> www.olfaktum.de
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>



-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] private learning (collaboration) project

2016-12-20 Thread Marcel Ruiz Forns
Hi Alexander!

This indeed seems like an interesting project. Responding to your
suggestions:

First, I am ready to collaborate with you on making this data available as
> other researchers have done in the past. I would appreciate if you let me
> know which steps I need to take in order to work with you on this task.


I'd suggest you apply for a research project here[1]. The research team
will discuss the project with you. And if it gets approved, you can sign
an NDA and have access to the raw data. You can also apply for a grant
here[2].


> Second, you can consider making this data available after achieving the
> necessary level of confidentiality. For example, you can group request
> types so that each group has at least 1000 unique IP-addresses.


There are a couple of tasks[3] in our backlog about effectively anonymizing
the pageview data for general-purpose use. We used an algorithm similar to
the one you proposed. Our experience, though, is that anonymization (for
general-purpose use) is a non-trivial task. We plan to work on this in the
mid-term (actually, we have already started to work on it, see the tasks),
but we have other priorities for the next quarter. I'd suggest again that
you apply for a specific project for the needs of your study here[1][2].

Another challenge, I guess, would be categorizing the articles as
educational or entertainment. The categories in Wikipedia are a cool way to
browse, but not an exact way of clustering content. And I guess the
boundary between educational and entertainment can sometimes be fuzzy, no?
A very interesting challenge anyway.

cheers!

[1] https://meta.wikimedia.org/wiki/Research:New_project
[2] https://meta.wikimedia.org/wiki/Grants:Project
[3]
https://phabricator.wikimedia.org/T114675
https://phabricator.wikimedia.org/T118839
https://phabricator.wikimedia.org/T118838
https://phabricator.wikimedia.org/T118841


On Wed, Dec 14, 2016 at 5:02 PM, Alexander Ugarov <auga...@email.uark.edu>
wrote:

> Dear members of the Analytics Team!
>
> Please, consider my request for information or collaboration. I am
> conducting the research project on the international determinants of
> education quality. In my view, Wikimedia statistics is the priceless
> resource of information on how much learning people do outside of
> educational institutions.
>
> I would like to access the data on Wikipedia pageviews by country,
> language and content area to measure the private learning in different
> countries. My previous empirical results suggest that Wikipedia pageviews
> are highly correlated with education quality. Unfortunately, the available
> data does not allow to separate the educational pageviews from the pure
> entertainment pageviews (for example, celebrities biographies).
>
> I am aware that the data currently is not the part of the publicly
> available dataset. Please, consider two options. First, I am ready to
> collaborate with you on making this data available as other researchers
> have done in the past. I would appreciate if you let me know which steps I
> need to take in order to work with you on this task. Second, you can
> consider making this data available after achieving the necessary level of
> confidentiality. For example, you can group request types so that each
> group has at least 1000 unique IP-addresses.
>
> I am looking forward to hear from you on my opportunities to use this
> data. I think that it is going to be very interesting to know how much
> people learn from Wikipedia, for example, in India versus Brazil and Egypt.
> Do people in Indonesia learn less than people in Germany due to poor
> quality school systems or low private incentives for learning? I am also
> sure that many social scientists will also benefit from using such
> information (if you make it available) and will produce some
> policy-relevant research.
>
>
> Best regards,
> Alexander Ugarov,
> Ph. D. Candidate.
> Sam M. Walton College of Business
> Department of Economics
> University of Arkansas
> Office: ECOB260
> E-mail: auga...@uark.edu.
>
> _______
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Making Charts More Interactive

2016-11-16 Thread Marcel Ruiz Forns
Dear Dhaya,

Thanks for your comments!

The general legibility of Charts in wikipedia are relatively poor.
> We can improve it with making them more interactive and dynamic.


I agree with you that there is room for improvement when it comes to
visualizations in Wikipedia.
Actually, "Handling wiki content beyond plaintext" (which includes graphs)
is one of the hot topics of the MediaWiki Developer Summit[1] in January
2017.
Also, there's the awesome Graph extension[2], which lets you add interactive,
dynamic visualizations to wiki pages.

Cheers!

[1] https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit
[2] https://www.mediawiki.org/wiki/Extension:Graph


On Wed, Nov 2, 2016 at 6:53 PM, dhayakar marur <dm.ma...@gmail.com> wrote:

> Dear Analytics team,
>
> The general legibility of Charts in wikipedia are relatively poor.
> We can improve it with making them more interactive and dynamic.
> Please refer to the Chart in the attachment (Boloid Events.jpg).
>
> The chart represents the distribution of Bolide events from 1994-2013 on
> the world map.
> The legend describe the magnitude of each event in Joules.
> From the chart can you count the number of 10GJ Bolide events in Africa?
> You can count, but we take an awfully long time to find the answer.
>
> If we were to make the legend Interactive and the world map dynamic, we
> can improve legibility.
> We should making all the values (1 GJ, 10GJ etc) in the legend as
> clickable buttons.
> On clicking say 10kJ the World Map should show Boloid Events of 10GJ
> magnitude and remove the rest. This will make it easier to answer my
> earlier question.
>
> Regards
> Dhaya
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] High number of pageviews on page with single hyphen as title

2016-11-16 Thread Marcel Ruiz Forns
Maybe the high value in October (45M) has something to do with the latest
changes in https://phabricator.wikimedia.org/T145922?
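
For anyone who wants to reproduce the numbers discussed below, this is
roughly how the per-article query can be done (just a sketch; the contact
address in the User-Agent is a placeholder):

import requests

# "-" is the literal placeholder title used when no page title could be
# extracted from the request URL (see Joseph's explanation further down).
url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/desktop/user/-/daily/20151001/20151031")
resp = requests.get(url, headers={"User-Agent": "example-script (contact: you@example.org)"})
resp.raise_for_status()
print(sum(item["views"] for item in resp.json()["items"]))  # ~5.4M for October 2015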

On Mon, Nov 14, 2016 at 9:25 PM, Nuria Ruiz <nu...@wikimedia.org> wrote:

> This is documented now here:
>
> https://wikitech.wikimedia.org/wiki/Analytics/PageviewAPI#Gotchas
>
> On Tue, Nov 8, 2016 at 7:25 AM, Vipul Naik <vipulna...@gmail.com> wrote:
>
>> Hi Joseph,
>>
>> Thanks for the clarification.
>>
>> Any ideas why this number is much higher for some months? In particular,
>> on desktop, it's high in the months of July to September 2015 (around 10
>> million, compared to the usual 5 million) and then high again in October
>> 2016 (45 million, about 10x the usual value).
>>
>> Data is from http://wikipediaviews.org/displayviewsformultiplemonths
>> .php?page=-=allmonths=all which summarizes results
>> from the Wikimedia API (and stats.grok.se for data before July 2015).
>>
>> Vipul
>>
>> On Tue, Nov 8, 2016 at 3:46 AM, Joseph Allemandou <
>> jalleman...@wikimedia.org> wrote:
>>
>>> Hello Issa,
>>>
>>> Thank you for your question.
>>> The very high number of views of the "-" page is explained by this dash
>>> value being used as a special value for "no page title found" when
>>> extracting titles from urls.
>>> We definitely should document this in the API, creating this task:
>>> https://phabricator.wikimedia.org/T150249
>>> Best
>>> Joseph
>>>
>>>
>>> On Tue, Nov 8, 2016 at 12:28 AM, Issa Rice <ricei...@gmail.com> wrote:
>>>
>>>> Dear Analytics Mailing List,
>>>>
>>>> Recently while querying pageviews of various pages, I discovered that
>>>> the page whose title is a single hyphen character (i.e. with the title
>>>> "-", with URL <https://en.wikipedia.org/wiki/->, which redirects to
>>>> <https://en.wikipedia.org/wiki/Hyphen-minus>) receives an unusually
>>>> high
>>>> number of pageviews under the Pageview API. Taking October 2015 as an
>>>> example, the page received 5.4 million pageviews during that month
>>>> according to the API:
>>>> <https://wikimedia.org/api/rest_v1/metrics/pageviews/per-art
>>>> icle/en.wikipedia/desktop/user/-/daily/20151001/20151031>.
>>>>
>>>> However, according the stats.grok.se (which was still operational in
>>>> the
>>>> same month), the page received only 1209 pageviews:
>>>> <http://stats.grok.se/en/201510/->.
>>>>
>>>> Looking at the tabulation of pageviews on Wikipedia Views, the increase
>>>> in pageviews for this page coincides with the change to the Pageview
>>>> API in July 2015:
>>>> <http://wikipediaviews.org/displayviewsformultiplemonths.php
>>>> ?page=-=allmonths=all>.
>>>>
>>>> As I understand, page titles must be URL-encoded before the query,
>>>> but the URL-encoding of "-" is itself.
>>>>
>>>> I looked at the API documentation but did not see this behavior listed,
>>>> so I am wondering where these numbers are coming from.
>>>>
>>>> Best regards,
>>>> Issa
>>>>
>>>>
>>>> ___
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>
>>>
>>> --
>>> *Joseph Allemandou*
>>> Data Engineer @ Wikimedia Foundation
>>> IRC: joal
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] ensuring reader anonymity

2016-11-11 Thread Marcel Ruiz Forns
Hi Pine,

I thought that was specified in either the Privacy Policy or Terms of Use
> but I can't find the specific reference, and that bothers me.


This is specified in the data retention guidelines:
https://meta.wikimedia.org/wiki/Data_retention_guidelines

Cheers!

On Fri, Nov 11, 2016 at 4:11 PM, James Salsman <jsals...@gmail.com> wrote:

> Pine wrote:
> >
> > I tend to think that checkusers will need the plain IP addresses
>
> I am not suggesting removing the IP addresses or proxy information from
> POST requests as checkuser requires.
>
> We need to anonymize both IP addresses and proxy information with a secure
> hash if we want to keep each GET request's geolocation, to be compliant
> with the Privacy Policy. The Privacy Policy is the most prominent policy on
> the far left on the footer of every page served by every editable project,
> and says explicitly that consent is required for the use of geolocation.
> The Privacy and other policies make it clear that POST requests and Visual
> Editor submissions aren't going to be anonymized.
>
> However, geolocations for POST edit and visual editor submissions still
> require explicit consent which we have no way to obtain at present.
> Editors' geolocations as they edit are very useful for research, but by the
> same token have the most serious privacy concerns. Obtaining consent to
> store geolocation seems like it would interfere with, complicate, and
> disrupt editing. If geolocation is stored with anonymized IP addresses for
> GETs but not POSTs or Visual Editor submissions, both could easily be
> recovered because of simultaneously interleaved GET and POST requests for
> the same article are unavoidable.
>
> Do we have any privacy experts on staff who can give these issues a
> thorough analysis in light of all the issues raised in
> https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006 ?
>
> If Ops needs IP addresses, they should be able to use synthetic POST
> requests, as far as I can tell. If they anticipate a need for non-anonymous
> GET requests, then perhaps some kind of a debugging switch which could be
> used on a short term basis where an IP range or mask could be entered to
> allow matching addresses to log non-anonymously before expiring in an hour
> would solve any anticipated need?
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Parsing user agents in EventLogging data

2016-09-15 Thread Marcel Ruiz Forns
Just a heads up:

The user_agent field is a PII field (privacy sensitive), and as such it is
purged after 90 days. If there were a user_agent_map field, it would have to
be purged after 90 days as well.

Another, more permanent option might be to detect the browser family on the
JavaScript client with e.g. duck-typing[1] and send it as part of the
explicit schema. The browser family by itself is not identifying enough to
be considered PII, and could be kept indefinitely.

[1]
http://stackoverflow.com/questions/9847580/how-to-detect-safari-chrome-ie-firefox-and-opera-browser
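
Also, in case it is useful: the ua-parser library mentioned below (the one
the Hive UDF wraps) also ships as a Python package, so the browser family
can be derived from a raw user agent before the raw string is purged. A
minimal sketch, assuming the ua-parser package from PyPI:

from ua_parser import user_agent_parser  # pip install ua-parser

def browser_family(ua_string):
    """Keep only the browser family (e.g. 'Chrome', 'Firefox'),
    dropping the rest of the raw user agent string."""
    parsed = user_agent_parser.Parse(ua_string)
    return parsed["user_agent"]["family"]

print(browser_family(
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"))  # -> Chrome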

On Thu, Sep 15, 2016 at 5:40 PM, Jane Darnell <jane...@gmail.com> wrote:

> It's not just a question of which value to choose, but also how to sort.
> It would be nice to be able to choose sorting in alphabetical order vs
> numerical order. It would also be nice to assign a default sort to any item
> label that is taken from the Wikipedia {{DEFAULTSORT}} template (though
> that won't work for items without a Wikipedia article).
>
> On Thu, Sep 15, 2016 at 10:18 AM, Dan Andreescu <dandree...@wikimedia.org>
> wrote:
>
>> The problem with working on EL data in hive is that the schemas for the
>> tables can change at any point, in backwards-incompatible ways.  And
>> maintaining tables dynamically is harder here than in mysql world (where EL
>> just tries to insert, and creates the table on failure).  So, while it's
>> relatively easy to use ua-parser (see below), you can't easily access EL
>> data in hive tables.  However, we do have all EL data in hadoop, so you can
>> access it with Spark.  Andrew's about to answer with more details on that.
>> I just thought this might be useful if you sqoop EL data from mysql or
>> otherwise import it into a Hive table:
>>
>>
>> from stat1002, start hive, then:
>>
>> ADD JAR /srv/deployment/analytics/refinery/artifacts/org/wikimedia/
>> analytics/refinery/refinery-hive-0.0.35.jar;
>>
>> CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refin
>> ery.hive.UAParserUDF';
>>
>> select ua_parser('Wikimedia Bot');
>>
>> On Thu, Sep 15, 2016 at 1:06 AM, Federico Leva (Nemo) <nemow...@gmail.com
>> > wrote:
>>
>>> Tilman Bayer, 15/09/2016 01:21:
>>>
>>>> This came up recently with the Reading web team, for the purpose of
>>>> investigating whether certain issues are caused by certain browsers
>>>> only. But I imagine it has arisen in other places as well.
>>>>
>>>
>>> Definitely. https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitiz
>>> ation
>>>
>>> Nemo
>>>
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Wiki-research-l] question about Pageviews dumps

2016-07-01 Thread Marcel Ruiz Forns
If we were doing this internally, a possibility would be to instrument
MediaWiki and send sampled events with the time on page to EventLogging.
This would not be retroactive, though: we would have to wait a couple of
months to collect significant data. In any case, I'm not sure whether this
would be possible with an NDA?

On Fri, Jul 1, 2016 at 11:52 AM, Marc Miquel <marcmiq...@gmail.com> wrote:

> I see it is quite complicated to work with this data. It is a pity
> considering that valuable insights could be driven by readers' behaviors. I
> will think about what can be useful for the study.
>
> Thanks for the answers, Nuria and Marcel! :)
> Cheers,
>
> Marc
>
> El dj., 30 juny 2016 a les 14:16, Marcel Ruiz Forns (<mfo...@wikimedia.org>)
> va escriure:
>
>> Marc, I also see what Nuria says. Also please consider that the majority
>> of Wikipedia sessions have only one pageview. So in the majority of
>> sessions it would not be possible to approximate the time spent on page
>> with boundaries with Joseph's alternative.
>>
>> On Thu, Jun 30, 2016 at 2:02 PM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>>
>>> >Aye, as Joseph says, the time-on-page or time-leaving is not
>>> collected, except as an extension of session reconstruction work. If you
>>> want a >concrete time, you're not gonna get it.
>>>
>>> I was about to make the same point, the data set that will most closely
>>> answer your questions is the one Oliver mentioned, otherwise we do not keep
>>> any information related to time on site and page requests so there is no
>>> "approximation" possible that will work on overall data. Even if you
>>> calculate signatures with IP-hash +user agent to approximate users (a
>>> method with known issues) there is no way for you to distinguish someone
>>> reading a page for an hour and someone that came to wikipedia twice in the
>>> same hour and spent a minute each time. Hopefully my example makes things
>>> more clear.
>>>
>>> Thanks,
>>>
>>> Nuria
>>>
>>> On Wed, Jun 29, 2016 at 4:58 AM, Oliver Keyes <ironho...@gmail.com>
>>> wrote:
>>>
>>>> Aye, as Joseph says, the time-on-page or time-leaving is not collected,
>>>> except as an extension of session reconstruction work. If you want a
>>>> concrete time, you're not gonna get it.
>>>>
>>>> While PC-based data is more reliable than mobile, that does not
>>>> necessarily mean "reliable". I'm sort of confused, I guess, as to why the
>>>> datasets I linked (unless I'm misremembering them?) don't help: you would
>>>> have to do the calculation yourself but they should contain all the data
>>>> necessary to make that calculation (unless you want to have the pageID or
>>>> title associated with the time-on-page, in which case...yeah, that's an
>>>> issue).
>>>>
>>>> On Wed, Jun 29, 2016 at 3:16 AM, Marc Miquel <marcmiq...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for the answer, Oliver. But I am not sure it answers my
>>>>> questions. I'd like to study aspects like how much time is spent in
>>>>> certain pages, as a proxy of how content is approached/read/understood. 
>>>>> I'd
>>>>> be happy with time of entering the page, time of leaving. This is not
>>>>> entirely centered on 'user activity', but I said that because I imagined
>>>>> data would be stored in a similar way to editor sessions, or in a database
>>>>> and I would need to do the time calculations.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Marc
>>>>>
>>>>>
>>>>> El dc., 29 juny, 2016 03:11, Oliver Keyes <ironho...@gmail.com> va
>>>>> escriure:
>>>>>
>>>>>> If historic data is okay, there's already a dataset released (
>>>>>> https://figshare.com/articles/Activity_Sessions_datasets/1291033)
>>>>>> that was designed specifically to answer questions around how to best
>>>>>> calculate session length with regards to Wikipedia (
>>>>>> http://arxiv.org/abs/1411.2878)
>>>>>>
>>>>>> On Tue, Jun 28, 2016 at 3:42 PM, Marc Miquel <marcmiq...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> I was thinking about user sessions, yes, so t

Re: [Analytics] analytics-store unscheduled maintenance

2016-05-27 Thread Marcel Ruiz Forns
Thanks Jaime!

On Thu, May 26, 2016 at 8:14 PM, Jaime Crespo <jcre...@wikimedia.org> wrote:

> The server seems to be back in a relatively good state; however it
> will be behind in replication both for s* shards (wiki data) and the
> eventlogging database; I would suggest to wait for a day if your data
> needs fresh results. We will be monitoring that this happens
> correctly.
>
> You can follow up pending fixes and the latest updates on the ticket:
> https://phabricator.wikimedia.org/T136333
>
> Regards,
>
> On Thu, May 26, 2016 at 7:31 PM, Jaime Crespo <jcre...@wikimedia.org>
> wrote:
> > Hi,all,
> >
> > a few minutes ago dbstore1002, (I think you know it better as
> > analytics-store) was forced to have an unscheduled maintenance A.K.A
> > "it crashed and I am trying to give it first aid".
> >
> > Please use db1047 (analytics-slave?) for now, if you can.
> >
> > I will follow up with a state update once I know more.
> >
> > Sorry for the inconveniences,
> > --
> > Jaime Crespo
> > <http://wikimedia.org>
>
>
>
> --
> Jaime Crespo
> <http://wikimedia.org>
>
> _______
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>



-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] statistics about user agents per page or per namespace

2016-04-29 Thread Marcel Ruiz Forns
Hi Amir,

Would it be crazy to ask for statistics of user agents per page or per
> namespace?

I'd hypothesize, for example, that IE is used much less outside of the
> article and portal namespaces.


I don't think it's crazy. I +1 your hypothesis :]
But, yes, I see it as difficult to implement:

   - The new user-agent breakdown reports (browser-reports.wmflabs.org) are
   derived from the pageview_hourly table, which comes from the webrequest
   table, both in Hadoop. Neither of them has structured information about
   the namespace. It would have to be parsed from the URL or other fields,
   but namespaces have different names in different languages, so this would
   be very tricky.

   - Breaking down the user-agent statistics per article would also be very
   expensive computationally, given the high number of articles. The Pageview
   API shows pageviews per article, and to get there we in Analytics had to
   solve storage, data loading and compression problems that arose from the
   sheer size of that data. Having a user-agent breakdown per article would
   multiply that size by a lot.

Maybe I'm missing an easier way to do it, but it seems it would take a long
time to solve this.
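
To give an idea of why the namespace part is tricky: each wiki exposes its
own localized namespace names, so any URL-to-namespace mapping would have to
be built per project, along these lines (a rough sketch using the public
MediaWiki API, not production code):

import requests

def localized_namespaces(wiki_host):
    """Return {namespace id: localized name} for one wiki. For example,
    namespace 2 is 'User' on en.wikipedia.org but 'Benutzer' on
    de.wikipedia.org."""
    resp = requests.get(
        "https://%s/w/api.php" % wiki_host,
        params={"action": "query", "meta": "siteinfo",
                "siprop": "namespaces", "format": "json"},
        headers={"User-Agent": "namespace-example (contact: you@example.org)"},
    )
    resp.raise_for_status()
    namespaces = resp.json()["query"]["namespaces"]
    return {int(ns_id): ns.get("*", "") for ns_id, ns in namespaces.items()}

print(localized_namespaces("de.wikipedia.org")[2])  # Benutzer

And that is just the lookup; applying it to every request for ~800 wikis
inside the pageview pipeline is where it gets expensive.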

On Fri, Apr 29, 2016 at 1:04 PM, Amir E. Aharoni <
amir.ahar...@mail.huji.ac.il> wrote:

> Would it be crazy to ask for statistics of user agents per page or per
> namespace?
>
> I'd hypothesize, for example, that IE is used much less outside of the
> article and portal namespaces.
>
> In case you're wondering what is it useful for: When I have a patch that
> requires browser compatibility trickery, I may want to invest less time in
> IE compatibility in a page that is unlikely to be viewed in IE much.
>
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> ‪“We're living in pieces,
> I want to live in peace.” – T. Moore‬
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Analytics Digest, Vol 50, Issue 21

2016-04-26 Thread Marcel Ruiz Forns
Hi Edo!


> The nl-wiki has 300.000-350.000 hits on the main page per day. The rest of
> the top10 drops quickly down to 5000 hits per day, a reasonable amount. But
> the 1.500.000 unique visitors per day then seems overstated, when I do a
> rough estimate, it looks like 1.500.000 is the total number of page views,
> the number of unique devices must be a lot smaller then.


It does look like this, but if you sum the view counts for the top 993 most
visited articles (
https://wikimedia.org/api/rest_v1/metrics/pageviews/top/nl.wikipedia.org/all-access/2016/02/01),
it adds up to 1.203.816. Also, nl.wikipedia.org has more than 1 million
articles; I think the articles not included in the top list would raise that
count far above 1.500.000. We should also consider that the majority of
Wikipedia visitors look up only one article per visit. So I think 1.635.478
unique devices makes sense.
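
In case it helps, this is roughly how that sum can be reproduced (just a
sketch against the same URL as above, with a placeholder contact address in
the User-Agent):

import requests

url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
       "nl.wikipedia.org/all-access/2016/02/01")
resp = requests.get(url, headers={"User-Agent": "example-script (contact: you@example.org)"})
resp.raise_for_status()
articles = resp.json()["items"][0]["articles"]
print(len(articles), sum(a["views"] for a in articles))  # 993 articles, ~1.2M views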

Cheers

On Wed, Apr 20, 2016 at 1:38 PM, Edo de Roo  wrote:

> Gergo Tisza makes a valid point.
>
> The nl-wiki has 300.000-350.000 hits on the main page per day. The rest of
> the top10 drops quickly down to 5000 hits per day, a reasonable amount.
> But the 1.500.000 unique visitors per day then seems overstated, when I do
> a rough estimate, it looks like 1.500.000 is the total number of page
> views, the number of unique devices must be a lot smaller then.
>
> See
> https://wikimedia.org/api/rest_v1/metrics/unique-devices/nl.wikipedia.org/all-sites/daily/20160201/20160201
>
> Edo de Roo
> nl-wiki, wikidata
>
> On Tue, Apr 19, 2016 at 11:50 PM, 
> wrote:
>
>> Send Analytics mailing list submissions to
>> analytics@lists.wikimedia.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>> or, via email, send a message with subject or body 'help' to
>> analytics-requ...@lists.wikimedia.org
>>
>> You can reach the person managing the list at
>> analytics-ow...@lists.wikimedia.org
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of Analytics digest..."
>>
>>
>> Today's Topics:
>>
>>1. Unique Devices data available on API (Nuria Ruiz)
>>2. Hive & Oozie downtime tomorrow (Andrew Otto)
>>3. Re: Unique Devices data available on API (Gergo Tisza)
>>4. Re: Unique Devices data available on API (Kevin Leduc)
>>
>>
>> --
>>
>> Message: 1
>> Date: Tue, 19 Apr 2016 12:17:12 -0700
>> From: Nuria Ruiz 
>> To: "A mailing list for the Analytics Team at WMF and everybody who
>> has an interest in Wikipedia and analytics."
>> ,  Wikimedia developers
>> ,
>> wiki-researc...@lists.wikimedia.org
>> Subject: [Analytics] Unique Devices data available on API
>> Message-ID:
>> 

[Analytics] Edit-Analysis Dashboard back on track

2016-03-22 Thread Marcel Ruiz Forns
Hi editing,

Just to let you know that, after the modifications to the Edit table in the
EL database, the reports have been able to catch up and backfill up to today.
So https://edit-analysis.wmflabs.org/compare/ is working again.

Cheers!

-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] WikimediaBot convention

2016-02-03 Thread Marcel Ruiz Forns
> So including URL/email as part of your detection should capture most
> well written bots.
> Also including any requests from tools.wmflabs.org and friends as
> 'bot' might also be a useful improvement.

That is a very good insight, thanks. Currently, the User-Agent policy is
not reflected in our regular expressions, meaning they do not match
emails, user pages or other mediawiki urls. They could also, as you
suggest, match github accounts or tools.wmflabs.org. We in Analytics
should tackle that. I will create a task for it and add it to
the proposal.
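
To make it a bit more concrete, the kind of matching I have in mind would be
something along these lines (only a rough sketch in Python, not the refinery
code, and the exact patterns would of course need review):

import re

# Treat a user agent as a self-identified bot if it carries a contact email,
# a wiki user page, a mediawiki/wikimedia url, a github account, a
# tools.wmflabs.org url, or the word "bot" itself.
SELF_IDENTIFYING_PATTERNS = [
    r"[\w.+-]+@[\w-]+\.[\w.-]+",      # contact email address
    r"User:[^\s;)]+",                 # wiki user page
    r"(mediawiki|wikimedia)\.org",    # link back to a project page
    r"github\.com/[\w.-]+",           # github account or repository
    r"tools\.wmflabs\.org",           # tools hosted on wmflabs
    r"bot",                           # the word "bot" anywhere in the string
]
SELF_IDENTIFYING_RE = re.compile("|".join(SELF_IDENTIFYING_PATTERNS), re.IGNORECASE)

def is_self_identified_bot(user_agent):
    return bool(SELF_IDENTIFYING_RE.search(user_agent))

print(is_self_identified_bot(
    "MyCoolTool/1.2 (https://tools.wmflabs.org/mycooltool; mycool@example.org)"))  # True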

Thanks again, in short I'll send the proposal with the changes.

On Wed, Feb 3, 2016 at 1:00 AM, John Mark Vandenberg <jay...@gmail.com>
wrote:

> On Wed, Feb 3, 2016 at 6:40 AM, Marcel Ruiz Forns <mfo...@wikimedia.org>
> wrote:
> > Hi all,
> >
> > It seems comments are decreasing at this point. I'd like to slowly drive
> > this thread to a conclusion.
> >
> >
> >> 3. Create a plan to block clients that dont implement the (amended)
> >> User-Agent policy.
> >
> >
> > I think we can decide on this later. Steps 1) and 2) can be done first -
> > they should be done anyway before 3) - and then we can see how much
> benefit
> > we raise from them. If we don't get a satisfactory reaction from
> > bot/framework maintainers, we then can go for 3). John, would you be OK
> with
> > that?
>
> I think you need to clearly define what you want to capture and
> classify, and re-evaluate what change to the user-agent policy will
> have any noticeable impact on your detection accuracy in the next five
> years.
>
> The eventual definition of 'bot' will be very central to this issue.
> Which tools need to start adding 'bot'?  What is 'human' use?  This
> terminology problem has caused much debate on the wikis, reaching
> arbcom several times.  So, precision in the definition will be quite
> helpful.
>
> One of the strange area's to consider is jquery-based tools that are
> effectively bots, performing large numbers of operations on pages in
> batches with only high-level commands being given by a human.  e.g.
> the gadget Cat-a-Lot.  If those are not a 'bot', then many pywikibot
> scripts are also not a 'bot'.
>
> If gadgets and user-scripts may need to follow the new 'bot' rule of
> the user-agent policy, the number of developers that need to be
> engaged is much larger.
>
> If the proposal is to require only 'bot' in the user-agent,
> pywikipediabot and pywikibot both need no change to add it (yay!, but
> do we need to add 'human' to the user-agent for some scripts??), but
> many client frameworks will still need to change their user-agent,
> including for example both of the Go frameworks.
> https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
>
> https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad7205/core.go#L21
>
> By doing some analysis of the existing user-agents hitting your
> servers, maybe you can find an easy way to grandfather in most client
> frameworks.   e.g. if you also add 'github' as a bot pattern, both Go
> frameworks are automatically now also supported.
>
> Please understand the gravity of what you are imposing.  Changing a
> user-agent of a client is a breaking change, and any decent MediaWiki
> client is also used by non-Wikimedia wikis, administrated by
> non-Wikimedia ops teams, who may have their own tools doing analysis
> of user-agents hitting their servers, possibly including access
> control rules.  And their rules and scripts may break when a client
> framework changes its user-agent in order to make the Wikimedia
> Analytics scripts easier.  Strictly speaking your user-agent policy
> proposal requires a new _major_ release for every client framework
> that you do not grandfather into your proposed user-agent policy.
>
> Poorly written/single-purpose/once-off clients are less of a problem,
> as forcing change on them has lower impact.
>
> [[w:User_agent]] says:
>
> "Bots, such as Web crawlers, often also include a URL and/or e-mail
> address so that the Webmaster can contact the operator of the bot."
>
> So including URL/email as part of your detection should capture most
> well written bots.
> Also including any requests from tools.wmflabs.org and friends as
> 'bot' might also be a useful improvement.
>
> The `analytics-refinery-source` code currently differentiates between
> spider and bot, but earlier in this thread you said
>
>   'I don't think we need to differentiate between "spiders" and "bots".'
>
> If you require 'bot' in the user-agent for bots, this will also
> capture Googlebot and YandexBot, and many other tools which use 'bot'
> .  Do you want Goog

Re: [Analytics] WikimediaBot convention

2016-02-03 Thread Marcel Ruiz Forns
Hi again analytics list,

Thank you all for your comments and feedback!
We consider this thread closed and will now proceed to:

   1. Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy
   encouraging (but not requiring) the word "bot" (case-insensitive) in the
   User-Agent string, so that bots that generate pageviews not consumed onsite
   by humans can be easily identified by the Analytics cluster, thus
   increasing the accuracy of the human-vs-bot traffic split (a hypothetical
   example of such a User-Agent is sketched below).

   2. Advertise the convention and reach out to bot/framework maintainers
   to increase the share of bots that implement the User-Agent policy.

   3. The Analytics team should implement regular expressions that match what
   the current User-Agent policy asks for: User-Agent strings containing
   emails, user pages, other mediawiki urls, github urls, or tools.wmflabs.org
   urls. This will take some time and will probably raise technical issues,
   but it seems we can benefit from it. https://phabricator.wikimedia.org/T125731
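
To make point 1 above concrete, this is the kind of thing a read-only bot
could do (a purely hypothetical example: the bot name, user page and email
address are made up, and any API endpoint would do):

import requests

HEADERS = {
    # Contains the word "bot" plus contact details, per the User-Agent policy.
    "User-Agent": ("ExampleResearchBot/0.1 "
                   "(https://meta.wikimedia.org/wiki/User:ExampleResearchBot; "
                   "example.researcher@example.org) python-requests")
}

url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/all-agents/Main_Page/daily/2016010100/2016010700")
resp = requests.get(url, headers=HEADERS)
resp.raise_for_status()
for item in resp.json()["items"]:
    print(item["timestamp"], item["views"])

Pageviews generated with a User-Agent like this would be easy to classify as
bot traffic once the regular expressions from point 3 are in place.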

Cheers!


On Wed, Feb 3, 2016 at 11:43 PM, Marcel Ruiz Forns <mfo...@wikimedia.org>
wrote:

> John, thank you a lot for taking the time to answer my question. My
> responses inline (I rearranged some of your paragraphs to respond to them
> together):
>
> I think you need to clearly define what you want to capture and
>> classify, and re-evaluate what change to the user-agent policy will
>> have any noticeable impact on your detection accuracy in the next five
>> years.
>
> &
>
>> If you do not want Googlebot to be grouped together with api based
>> bots , either the user-agent need to use something more distinctive,
>> such as 'MediaWikiBot', or you will need another regex of all the
>> 'bot' matches which you dont want to be a bot.
>
> &
>
>> The `analytics-refinery-source` code currently differentiates between
>> spider and bot, but earlier in this thread you said
>>   'I don't think we need to differentiate between "spiders" and "bots".'
>> If you require 'bot' in the user-agent for bots, this will also
>> capture Googlebot and YandexBot, and many other tools which use 'bot'
>> .  Do you want Googlebot to be a bot?
>> But Yahoo! Slurp's useragent doesnt include bot will not.
>> So you will still need a long regex for user-agents of tools which you
>> can't impose this change onto.
>
> Differentiating between "spiders" and "bots" can be very tricky, as you
> explain. There was some work on it in the past, but what we really want at
> the moment is: to split the human vs bot traffic with a higher accuracy. I
> will add that to the docs, thanks. Regarding measuring the impact, as we'll
> not be able to differentiate "spiders" and "bots", we can only observe the
> variations of the human vs bot traffic rates in time and try to associate
> those to recent changes in User-Agent strings or regular expressions.
>
> The eventual definition of 'bot' will be very central to this issue.
>> Which tools need to start adding 'bot'?  What is 'human' use?  This
>> terminology problem has caused much debate on the wikis, reaching
>> arbcom several times.  So, precision in the definition will be quite
>> helpful.
>
> Agree, will add that to the proposal.
>
> One of the strange area's to consider is jquery-based tools that are
>> effectively bots, performing large numbers of operations on pages in
>> batches with only high-level commands being given by a human.  e.g.
>> the gadget Cat-a-Lot.  If those are not a 'bot', then many pywikibot
>> scripts are also not a 'bot'.
>
> I think the key here is: the program should be tagged as a bot by
> analytics, if it generates pageviews not consumed onsite by a human. I will
> mention that in the docs, too. Thanks.
>
>
>> If gadgets and user-scripts may need to follow the new 'bot' rule of
>> the user-agent policy, the number of developers that need to be
>> engaged is much larger.
>
> &
>
>> Please understand the gravity of what you are imposing.  Changing a
>> user-agent of a client is a breaking change, and any decent MediaWiki
>> client is also used by non-Wikimedia wikis, administrated by
>> non-Wikimedia ops teams, who may have their own tools doing analysis
>> of user-agents hitting their servers, possibly including access
>> control rules.  And their rules and scripts may break when a client
>> framework changes its user-agent in order to make the Wikimedia
>> Analytics scripts easier.  Strictly speaking your user-agent policy
>> proposal requires a new _major_ release for every client framework
>> that you do not grandfather into your proposed user-agent policy.
>
&g

Re: [Analytics] WikimediaBot convention

2016-02-02 Thread Marcel Ruiz Forns
Hi all,

It seems comments are decreasing at this point. I'd like to slowly drive
this thread to a conclusion.


3. Create a plan to block clients that dont implement the (amended)
> User-Agent policy.


I think we can decide on this later. Steps 1) and 2) can be done first -
they should be done before 3) anyway - and then we can see how much benefit
we get from them. If we don't get a satisfactory reaction from
bot/framework maintainers, we can then go for 3). John, would you be OK
with that?


If no-one else raises concerns about this, the Analytics team will:

   1. Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy,
   to encourage including the word "bot" (case-insensitive) in the User-Agent
   string, so that bots can be easily identified.

   2. Advertise the convention and reach out to bot/framework maintainers
   to increase the share of bots that implement the User-Agent policy.


Thanks!

On Tue, Feb 2, 2016 at 5:21 AM, Bryan Davis <bd...@wikimedia.org> wrote:

> On Mon, Feb 1, 2016 at 11:42 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:
> >>It will take time for frameworks to implement an amended User-Agent
> policy.
> >>For example, pywikipedia (pywikibot compat) is not actively
> >>maintained.
> > That doesn't imply we shouldn't have a  policy that anyone can refer to,
> > these bots will not follow it until they get some maintainers.
> >
> >>There was a task filled against Analytics for this, but Dan Andreescu
> >>removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
> >
> > Sorry that the tagging is confusing. I think Analytics tag was removed
> cause
> > this is a request for data and our team doesn't do data retrieval. We
> > normally tag with "analytics" phabricator items that have actionables for
> > our team.
> > I am cc-ing Bryan who has already done some analysis on bots requests to
> the
> > API and can probably provide some data.
>
> It would be possible to make some relative comparisons of pywikibot
> versions using the data that is currently collected in the
> wmf.webrequest data set. "Someday" I'll get T108618 [0] finished which
> will make answering some of the more granular questions in T99373
> easier. Kunal talked with Brad and I a few weeks ago when we were all
> in SF for the DevSummit about other instrumentation that could be put
> in place specifically for pywikibot so that something like
> Special:ApiFeatureUsage [2] could be created for pywikibot version
> tracking as well. This all seems like a fork of the topic at hand
> however.
>
> [0]: https://phabricator.wikimedia.org/T108618
> [1]: https://phabricator.wikimedia.org/T99373
> [2]: https://en.wikipedia.org/wiki/Special:ApiFeatureUsage
>
> Bryan
> --
> Bryan Davis  Wikimedia Foundation<bd...@wikimedia.org>
> [[m:User:BDavis_(WMF)]]  Sr Software EngineerBoise, ID USA
> irc: bd808        v:415.839.6885 x6855
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>



-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] WikimediaBot convention

2016-02-01 Thread Marcel Ruiz Forns
>
> >Another option to this thread would be: cancelling the convention and
> continue working on regexps
> I think regardless of our convention we will always be doing regex
> detection of self-identified bots. Maybe I am missing some nuance here?

No, no, Nuria, you're right. I meant continuing to improve the regexps and
the other means we have to identify bots; I didn't mean to imply that we
should stop doing regexps if we establish the convention.

On Mon, Feb 1, 2016 at 7:44 PM, Nuria Ruiz <nu...@wikimedia.org> wrote:

> >In the past, the Analytics team also considered enforcing the convention
> by blocking those bots that don't follow it. And that is still an option to
> consider.
> I would like to point out that I think this is probably the prerogative of
> api's team rather than analytics.
>
>
> >Another option to this thread would be: cancelling the convention and
> continue working on regexps
> I think regardless of our convention we will always be doing regex
> detection of self-identified bots. Maybe I am missing some nuance here?
>
>
>
>
>
> On Mon, Feb 1, 2016 at 10:42 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>
>> >It will take time for frameworks to implement an amended User-Agent
>> policy.
>> >For example, pywikipedia (pywikibot compat) is not actively
>> >maintained.
>> That doesn't imply we shouldn't have a  policy that anyone can refer to,
>> these bots will not follow it until they get some maintainers.
>>
>> >There was a task filled against Analytics for this, but Dan Andreescu
>> >removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
>>
>> Sorry that the tagging is confusing. I think Analytics tag was removed
>> cause this is a request for data and our team doesn't do data retrieval. We
>> normally tag with "analytics" phabricator items that have actionables for
>> our team.
>> I am cc-ing Bryan who has already done some analysis on bots requests to
>> the API and can probably provide some data.
>>
>>
>>
>>
>> On Mon, Feb 1, 2016 at 6:39 AM, John Mark Vandenberg <jay...@gmail.com>
>> wrote:
>>
>>> Hi Marcel,
>>>
>>> It will take time for frameworks to implement an amended User-Agent
>>> policy.
>>> For example, pywikipedia (pywikibot compat) is not actively
>>> maintained.  We dont know how much traffic is generated by compat.
>>> There was a task filled against Analytics for this, but Dan Andreescu
>>> removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
>>>
>>> There are a lot of clients that need to be upgraded or be
>>> decommissioned for this 'add bot' strategy to be effective in the near
>>> future.  see https://www.mediawiki.org/wiki/API:Client_code
>>>
>>> The all important missing step is
>>>
>>> 3. Create a plan to block clients that dont implement the (amended)
>>> User-Agent policy.
>>>
>>> Without that plan, successfully implemented, you will not get quality
>>> data (i.e. using 'Netscape' in the U-A to guess 'human' would perform
>>> better).
>>>
>>> On Tue, Feb 2, 2016 at 1:24 AM, Marcel Ruiz Forns <mfo...@wikimedia.org>
>>> wrote:
>>> > So, trying to join everyone's points of view, what about?
>>> >
>>> > Using the existing https://meta.wikimedia.org/wiki/User-Agent_policy
>>> and
>>> > modify it to encourage adding the word "bot" (case-insensitive) to the
>>> > User-Agent string, so that it can be easily used to identify bots in
>>> the
>>> > anlytics cluster (no regexps). And link that page from whatever other
>>> pages
>>> > we think necessary.
>>> >
>>> > Do some advertising and outreach and get some bot maintainers and
>>> maybe some
>>> > frameworks to implement the User-Agent policy. This would make the
>>> existing
>>> > policy less useless.
>>> >
>>> > Thanks all for the feedback!
>>> >
>>> > On Mon, Feb 1, 2016 at 3:16 PM, Marcel Ruiz Forns <
>>> mfo...@wikimedia.org>
>>> > wrote:
>>> >>>
>>> >>> Clearly Wikipedia et al. uses bot to refer to automated software that
>>> >>> edits the site but it seems like you are using the term bot to refer
>>> to all
>>> >>> automated software and it might be good to clarify.
>>> >>
>>> >>
>>> >> OK, in the documentation we can make tha

Re: [Analytics] WikimediaBot convention

2016-02-01 Thread Marcel Ruiz Forns
So, trying to join everyone's points of view, what about?

   1. Using the existing https://meta.wikimedia.org/wiki/User-Agent_policy and
   modifying it to encourage adding the word "bot" (case-insensitive) to the
   User-Agent string, so that it can be easily used to identify bots in the
   analytics cluster (no regexps needed). And linking to that page from
   whatever other pages we think necessary.

   2. Do some advertising and outreach and get some bot maintainers and
   maybe some frameworks to implement the User-Agent policy. This would make
   the existing policy less useless.

Thanks all for the feedback!

On Mon, Feb 1, 2016 at 3:16 PM, Marcel Ruiz Forns <mfo...@wikimedia.org>
wrote:

> Clearly Wikipedia et al. uses bot to refer to automated software that
>> edits the site but it seems like you are using the term bot to refer to all
>> automated software and it might be good to clarify.
>
>
> OK, in the documentation we can make that clear. And looking into that,
> I've seen that some bots, in the process of doing their "editing" work can
> also generate pageviews. So we should also include them as potential source
> of pageview traffic. Maybe we can reuse the existing User-Agent policy.
>
>
> This makes a lot of sense. If I build a bot that crawls wikipedia and
>> facebook public pages it really doesn't make sense that my bot has a
>> "wikimediaBot" user agent, just the word "Bot"  should probably be enough.
>
>
> Totally agree.
>
>
> I guess a bigger question is why try to differentiate between "spiders"
>> and "bots" at all?
>
>
> I don't think we need to differentiate between "spiders" and "bots". The
> most important question we want to respond is: how much of the traffic we
> consider "human" today is actually "bot". So, +1 "bot" (case-insensitive).
>
>
> On Fri, Jan 29, 2016 at 9:16 PM, John Mark Vandenberg <jay...@gmail.com>
> wrote:
>
>> On 28 Jan 2016 11:28 pm, "Marcel Ruiz Forns" <mfo...@wikimedia.org>
>> wrote:
>> >>
>> >> Why not just "Bot", or "MediaWikiBot" which at least encompasses all
>> sites that the client
>> >> can communicate with.
>> >
>> > I personally agree with you, "MediaWikiBot" seems to have better
>> semantics.
>>
>> For clients accessing the MediaWiki api, it is redundant.
>> All it does is identify bots that comply with this edict from analytics.
>>
>> --
>> John Vandenberg
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
>
> --
> *Marcel Ruiz Forns*
> Analytics Developer
> Wikimedia Foundation
>



-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] WikimediaBot convention

2016-01-27 Thread Marcel Ruiz Forns
Hi analytics list,

In the past months the WikimediaBot convention has been mentioned in a
couple of threads, but we (the Analytics team) never finished establishing
and advertising it. In this email we explain what the convention is today
and what purpose it serves, and we also ask for feedback to make sure we can
continue with the next steps.

What is the WikimediaBot convention?
It is a way of better identifying Wikimedia traffic originating from bots.
Today we know that a significant share of Wikimedia traffic comes from
bots. We can recognize part of that traffic with regular expressions[1],
but we cannot recognize all of it, because some bots do not identify
themselves as such. If we could identify a greater part of the bot traffic,
we could also better isolate the human traffic and permit more accurate
analyses.

Who should follow the convention?
Computer programs that access Wikimedia sites or the Wikimedia API for
reading purposes* in a periodic, scheduled or automatically triggered way.

Who should NOT follow the convention?
Computer programs that follow the on-site, ad-hoc commands of a human, like
browsers; and well-known spiders that are otherwise recognizable by their
well-known user-agent strings.

How to follow the convention?
The client's user-agent string should contain the word "WikimediaBot". The
word can be anywhere within the user-agent string and is case-sensitive.

So, please, feel free to post your comments/feedback on this thread. In the
course of this discussion we can adjust the convention's definition and, if
no major concerns are raised, in 2 weeks we'll create a documentation page
in Wikitech, send an email to the proper mailing lists and maybe write a
blog post about it.

Thanks a lot!

(*) There is already another convention[2] for bots that EDIT Wikimedia
content.

[1]
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L49
[2] https://www.mediawiki.org/wiki/Manual:Bots

-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] MobileWikiAppShareAFact event stream was: [WikimediaMobile] Stopping eventlogging events into MobileWikiAppShareAFact table

2016-01-04 Thread Marcel Ruiz Forns
Thanks Tilman,

It makes sense to reduce the sampling rate of the schema for
"Datensparsamkeit and faster queries". However, if you don't specifically
need MySQL, and are fine querying through Hive, we could continue storing
all events at the current 1% rate in Hadoop.

On Mon, Jan 4, 2016 at 11:28 AM, Tilman Bayer <tba...@wikimedia.org> wrote:

> Hi Marcel,
>
> yes, this is to be expected, because the schema is now logging more
> kinds of events than before. However, we could reduce the sampling
> rate considerably, as JonR and I had already envisaged
> (https://phabricator.wikimedia.org/T120292#1854136 ; this got lost a
> bit among the other schema changes, cf.
> https://phabricator.wikimedia.org/T120292#1864549 ).
>
> On Sun, Jan 3, 2016 at 12:30 PM, Marcel Ruiz Forns <mfo...@wikimedia.org>
> wrote:
> > BTW, MobileWebSectionUsage schema is sending a lot of events since Dec
> 18,
> > 2015.
> > It normally would send around 40 events per second, and it's sending
> around
> > 120 events per second now. It's now the highest throughput schema in EL
> by
> > far. Is that expected?
> >
> > Sorry for using this same thread. If this needs to be taken care of, I
> will
> > create a new task.
> > Thanks!
> >
> >
> > On Tue, Dec 29, 2015 at 8:41 PM, Nuria Ruiz <nu...@wikimedia.org> wrote:
> >>
> >> Sorry i misses this but it always has sent events to a real high volume.
> >>
> >> On Tue, Dec 22, 2015 at 10:25 AM, Jon Katz <jk...@wikimedia.org> wrote:
> >>>
> >>> + Dmitry
> >>>
> >>> Hi Nuria,
> >>> I will ask Dmitry to confirm, but I think a pause is fine for the next
> >>> couple of days as long as we are given the timestamps for outage can
> note it
> >>> on the schema wiki page.  Is this a sudden increase or has it always
> been
> >>> sending to high of a volume?  Regardless, I imagine a higher sampling
> rate
> >>> can probably be applied.
> >>> -J
> >>>
> >>> On Tue, Dec 22, 2015 at 9:58 AM, Nuria Ruiz <nu...@wikimedia.org>
> wrote:
> >>>>
> >>>> Team:
> >>>>
> >>>> This  schema MobileWikiAppShareAFact is sending a lot of events, maybe
> >>>> is worth thinking whether we need that many. It is again a case where
> tables
> >>>> are becoming huge and hard to query fast.
> >>>>
> >>>> cc-ing Jon as schema owner.
> >>>>
> >>>> Can this data be sampled at a higher sampling rate? I have filed a
> >>>> ticket to this fact:
> >>>> https://phabricator.wikimedia.org/T14
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Nuria
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Dec 22, 2015 at 8:35 AM, Adam Baso <ab...@wikimedia.org>
> wrote:
> >>>>>
> >>>>> Replacing mobile-tech with mobile-l (internal mobile-tech list
> >>>>> discontinued).
> >>>>>
> >>>>>
> >>>>> On Tuesday, December 22, 2015, Nuria Ruiz <nu...@wikimedia.org>
> wrote:
> >>>>>>
> >>>>>> Team:
> >>>>>>
> >>>>>> As part of our effort of converting eventlogging mysql database to
> the
> >>>>>> tokudb engine we need to stop eventlogging events from flowing into
> the
> >>>>>> MobileWikiAppShareAFact table, we are using this one table to see
> how long
> >>>>>> the conversion will take in order to plan for a larger outage
> window.
> >>>>>>
> >>>>>>
> >>>>>> Let us know if data should be backfilled as it can be, we anticipate
> >>>>>> events will not flow into table for the better part of one day.
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Nuria
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> ___
> >>>>> Mobile-l mailing list
> >>>>> mobil...@lists.wikimedia.org
> >>>>> https://lists.wikimedia.org/mailman/listinfo/mobile-l
> >>>>>
> >>>>
> >>>
> >>
> >>
> >> ___
> >> Analytics mailing list
> >> Analytics@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>
> >
> >
> >
> > --
> > Marcel Ruiz Forns
> > Analytics Developer
> > Wikimedia Foundation
> >
> > ___
> > Analytics mailing list
> > Analytics@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
>
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>



-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] MobileWikiAppShareAFact event stream was: [WikimediaMobile] Stopping eventlogging events into MobileWikiAppShareAFact table

2016-01-03 Thread Marcel Ruiz Forns
BTW, the MobileWebSectionUsage schema has been sending a lot of events since
Dec 18, 2015.
It normally sends around 40 events per second, but it's now sending around
120 events per second, which makes it the highest-throughput schema in EL by
far. Is that expected?

Sorry for using this same thread. If this needs to be taken care of, I will
create a new task.
Thanks!


On Tue, Dec 29, 2015 at 8:41 PM, Nuria Ruiz <nu...@wikimedia.org> wrote:

> Sorry i misses this but it always has sent events to a real high volume.
>
> On Tue, Dec 22, 2015 at 10:25 AM, Jon Katz <jk...@wikimedia.org> wrote:
>
>> + Dmitry
>>
>> Hi Nuria,
>> I will ask Dmitry to confirm, but I think a pause is fine for the next
>> couple of days as long as we are given the timestamps for outage can note
>> it on the schema wiki page.  Is this a sudden increase or has it always
>> been sending to high of a volume?  Regardless, I imagine a higher sampling
>> rate can probably be applied.
>> -J
>>
>> On Tue, Dec 22, 2015 at 9:58 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>>
>>> Team:
>>>
>>> This  schema MobileWikiAppShareAFact is sending a lot of events, maybe
>>> is worth thinking whether we need that many. It is again a case where
>>> tables are becoming huge and hard to query fast.
>>>
>>> cc-ing Jon as schema owner.
>>>
>>> Can this data be sampled at a higher sampling rate? I have filed a
>>> ticket to this fact:
>>> https://phabricator.wikimedia.org/T14
>>>
>>> Thanks,
>>>
>>> Nuria
>>>
>>>
>>>
>>>
>>> On Tue, Dec 22, 2015 at 8:35 AM, Adam Baso <ab...@wikimedia.org> wrote:
>>>
>>>> Replacing mobile-tech with mobile-l (internal mobile-tech list
>>>> discontinued).
>>>>
>>>>
>>>> On Tuesday, December 22, 2015, Nuria Ruiz <nu...@wikimedia.org> wrote:
>>>>
>>>>> Team:
>>>>>
>>>>> As part of our effort of converting eventlogging mysql database to the
>>>>> tokudb engine we need to stop eventlogging events from flowing into the 
>>>>> MobileWikiAppShareAFact
>>>>> table, we are using this one table to see how long the conversion will 
>>>>> take
>>>>> in order to plan for a larger outage window.
>>>>>
>>>>>
>>>>> Let us know if data should be backfilled as it can be, we anticipate
>>>>> events will not flow into table for the better part of one day.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Nuria
>>>>>
>>>>>
>>>>>
>>>> ___
>>>> Mobile-l mailing list
>>>> mobil...@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/mobile-l
>>>>
>>>>
>>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Top Edits/Views in 2015 per project?

2016-01-03 Thread Marcel Ruiz Forns
>
> I'm curios what these difficulties are that prevent the calculation of
> monthly top viewed pages. After a request from the WMF Communications
> team, I generated a list of the 200 most viewed pages from May 2015 to
> October 2015 as input for the #Edit2015 video (cf.
> https://phabricator.wikimedia.org/T117945 ). IIRC that query did not
> take terribly long to complete.


The current top endpoint has breakdowns by project (~800 wikis) and access
method (desktop / mobile-web / mobile-app). I think this makes the
computation harder than a single global query would be.


On Sat, Jan 2, 2016 at 2:00 PM, Tilman Bayer <tba...@wikimedia.org> wrote:

> On Wed, Dec 16, 2015 at 10:58 AM, Dan Andreescu
> <dandree...@wikimedia.org> wrote:
> > Itzik,
> >
> > The way we're computing top pageviews right now doesn't scale very well,
> we
> > aren't even able to properly do monthly top pages.  So we opened this
> issue:
> > https://phabricator.wikimedia.org/T120113.  When we fix that, it's
> possible
> > we'll be able to get yearly top pages too, but I'm not promising
> anything :)
> I'm curios what these difficulties are that prevent the calculation of
> monthly top viewed pages. After a request from the WMF Communications
> team, I generated a list of the 200 most viewed pages from May 2015 to
> October 2015 as input for the #Edit2015 video (cf.
> https://phabricator.wikimedia.org/T117945 ). IIRC that query did not
> take terribly long to complete.
>
> >
> > As for most edited articles, that can be done with a simple query on each
> > database, but it would have really bad performance, like maybe it would
> > never terminate.  When we figure out how to compute top pageviews faster
> > (bloom filters maybe?) we'll surface the solution and then it'll be
> pretty
> > easy to get top edited.
> >
> Don't know about bloom filters, but note that one does not need to use
> the webrequest data for this - MediaWiki itself stores every edit in a
> revision table. Since this thead, AaronH has provided this data (on
> the request of the WMF Communications team, as he did last year) at
> https://phabricator.wikimedia.org/T122604 .
>
> > On Wed, Dec 16, 2015 at 1:52 PM, Itzik - Wikimedia Israel
> > <it...@wikimedia.org.il> wrote:
> >>
> >> Hi,
> >>
> >> I see that the (amazing!) API still can't give us results for the whole
> >> 2015. So any way we can get this pages views per project? And also, the
> most
> >> edited articles in 2015 per project?
> >>
> >> This can be a great PR information for the communication representatives
> >> around to world to release to local journalists.
> >>
> >>
> >> Regards,
> >> Itzik Edri
> >> Chairperson, Wikimedia Israel
> >> +972-(0)-54-5878078 | http://www.wikimedia.org.il
> >> Imagine a world in which every single human being can freely share in
> the
> >> sum of all knowledge. That's our commitment!
> >>
> >>
> >>
>
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>



-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Echo schema eventlogging

2015-12-16 Thread Marcel Ruiz Forns
Just spoke with Jaime Crespo and he confirmed that:

   - m4-master (the master EL database) only holds events for the last 45
   days, to avoid space problems. That applies to all tables, including Echo.

   - analytics-storage is the replica that keeps the historical data and is
   meant to apply the specific purging strategy agreed on in the schema's talk
   page. This database does not have space problems (yet).


On Wed, Dec 16, 2015 at 2:14 AM, Aaron Halfaker <ahalfa...@wikimedia.org>
wrote:

> No!  Please do not nuke old data.  +1 to J-Mo.  This will probably be
> useful for long-term studies of notifications.  If I had the time, I'd pick
> it up right now based on this reminder!
>
> I'm happy with having historical data preserved (please make sure that it
> is) and the MySQL table dropped until a recent point.  It will be important
> that we can come back to this later and either restore the data or query it
> in its entirety from hadoop.
>
> -Aaron
>
> On Tue, Dec 15, 2015 at 1:34 PM, Madhumitha Viswanathan <
> mviswanat...@wikimedia.org> wrote:
>
>> I want to mention that data in Hadoop is only available from Aug 27th
>> 2015. Older data is only available in mysql.
>>
>> On Tue, Dec 15, 2015 at 11:27 AM, Roan Kattouw <rkatt...@wikimedia.org>
>> wrote:
>>
>>> If the data is going to be retained but would just become harder to
>>> query (i.e. still in Hadoop but not in mysql), maybe we could nuke data
>>> that's more than a year old (or 6 months old or something) from mysql?
>>>
>>> On Tue, Dec 15, 2015 at 9:35 AM, Andrew Otto <ao...@wikimedia.org>
>>> wrote:
>>>
>>>> We could blacklist this schema from the mysql database, and still keep
>>>> producing it.  It would be available in Hadoop either way.
>>>>
>>>>
>>>> On Dec 15, 2015, at 12:22, Jonathan Morgan <jmor...@wikimedia.org>
>>>> wrote:
>>>>
>>>> Hi Nuria,
>>>>
>>>> FWIW: Although I'm not using this right now, but I could see it being
>>>> useful for understanding the impact of new notification updates that are
>>>> coming down the pike.[1][2]
>>>>
>>>> What are the costs involved in keeping this schema up?
>>>>
>>>> Best,
>>>> J
>>>>
>>>> 1.
>>>> https://meta.wikimedia.org/wiki/Research:Cross-wiki_notifications_user_research
>>>> 2. https://phabricator.wikimedia.org/T116741
>>>>
>>>> On Tue, Dec 15, 2015 at 8:22 AM, Nuria Ruiz <nu...@wikimedia.org>
>>>> wrote:
>>>>
>>>>> Roan:
>>>>>
>>>>> The data for Echo schema(https://meta.wikimedia.org/wiki/Schema:Echo)
>>>>> is quite large and we are not sure is even used.
>>>>>
>>>>> Can you confirm either way? If it is no longer used we will stop
>>>>> collecting it.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Nuria
>>>>>
>>>>> ___
>>>>> Analytics mailing list
>>>>> Analytics@lists.wikimedia.org
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jonathan T. Morgan
>>>> Senior Design Researcher
>>>> Wikimedia Foundation
>>>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>>>
>>>> ___
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>>
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>>
>> --
>> --Madhu :)
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Echo schema eventlogging

2015-12-16 Thread Marcel Ruiz Forns
>
> Sure, it doesn't have space problems, but the problem remains that with a
> table this large, it's impossible to query and get results in our lifetime.

I see, makes sense.

I think in this case moving all of the data to Hadoop and blacklisting it
> from the mysql inserter seems like the right thing to do.

I agree. We should implement partial auto-purging in Hadoop though. In the
Echo schema some fields should still be purged.
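
As a toy illustration of what that partial purging could look like (the
field names and retention window below are made-up examples, not the
strategy agreed on the schema's talk page):

import datetime

# Illustration only: which fields are sensitive and how long to keep them
# would come from the schema's agreed purging strategy, not from this code.
SENSITIVE_FIELDS = {"clientIp", "userAgent"}
RETENTION_DAYS = 90

def purge_old_event(event, now=None):
    """Drop sensitive fields from events older than the retention window."""
    now = now or datetime.datetime.utcnow()
    # Assumption: the event carries an epoch-seconds timestamp.
    event_time = datetime.datetime.utcfromtimestamp(event["timestamp"])
    if (now - event_time).days > RETENTION_DAYS:
        for field in SENSITIVE_FIELDS:
            event.pop(field, None)
    return event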

On Wed, Dec 16, 2015 at 3:07 PM, Dan Andreescu <dandree...@wikimedia.org>
wrote:

> Just spoke with Jaime Crespo and he confirmed that:
>>
>>- m4-master (master EL database) only holds events for the last 45
>>days to avoid space problems. That's for all tables including Echo.
>>
>>- analytics-storage is the replica that keeps the historical data and
>>is meant to apply the specific purging strategy agreed in the schema's 
>> talk
>>page. This database does not have space problems (yet).
>>
>> Sure, it doesn't have space problems, but the problem remains that with a
> table this large, it's impossible to query and get results in our
> lifetime.  So we need to come up with some better solutions where we have
> these huge volumes of valuable data.  I think in this case moving all of
> the data to Hadoop and blacklisting it from the mysql inserter seems like
> the right thing to do.  The only reason for data to exist in mysql should
> be if we're querying data on a frequent period basis and taking actions
> based on the results of those queries.  Otherwise it's a waste of resources
> and we should allocate that disk space to something else.
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] [Outage] Small data loss in raw_webrequest on 2015-12-15

2015-12-16 Thread Marcel Ruiz Forns
Hi Analytics,

Yesterday, Dec 15, during the course of 1 hour (17h to 18h UTC) there was
an irrecoverable raw_webrequest data loss of ~30%: 25.6% (misc), 19.5%
(mobile), 19.1% (text), 39.1% (upload). This represents around 1% of the
data for that day.

The loss was due to the enabling of IPSec, which encrypts varnishkafka
traffic between caches in remote datacenters and the Kafka brokers in
eqiad. During a period of about 40 minutes, no webrequest logs from
remote datacenters were successfully produced to Kafka.

Here's the outage note:
https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#Changes_and_known_problems_since_2015-03-04
Sorry for the inconvenience.

-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] EventLogging database outage next Tuesday 2015-12-15 10:00 UTC (2 hours)

2015-12-11 Thread Marcel Ruiz Forns
Hi Analytics,

Next Tuesday, Dec 15, between 10:00 and 12:00 UTC
EventLogging's database m4-master will be down for maintenance.

Impact of the outage:

   - Events generated during the mentioned time window will wait until the
   outage is over to be inserted in the m4-master database. These events and
   the ones generated immediately after the outage can take a couple hours to
   catch up and be inserted.

Everything else should work as normal during the outage:

   - Events can be sent normally to EventLogging server, they'll be
   buffered in Kafka.
   - Events will be written to the logs, and forwarded to other systems
   normally. The only process that will stop is the mysql-consumer.
   - Queries to analytics-store can be executed normally. The outage is
   local to m4-master and does not affect replicas.

Feel free to ask any questions you have on this.

Cheers!

-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] EventLogging outage in progress?

2015-11-30 Thread Marcel Ruiz Forns
Team, I checked and, indeed, EventLogging database needs backfilling from
2015-11-27 01:00 until 2015-11-27 07:00. I updated the docs and started the
backfilling process. I'll let you know when it is finished.
Cheers

On Fri, Nov 27, 2015 at 8:31 PM, Oliver Keyes <oke...@wikimedia.org> wrote:

> It seems like it would depend on the class of error. 48 hours for
> events not syncing, fine. 48 hours of /total data loss/ is a
> completely different class of problem.
>
> On 27 November 2015 at 11:35, Nuria Ruiz <nu...@wikimedia.org> wrote:
> >>Unfortunately, the only team-members working full-time yesterday and
> today
> >> are we Europe folks.
> >>We weren't there when that happened and we don't get those alerts on the
> >> phone, we should though.
> > Given that this system is tier-2 i do not think we need an immediate
> > response, 24 hours should be an acceptable ETA. I would say even 48.
> >
> > On Fri, Nov 27, 2015 at 2:31 AM, Marcel Ruiz Forns <mfo...@wikimedia.org
> >
> > wrote:
> >>
> >> Thanks, Ori, for having a look at this and restarting EL.
> >>
> >> I understand it was 01:30 UTC on Friday (today), not Thursday. It went
> on
> >> during 5-6 hours.
> >> Unfortunately, the only team-members working full-time yesterday and
> today
> >> are we Europe folks.
> >> We weren't there when that happened and we don't get those alerts on the
> >> phone, we should though.
> >>
> >> This problem happened already like a month ago. We'll backfill the
> missing
> >> events and will investigate.
> >> Thanks again for the heads-up.
> >>
> >> On Fri, Nov 27, 2015 at 8:01 AM, Ori Livneh <o...@wikimedia.org> wrote:
> >>>
> >>> On Thu, Nov 26, 2015 at 10:46 PM, Ori Livneh <o...@wikimedia.org>
> wrote:
> >>>>
> >>>> Seems that eventlog1001 has not received any events since 01:30 UTC on
> >>>> Thursday
> >>>>
> >>>>
> >>>>
> http://ganglia.wikimedia.org/latest/graph.php?r=day=xlarge=Miscellaneous+eqiad=eventlog1001.eqiad.wmnet===hide=0=140128.28=bytes_in=bytes%2Fsec=Bytes+Received
> >>>>
> >>>> This is pretty severe; I'd page if it wasn't a US holiday.
> >>>
> >>>
> >>> Kafka clients on eventlog1001 were in a "Autocommitting consumer
> offset"
> >>> death-loop and not receiving any events from the Kafka brokers. I ran
> >>> eventloggingctl stop / eventloggingctl start and they recovered. Needs
> to be
> >>> investigated more thoroughly. Otto, can you follow up?
> >>>
> >>>
> >>> ___
> >>> Analytics mailing list
> >>> Analytics@lists.wikimedia.org
> >>> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>>
> >>
> >>
> >>
> >> --
> >> Marcel Ruiz Forns
> >> Analytics Developer
> >> Wikimedia Foundation
> >>
> >> ___
> >> Analytics mailing list
> >> Analytics@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>
> >
> >
> > ___
> > Analytics mailing list
> > Analytics@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
>
>
>
> --
> Oliver Keyes
> Count Logula
> Wikimedia Foundation
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>



-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] EventLogging outage in progress?

2015-11-27 Thread Marcel Ruiz Forns
Thanks, Ori, for having a look at this and restarting EL.

I understand it was 01:30 UTC on Friday (today), not Thursday. It went on
for 5-6 hours.
Unfortunately, the only team members working full-time yesterday and today
are those of us in Europe.
We weren't there when that happened, and we don't get those alerts on the
phone, though we should.

This problem already happened about a month ago. We'll backfill the missing
events and investigate.
Thanks again for the heads-up.

On Fri, Nov 27, 2015 at 8:01 AM, Ori Livneh <o...@wikimedia.org> wrote:

> On Thu, Nov 26, 2015 at 10:46 PM, Ori Livneh <o...@wikimedia.org> wrote:
>
>> Seems that eventlog1001 has not received any events since 01:30 UTC on
>> Thursday
>>
>>
>> http://ganglia.wikimedia.org/latest/graph.php?r=day=xlarge=Miscellaneous+eqiad=eventlog1001.eqiad.wmnet===hide=0=140128.28=bytes_in=bytes%2Fsec=Bytes+Received
>>
>> This is pretty severe; I'd page if it wasn't a US holiday.
>>
>
> Kafka clients on eventlog1001 were in a "Autocommitting consumer offset"
> death-loop and not receiving any events from the Kafka brokers. I ran
> eventloggingctl stop / eventloggingctl start and they recovered. Needs to
> be investigated more thoroughly. Otto, can you follow up?
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Changes on eventlogging

2015-11-03 Thread Marcel Ruiz Forns
Awesome Jaime, thanks!

On Tue, Nov 3, 2015 at 6:25 PM, Nuria Ruiz <nu...@wikimedia.org> wrote:

> Jaime:
>
> (Adding analytics e-mail list)
>
> Please send notes regarding eventlogging to analytics@
>
> Thanks,
>
> Nuria
>
>
> On Tue, Nov 3, 2015 at 9:03 AM, Jaime Crespo <jcre...@wikimedia.org>
> wrote:
>
>> As actionables of
>> https://wikitech.wikimedia.org/wiki/Incident_documentation/20151022-EventLogging#Actionables
>>
>> I have migrated the replication method used by Sean to puppet and added
>> monitoring (which was missing initially). While the current state could be
>> iteratively improved, it no longer depends on a single process on a
>> single machine, that could be restarted or fail at any time. We also have
>> logs and alerts to identify issues immediately. This will also allow
>> purging rows easier and faster in the short future.
>>
>> Of course, any migration could have regressions, so please monitor any
>> issue you may find (as I am currently doing, and I have not yet found).
>> This will hopefully prevent the issue to happen again.
>>
>> Regards,
>>
>> --
>> Jaime Crespo
>> <http://wikimedia.org>
>>
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Event Logging incident

2015-10-27 Thread Marcel Ruiz Forns
Yesterday, we finally completed the backfilling of the affected time range.
In the end we managed to get the missing data from the archived logs.
So, please, re-run any reports for October 14th, between 06:00 UTC and
21:00 UTC.
Thank you, and apologies for any inconvenience.



On Thu, Oct 15, 2015 at 11:07 PM, Dan Andreescu <dandree...@wikimedia.org>
wrote:

> quick follow up: it looks like the majority of data was not recovered, so
> re-running your reports won't help.  We'll try to recover what's missing if
> we still have it in Kafka.  Let us know if you need to be kept up to date
> about this.
>
> On Thu, Oct 15, 2015 at 4:31 PM, Dan Andreescu <dandree...@wikimedia.org>
> wrote:
>
>> The consumer that writes Event Logging events into the SQL database was
>> down yesterday for about 9 hours.  We restarted it and it consumed the data
>> it missed, and inserted it into the database.  The incident report is here:
>>
>>
>> https://wikitech.wikimedia.org/wiki/Incident_documentation/20151015-EventLogging
>>
>> We don't yet know if any data was lost, I'm going to run some queries now
>> on a few schemas and I'll update the incident report.
>>
>> If you had reports run on October 14th, between 06:00 UTC and 21:00 UTC,
>> you should re-run them.
>>
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Canonical location for metrics documentation

2015-10-14 Thread Marcel Ruiz Forns
>
> I don't think that is feasible or reasonable for the documentation that is
> currently on Meta.


What would be the drawbacks?

On Wed, Oct 14, 2015 at 3:38 PM, Aaron Halfaker <ahalfa...@wikimedia.org>
wrote:

> I propose we move everything to wikitech now
>
>
> I don't think that is feasible or reasonable for the documentation that is
> currently on Meta.
>
> On Wed, Oct 14, 2015 at 9:34 AM, Dan Andreescu <dandree...@wikimedia.org>
> wrote:
>
>> We have a documentation cleanup day coming up soon, and we've just got
>> delete permissions so we can actually clean.  We've been moving everything
>> Analytics-infrastructure related to wikitech and that's where we'd prefer
>> to see everything.  The nuanced purpose of each wiki is great, but before
>> we can get to that, we have to establish a trusted, complete, and
>> discoverable source of documentation.  Then we can start catering to the
>> audiences of each wiki.
>>
>> I propose we move everything to wikitech now, and establish a single page
>> on meta and mediawiki that point to the different main pages on wikitech.
>>
>> On Wed, Oct 14, 2015 at 1:14 AM, Federico Leva (Nemo) <nemow...@gmail.com
>> > wrote:
>>
>>> Neil P. Quinn, 14/10/2015 02:30:
>>>
>>>> We currently have metrics documentation in two different places
>>>>
>>>
>>> What sort of documentation do you have in mind? Meta has the definitions
>>> which WMF hopes to see used in other fields as well, while MediaWiki.org
>>> and wikitech have technical documentation about stats.wikimedia.org and
>>> other stuff produced by Analytics.
>>>
>>> Nemo
>>>
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>
>>
>> _______
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] General framework for updating database reports

2015-10-06 Thread Marcel Ruiz Forns
Dan, thanks for the careful explanation.

I wanted to add that there is a small documentation on Wikitech for the
reportupdater tool:
https://wikitech.wikimedia.org/wiki/Analytics/Reportupdater
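
As a toy illustration of the "only re-run missing periods" behavior that Dan
describes below (this is not reportupdater's actual code, and the helper
names are made up):

import csv
import datetime

def existing_dates(report_path):
    """Dates (first column, ISO format) already present in the report output."""
    try:
        with open(report_path) as report_file:
            return {row[0] for row in csv.reader(report_file, delimiter="\t")
                    if row}
    except FileNotFoundError:
        return set()

def missing_dates(report_path, start, end):
    """Daily periods between start and end with no row in the output yet."""
    have = existing_dates(report_path)
    day = start
    while day <= end:
        if day.isoformat() not in have:
            yield day
        day += datetime.timedelta(days=1)

# Hypothetical usage: re-run the templated SQL only for the missing days.
# for day in missing_dates("report.tsv", datetime.date(2015, 1, 1),
#                          datetime.date(2015, 1, 31)):
#     run_templated_query(day)  # made-up hook standing in for the SQL template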

Cheers!


On Tue, Oct 6, 2015 at 6:38 PM, Dan Andreescu <dandree...@wikimedia.org>
wrote:

> Hi Aaron,
>
> I like the tool Marcel built in the spring.  It's called reportupdater and
> it's been pretty stable and useful but it's not documented because we
> haven't publicized it yet.  What it does is allow you to configure
> templates for SQL or shell scripts that take parameters and generate
> separated value files as output.  You can specify the time granularity that
> you want results for and it will re-run jobs for time periods that don't
> exist in the output (because of failures, etc.).  It also does other useful
> things like reports errors like a champ and ensures only one instance is
> running at any given time.  You can even change your scripts to output new
> columns or re-arrange the column order and it will morph the output files
> to match the new header (you just can't remove columns - because that's
> crazy!).
>
> If you wanna talk more about it I'd like to give you the details privately
> because I'd want to start documenting this tool properly as I do.
>
> On Tue, Oct 6, 2015 at 12:16 PM, Aaron Halfaker <ahalfa...@wikimedia.org>
> wrote:
>
>> Hey folks,
>>
>> I know there was some work in the past on systems to support keeping
>> database reports up to date.  I'm looking into this type of work with Jeph
>> Paul now and I realized I don't have any good pointers to this past work.
>> Right now, we're looking at running database reports based on cron jobs and
>> checking the recentchanges table to make sure that replication isn't too
>> lagged.  Is there a better way?
>>
>> FWIW, I expect these queries to run daily and have a runtime of up to an
>> hour.
>>
>> -Aaron
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Survey] Pageview API

2015-09-11 Thread Marcel Ruiz Forns
+1 Adam

Also, maybe *top-articles* instead of *top*, to avoid naming collision in
the future?

On Sat, Sep 12, 2015 at 12:27 AM, Adam Baso <ab...@wikimedia.org> wrote:

> I'd be in favor of both. Maybe with a little tweak to the pathing:
>
> /top/{project}/{access}/days/{days-in-the-past}
>
>  /top/{project}/{access}/range/{start}/{end}
>
> with "days" or "range" maybe being earlier in the forward slash separated
> spec if it doesn't read well semantically.
>
>
> On Fri, Sep 11, 2015 at 3:14 PM, Dan Andreescu <dandree...@wikimedia.org>
> wrote:
>
>> It wouldn't be too hard to offer both, but I'm thinking it might be
>> confusing for a consumer.  I think ultimately the decision should be up to
>> the people using this data, because the use cases are fairly different for
>> each form.  If people ask for both, we'll do both.
>>
>> Leila, we'd love to have page_ids as well, but we'd have to block the
>> release on a bigger effort to reliably mirror mediawiki databases in Hadoop
>> for processing, so we'll probably punt on that for now.  But we have more
>> than many reasons to work on that sooner than later.
>>
>> On Fri, Sep 11, 2015 at 6:09 PM, Gabriel Wicke <gwi...@wikimedia.org>
>> wrote:
>>
>>> The former might be slightly easier to cache, and can be linked to /
>>> pulled in statically, without a need to dynamically construct a URL. Would
>>> it be hard to offer both?
>>>
>>> On Fri, Sep 11, 2015 at 3:06 PM, Leila Zia <le...@wikimedia.org> wrote:
>>>
>>>> It's getting exciting. :-)
>>>>
>>>> I'd go with choice 2 since it gives more control to the user while
>>>> offering what the user can get through choice 1 as well.
>>>>
>>>> Question: will we get page_ids or page_titles or both? It's good to
>>>> have both.
>>>>
>>>> Leila
>>>>
>>>> On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu <dandreescu@wikimedia
>>>> .org> wrote:
>>>>
>>>>> Hi everyone.  End of quarter is rapidly approaching and I wanted to
>>>>> ask a quick question about one of the endpoints we want to push out.  We
>>>>> want to let you ask "what are the top articles" but we're not sure how to
>>>>> structure the URL so it's most useful to you.  Here are the choices:
>>>>>
>>>>> Choice 1. /top/{project}/{access}/{days-in-the-past}
>>>>>
>>>>> Example: top articles via all en.wikipedia sites for the past 30 days:
>>>>> /top/en.wikipedia/all-access/30
>>>>>
>>>>>
>>>>> Choice 2. /top/{project}/{access}/{start}/{end}
>>>>>
>>>>> Example: top articles via all en.wikipedia sites from June 12th, 2014
>>>>> to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30
>>>>>
>>>>>
>>>>> (in all of those,
>>>>>
>>>>> * {project} means en.wikipedia, commons.wikimedia, etc.
>>>>> * {access} means access method as in desktop, mobile web, mobile app
>>>>>
>>>>> )
>>>>>
>>>>> Which do you prefer?  Would any other query style be useful?
>>>>>
>>>>> ___
>>>>> Analytics mailing list
>>>>> Analytics@lists.wikimedia.org
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>>>
>>>>
>>>> ___
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>
>>>
>>> --
>>> Gabriel Wicke
>>> Principal Engineer, Wikimedia Foundation
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-10 Thread Marcel Ruiz Forns
If we are going to completely denormalize the data sets for anonymization,
and we expect just slice-and-dice queries to the database, I think we
wouldn't take much advantage of a relational DB: it wouldn't need to
aggregate values, slice, or dice, because all slices and dices would be
precomputed, right?

It seems to me that the nature of these denormalized/anonymized data sets is
more like a key-value store. That's why I suggested Voldemort at first
(which, they say, has slightly faster reads than Cassandra), but I see the
preference for Cassandra, since it is a known tool inside WMF.
So, +1 for Cassandra!

However, if we foresee the need to add more data sets to the same DB, or to
query them in a different way, a key-value store would be a limitation.


On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu dandree...@wikimedia.org
wrote:



 On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke gwi...@wikimedia.org
 wrote:

 On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu dandree...@wikimedia.org
 wrote:

 Eric, I think we should allow arbitrary querying on any dimension for
 that first data block.  We could pre-aggregate all of those combinations
 pretty easily since the dimensions have very low cardinality.


 Are you thinking about something like
 /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more
 dimensions?


 only one more right now, called agent_type.  But this is just the first
 cube and we're planning a geo cube with more dimensions and are probably
 going to try and release data split up by access method (mobile, desktop,
 etc.) and other dimensions as people need them.  This will be tricky as we
 try to protect privacy but that aside, the number of dimensions per
 endpoint, right now, seems to hover around 4 or 5.




 For the article-level data, no, we'd want just basic timeseries querying.

 Thanks Gabriel, if you could point us to an example of these secondary
 RESTBase indices, that'd be interesting.


  The API used to define these tables is described in
 https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md,
 and the algorithm used to keep those indexes up to date is described in
 https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/SecondaryIndexes.md
 and largely implemented in
 https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/secondaryIndexes.js
 .


 very cool, thx.

 ___
 Analytics mailing list
 Analytics@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics


___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Analytics-internal] Parsing the app version into the user agent map

2015-06-08 Thread Marcel Ruiz Forns
Sure,

+analytics

On Mon, Jun 8, 2015 at 5:50 PM, Adam Baso ab...@wikimedia.org wrote:

 Okay to move this to the public list and remove the internal list?


 On Monday, June 8, 2015, Joseph Allemandou jalleman...@wikimedia.org
 wrote:

 Hi,
 In fact instead of using isAppPageview UDF, one should use access_method
 = 'mobile app' :)

 On Mon, Jun 8, 2015 at 12:44 PM, Marcel Ruiz Forns mfo...@wikimedia.org
 wrote:

 + analytics internal

 Hi Jon and Adam,

 Yes, this totally helps. It confirms the work that we are doing.

 In fact, 3 of the items you list are already working and available for
 querying:

- app (yes/no), through the UDF in hive 'isAppPageview()'
- OS (android, iOS, other), through the user_agent_map['os_family']
field
- OS version, through the user_agent_map['os_major'] field

 And, as discussed, we'll shortly add the possibility of querying for
 the 'app version' through the user_agent_map['app_version'] field.

 Thanks!

 Marcel


 On Fri, Jun 5, 2015 at 10:48 PM, Jon Katz jk...@wikimedia.org wrote:

 +Adam, who has been diving deep into this stuff lately.

 Hi Marcel,
 I am a bit swamped right now, so can't look at the tickets, but of the
 strings you showed, the fields more important to me are:

-
 *app (yes/no) *
-
 *OS (android,iOS, other) *
- *app version (numeric)*
- tablet/phone (ios only, right)
- OS version (is this possible?)

 Bolded are big deals :)  Does this help?
 Thanks!
 -J

 On Fri, Jun 5, 2015 at 11:13 AM, Marcel Ruiz Forns 
 mfo...@wikimedia.org wrote:

 Hi Jon,

 Here is Marcel from Analytics, how are you?!

 I am developing the analytics-refinery-source code that will parse the
 missing user-agent info for the mobile app requests. See the task:
 https://phabricator.wikimedia.org/T99932

 I think I understand what your team wants in that respect, so I
 already implemented the functionality, you can check it here:
 https://gerrit.wikimedia.org/r/#/c/216060/
 but I still wanted to confirm with you :-)

 Considering these user agent strings:

 WikipediaApp/2.0-r-2015-04-23 (Android 5.0.1; Phone) Google Play
 WikipediaApp/4.1.2 (iPhone OS 8.3; Tablet)

 The program should take the part after "WikipediaApp/" and before the
 next " " (space), for example: "2.0-r-2015-04-23" or "4.1.2".
 And store it as part of the user agent map, in a field named, e.g.,
 "app_version".
 So that it is easily queryable, like: SELECT
 user_agent_map['app_version'] FROM ...
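
(Illustrative only: a minimal Python sketch of that parsing rule; the real
implementation is the UDF in the Gerrit change linked above.)

import re

APP_VERSION_RE = re.compile(r'^WikipediaApp/(\S+)')

def app_version(user_agent):
    """Return the token after 'WikipediaApp/' and before the next space."""
    match = APP_VERSION_RE.match(user_agent)
    return match.group(1) if match else None

assert app_version(
    'WikipediaApp/2.0-r-2015-04-23 (Android 5.0.1; Phone) Google Play'
) == '2.0-r-2015-04-23'
assert app_version('WikipediaApp/4.1.2 (iPhone OS 8.3; Tablet)') == '4.1.2'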

 Is that right?

 That was the question. Thanks!

 Marcel





 ___
 Analytics-internal mailing list
 analytics-inter...@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics-internal




 --
 *Joseph Allemandou*
 Data Engineer @ Wikimedia Foundation
 IRC: joal


 ___
 Analytics-internal mailing list
 analytics-inter...@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics-internal


___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] [Technical] Pick storage for pageview cubes

2015-06-08 Thread Marcel Ruiz Forns
*This discussion is intended to be a branch of the thread: [Analytics]
Pageview API Status update.*

Hi all,

We Analytics are trying to *choose a storage technology to keep the
pageview data* for analysis.

We don't want to get to a final system that covers all our needs yet (there
are still things to discuss), but rather have something *that implements the
current stats.grok.se functionalities* as a first step. This way we can get
a better grasp of what our difficulties and limitations will be regarding
performance and privacy.

The objective of this thread is to *choose 3 storage technologies*. We will
later set up and fill each of them with 1 day of test data, evaluate them,
and decide which one we will go for.

There are 2 blocks of data to be stored:

   1. *Cube that represents the number of pageviews broken down by the
   following dimensions*:
  - day/hour (size: 24)
  - project (size: 800)
  - agent type (size: 2)

To test with an initial level of anonymity, all cube cells whose value is
less than k=100 have an undefined value. However, to be able to retrieve
aggregated values without losing those undefined counts, all combinations
of slices and dices are precomputed before anonymization and belong to the
cube, too (a toy sketch of this pre-computation is included at the end of
this mail). Like this:

dim1,  dim2,  dim3,  ...,  dimN,  val
   a,  null,  null,  ...,  null,  15     // pv for dim1=a
   a,     x,  null,  ...,  null,  34     // pv for dim1=a & dim2=x
   a,     x,     1,  ...,  null,  27     // pv for dim1=a & dim2=x & dim3=1
   a,     x,     1,  ...,  true,  undef  // pv for dim1=a & dim2=x & ... & dimN=true

So the size of this dataset would be something between 100M and 200M
records per year, I think.


   2. *Timeseries dataset that stores the number of pageviews per article
   in time with*:
  - maximum resolution: hourly
  - diminishing resolution over time is accepted if there are
  performance problems

article (dialect.project/article),   day/hour,        value
en.wikipedia/Main_page,              2015-01-01 17,   123456
en.wiktionary/Bazinga,               2015-01-02 13,    23456

It's difficult to calculate the size of that. How many articles do we have?
34M?
But not all of them will have pageviews every hour...



*Note*: I guess we should consider that the storage system will presumably
have high volume batch inserts every hour or so, and queries that will be a
lot more frequent but also a lot lighter in data size.

And that is that.
*So please, feel free to suggest storage technologies, comment, etc!*
And if there is any assumption I made that you do not agree with, please
comment on that too!

I will start the thread with 2 suggestions:
1) *PostgreSQL*: Seems to be able to handle the volume of the data and
knows how to implement diminishing resolution for timeseries.
2) *Project Voldemort*: As we are denormalizing the cube entirely for
anonymity, the db doesn't need to compute aggregations, so it may well be a
key-value store.
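
To make the denormalization concrete, here is a toy Python sketch of how the
pre-aggregated, k-anonymized cube rows could be produced (pure illustration
of the idea described above, not the actual pipeline):

from collections import Counter
from itertools import combinations

K = 100
DIMS = ("hour", "project", "agent_type")

def cube_rows(records):
    """records: iterable of dicts with the DIMS keys, one dict per pageview."""
    counts = Counter()
    for record in records:
        # Count the pageview once for every subset of dimensions, rolling the
        # remaining dimensions up to None (meaning "all").
        for size in range(len(DIMS) + 1):
            for kept in combinations(DIMS, size):
                key = tuple(record[d] if d in kept else None for d in DIMS)
                counts[key] += 1
    for key, count in counts.items():
        # k-anonymity: counts below K become undefined (None), but the row is
        # still emitted, so aggregates computed before anonymization are kept.
        yield key + ((count if count >= K else None),)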

Cheers!

Marcel
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] EventLogging issues 2015-05-06

2015-05-19 Thread Marcel Ruiz Forns
The data for this period has been successfully backfilled.

Cheers!

On Fri, May 8, 2015 at 4:23 PM, Aaron Halfaker ahalfa...@wikimedia.org
wrote:

 Thank you!

 On Fri, May 8, 2015 at 5:12 AM, Marcel Ruiz Forns mfo...@wikimedia.org
 wrote:

 EventLogging suffered from performance problems and data loss from
 Tuesday 2015-05-05 22:00 UTC to Wednesday 2015-05-06 20:00 UTC (22 hours).

 During that period, an exceptional amount of events were sent to EL
 server for a given schema. The system could not handle them properly, and
 this caused data loss (30%-40% during the period) and some small gaps in
 the db. All schemas were affected.

 The missing data will be backfilled during this week.

 Phab Task: https://phabricator.wikimedia.org/T98588
 Incident documentation:

 https://wikitech.wikimedia.org/wiki/Incident_documentation/20150506-EventLogging

 Cheers,

 Marcel

 ___
 Analytics mailing list
 Analytics@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics



 ___
 Analytics mailing list
 Analytics@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics


___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Technical] WMF-Last-Access

2015-04-27 Thread Marcel Ruiz Forns
+1 'last'
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] [Technical] EventLogging issues

2015-04-20 Thread Marcel Ruiz Forns
Hi Analytics,

We have found a problem that has been affecting the EventLogging data for
one month. Since March 22, 2015 there have been several gaps (of around 1-2
hours length each) without data in all schema tables.

You can see the details in the following links:

Phabricator task:
https://phabricator.wikimedia.org/T96082
Incident documentation:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150409-EventLogging
Technical discussion:
https://lists.wikimedia.org/pipermail/analytics/2015-April/003775.html

The problem still persists, although less frequently, thanks to the sampling
added to the Edit schema events, which has reduced EL throughput.

Next steps:
* The backfilling of the data gaps will be carried out this week.
* Implement a patch to avoid the problem ASAP.
* Implement a consistent solution to EventLogging scaling problems.
* Backfill any gaps that occur during implementation of the solutions.

Marcel
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Technical] Strange behavior of EL m4-master

2015-04-15 Thread Marcel Ruiz Forns
Hi Sean,


 *However*, the consumer logs indicate the insert timestamp only, not the
 event timestamp (which goes to the table). So it could be that there's some
 data loss inside the consumer code (or in zmq?) that wouldn't stop the
 write flow, but would skip a segment of events. I'll look deeper into this.


We've deployed an improvement to the EL mysql consumer logs, to be sure
that the events being inserted at the time of the DB gaps did indeed
correspond to the missing data. And we've seen that the answer is yes: the
consumer executes the missing inserts in time and without errors in the
sqlalchemy client.

Can you supply some specific records from the EL logs with timestamps
 that should definitely be in the database, so we can scan the database
 binlog for specific UUIDs or suchlike?


Here are three valid events that were apparently inserted correctly, but
don't appear in the db.
http://pastebin.com/8wm6qkkE
(they are performance events and contain no sensitive data)

-- can you give me some idea of how long your "at other moments" delay is?


I followed the master-slave replication lag for some hours and noticed a
pattern: the lag grows progressively, by roughly 10 minutes per hour,
reaching 1 to 2 hours. At that point, the data gap happens and the
replication lag drops back to a few minutes. I only managed to catch a data
gap live twice, so that's definitely not conclusive, but there's a
hypothesis that the two problems are related.

Sean, I hope that helps answer your questions.
Let us know if you have any ideas on this.

Thank you!

Marcel

On Tue, Apr 14, 2015 at 9:15 PM, Marcel Ruiz Forns mfo...@wikimedia.org
wrote:

 Sean, thanks for the quick response:


 We have a binary log on the EL master that holds the last week of
 INSERT statements. It can be dumped and grepped, eg looking at
 10-minute blocks around 2015-04-13 16:30:


 Good to know!

 Zero(!) during 10min after 16:30 doesn't look good. This means the
 database master did not see any INSERTs with 20150413163* timestamps
 within the last week.


 Ok, makes more sense.


 Can you describe how you know that events were
 written normally? Simply a lack of errors from mysql consumer?


 MySQL consumer log not only lacks errors, but has records of successful
 writes to the db at the time of the problems. Also, the processor logs
 indicate homogeneous throughput of valid events during all times.

 *However*, the consumer logs indicate the insert timestamp only, not the
 event timestamp (which goes to the table). So it could be that there's some
 data loss inside the consumer code (or in zmq?) that wouldn't stop the
 write flow, but would skip a segment of events. I'll look deeper into this.


 Can you supply some specific records from the EL logs with timestamps
 that should definitely be in the database, so we can scan the database
 binlog for specific UUIDs or suchlike?


 I'll try to get those.

 -- can you give me some idea of how long your "at other moments" delay is?


 I'll observe the db during the day and give you an estimate.

 Thanks!


___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] [Technical] Strange behavior of EL m4-master

2015-04-14 Thread Marcel Ruiz Forns
Hi Sean,

Here's Marcel from Analytics.

I'd like to discuss with you some strange behaviors that we've observed on
the EventLogging database (m4-master.eqiad.wmnet).

1) There are some time spans where there is no data in any table. Examples
follow:

   - 2015-04-09 17:20 - 18:35
   - 2015-04-11 03:35 - 05:20
   - 2015-04-11 14:00 - 15:20
   - 2015-04-11 19:00 - 20:00
   - 2015-04-12 14:30 - 15:40
   - 2015-04-13 11:35 - 12:30
   - 2015-04-13 16:30 - 17:35

This is happening a lot, as you can see, so we are really concerned about
it.
These outages are not explained by any of the EventLogging logs (processor
logs, consumer logs) which confirm that the events were actually sent
normally to the database (and written) for those time spans.

2) This may or may not be related, but we observed that sometimes the last
events inserted in the tables date from 30 minutes to 1.5 hours in the past.
At other moments, the last inserts date from seconds ago. This may be
expected due to slave replication, but we're not sure.

Do you have some opinion on what can be happening? Any thought would help a
lot.

Thanks!

Marcel
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Technical] Strange behavior of EL m4-master

2015-04-14 Thread Marcel Ruiz Forns
Sean, thanks for the quick response:


 We have a binary log on the EL master that holds the last week of
 INSERT statements. It can be dumped and grepped, eg looking at
 10-minute blocks around 2015-04-13 16:30:


Good to know!

Zero(!) during 10min after 16:30 doesn't look good. This means the
 database master did not see any INSERTs with 20150413163* timestamps
 within the last week.


Ok, makes more sense.


 Can you describe how you know that events were
 written normally? Simply a lack of errors from mysql consumer?


MySQL consumer log not only lacks errors, but has records of successful
writes to the db at the time of the problems. Also, the processor logs
indicate homogeneous throughput of valid events during all times.

*However*, the consumer logs indicate the insert timestamp only, not the
event timestamp (which goes to the table). So it could be that there's some
data loss inside the consumer code (or in zmq?) that wouldn't stop the
write flow, but would skip a segment of events. I'll look deeper into this.


 Can you supply some specific records from the EL logs with timestamps
 that should definitely be in the database, so we can scan the database
 binlog for specific UUIDs or suchlike?


I'll try to get those.

-- can you give me some idea of how long your "at other moments" delay is?


I'll observe the db during the day and give you an estimate.

Thanks!
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics