Re: [Analytics] Availability of hourly pagecounts files
That's fascinating, John; thank you. I'm copying this to wiki-research-l and Fabian Suchanek, who gave the first part of the Research Showcase last month.

What do you like for coding stories? https://quanteda.io/reference/dfm.html ? Sentiment is hard because errors are often 180 degrees away from correct.

How do you both feel about Soru et al. (June 2018), "Neural Machine Translation for Query Construction and Composition," https://www.researchgate.net/publication/326030040 ?

On Sat, Jan 11, 2020 at 3:46 PM John Urbanik wrote:
> Jim,
>
> I used to work as the chief data scientist at Collin's company.
>
> I'd suggest looking at things like relationships between the views / edits
> for sets of pages as well as aggregating large sets of page views for
> different pages in various ways. There isn't a lot of literature that is
> directly applicable, and I can't disclose the precise methods being used due
> to NDA.
>
> In general, much of the pageview data is Weibull or GEV distributed on top of
> being non-stationary, so I'd suggest looking into papers from extreme value
> theory literature as well as literature around Hawkes/Queue-Hawkes processes.
> Most traditional ML and signal processing is not very effective without doing
> some pretty substantial pre-processing, and even then things are pretty
> messy, depending on what you're trying to predict; most variables are
> heteroskedastic w.r.t. pageviews and there are a lot of real world events that
> can cause false positives.
>
> Further, concept drift is pretty rapid in this space and structural breaks
> happen quite frequently, so the reliability of a given predictor can change
> extremely rapidly. Understanding how much training data to use for a given
> prediction problem is itself a super interesting problem since there may be
> some horizon after which the predictor loses power, but decreasing the
> horizon too much means overfitting and loss of statistical significance.
>
> Good luck!
>
> John
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
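The training-horizon trade-off John mentions can be illustrated with a toy comparison. This is only a sketch and not the method used at John's former employer (which is under NDA): it assumes a plain list of counts, uses a trailing-mean predictor as a stand-in for a real model, and the name `best_window` is hypothetical.

```python
from statistics import mean

def best_window(series, candidates):
    """Pick the trailing-window length with the lowest one-step-ahead error.

    The "predictor" is just the mean of the last w observations, and every
    candidate window is scored on the same time steps so the errors compare.
    """
    start = max(candidates)
    if len(series) <= start:
        raise ValueError("series must be longer than the largest window")
    scores = {}
    for w in candidates:
        errors = [abs(series[t] - mean(series[t - w:t]))
                  for t in range(start, len(series))]
        scores[w] = mean(errors)  # mean absolute error for this window
    return min(scores, key=scores.get), scores

# A series with one structural break (a level shift from 10 to 100).
series = [10] * 30 + [100] * 30
best, scores = best_window(series, [3, 20])
print(best, scores)
```

On a noise-free series like this one the short window always recovers faster after the break; with noisy real data the short window's estimate gets less stable, which is the other side of the trade-off.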
Re: [Analytics] Availability of hourly pagecounts files
Colin,

Are hourly pageviews working for you again? Where can we read more about how your company uses them for predictions?

I've been working on this problem for a long time, for scaling news signals, along with Google Trends (which you have to de-normalize with multiple overlapping queries, because it scales results to always be in [0,100]).

Can you post a bibliography of your favorite two or three resources from the last few years, please?

Best regards,
Jim
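For anyone unfamiliar with the de-normalization step mentioned above, here is a minimal sketch. It assumes the overlap is in time (consecutive query windows sharing at least one date) and that each window arrives as a mapping from date label to its own 0-100 value; the function name `stitch` and the input layout are hypothetical, and no Google Trends client is involved.

```python
from statistics import mean

def stitch(windows):
    """Chain overlapping windows onto the scale of the first window.

    Each window is a dict {date_label: value}, scaled independently to
    [0, 100]. A later window is rescaled by the ratio of means over the
    dates it shares with the series built so far.
    """
    combined = dict(windows[0])
    for window in windows[1:]:
        shared = [d for d in window if d in combined and window[d] > 0]
        if not shared:
            raise ValueError("consecutive windows must overlap on a nonzero point")
        scale = mean(combined[d] for d in shared) / mean(window[d] for d in shared)
        for d, v in window.items():
            combined.setdefault(d, v * scale)  # keep values already placed
    return combined

# True interest 10, 20, 40, 80; each window is rescaled so its own max is 100.
a = {"d1": 25, "d2": 50, "d3": 100}
b = {"d3": 50, "d4": 100}
print(stitch([a, b]))  # d4 comes out at twice d3, as in the true series
```

The result is still in arbitrary units (relative to the first window), but the ratios between dates are preserved across windows.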
Re: [Analytics] [Wikimedia Research Showcase] June 26, 2019 at 11:30 AM PST, 19:30 UTC
> For those that couldn't make it, is there a summary of what was said?

Full recording: https://www.youtube.com/watch?v=WiUfpmeJG7E

Slides: https://www.mediawiki.org/wiki/File:Trajectories_of_Blocked_Community_Members_-_Slides.pdf

https://meta.wikimedia.org/wiki/University_of_Virginia/Automatic_Detection_of_Online_Abuse
[Analytics] please hire a CTO who wants to protect reader privacy
I noticed just now that the Foundation is soliciting applications for a new CTO:
https://www.linkedin.com/feed/update/urn:li:activity:6515003866130505729

Can we please hire a CTO who would prefer to protect reader privacy above the interests of any state or non-state actors, whether or not they have infiltrated staff, contractor, and NDA signatory ranks, and whether or not it interferes with reader statistics and analytics?

In particular, I would like to repeat my request that we not log personally identifiable information which might increase our subpoena burden or result in privacy violation incidents. Fuzzing geolocation is okay, but we should not be transmitting IP addresses into logs across even a LAN, for example, and we certainly shouldn't be purchasing hardware with backdoor coprocessors wasting electricity and exposing us to government or similar intrusions:
https://lists.wikimedia.org/pipermail/analytics/2017-January/005696.html

Best regards,
Jim
Re: [Analytics] Wiktionary word page views?
Thank you, everyone who helped answer my question.

I have a related question about the pageviews reported for audio pronunciation files on Commons by, e.g.,
https://tools.wmflabs.org/pageviews/?project=commons.wikimedia.org=all-access=user=latest-20=File:En-us-banana.ogg

Are those numbers the authentic number of times that someone clicked "play" on the audio widget to download the raw audio, or the number of times they went to see the [[commons:File:En-us-banana.ogg]] page without necessarily playing the sound?

It's really amazing how much more popular the English Wiktionary has become over the past two years. After having analyzed a large sample of the most frequently spoken multisyllabic words, I can say with confidence that while 2016 saw about as much usage as 2015 on the English Wiktionary, 2017 showed a 35% increase, and so far it looks like 2018 will be a 25-30% increase over that. As the very frequent words in question generally haven't been edited much at all in the past five to ten years, I'm confused as to why this has happened. Does anyone have any ideas?

On Tue, Oct 23, 2018 at 3:36 PM James Salsman wrote:
>
> How can I get pageview statistics for individual words in the English
> Wiktionary?
[Analytics] Wiktionary word page views?
How can I get pageview statistics for individual words in the English Wiktionary?
Re: [Analytics] Privacy with GDPR at Stanford May 31, 5:30-7pm
Note: Nicole A. Ozer, the Technology and Civil Liberties Director at ACLU of California, has been added to this panel.

Please see also: https://teachprivacy.com/why-i-love-the-gdpr/
[Analytics] Privacy with GDPR at Stanford May 31, 5:30-7pm
[I recommend registering early:
http://web.stanford.edu/dept/law/forms/KELF18.fb
because free Stanford CLE events occasionally fill up and get closed by the organizers (or with luck, moved into larger venues.) -Jim]

The Data Privacy Shake-Up

Thursday, May 31, 2018
Stanford Rock Center for Corporate Governance
Room 290, Stanford Law School
559 Nathan Abbott Way, Stanford, CA 94305
5:30-6pm: Reception, 6-7pm: Panel Discussion.
Free and open to the public.

http://enews.law.stanford.edu/t/r-3E98AC489E9B8A952540EF23F30FEDED

As more data becomes digitized, and consumers increasingly share personal and sensitive information with devices and apps, concerns around data privacy are taking on greater importance. These concerns are fueled by recent high-profile data misuse scandals that have thrust the data gathering and privacy policies of companies into the spotlight. Join us for a discussion of data privacy on both sides of the pond, from the future of data privacy regulation in the US to the recent introduction of the GDPR with its extraterritorial effect and high maximum penalties, and the ripple effect of trends across the globe.

This session, the Fifth Annual Kirkland & Ellis Law Forum, will bring together leading in-house attorneys, private practitioners and academics to discuss their views on the shifting sands of the data privacy landscape.

Panelists:

Michael Callahan
Senior Vice President / General Counsel, LinkedIn (and incoming Executive Director of the Stanford Rock Center for Corporate Governance)

Emma L. Flett
Partner, Kirkland & Ellis

John Lynn
Partner, Kirkland & Ellis

1 hour California Continuing Legal Education credit
[Analytics] weekly periodicity mode
[Crossposting to Research and Analytics lists]

Most Wikipedia articles with a weekly periodicity show more pageviews on a typical weekday than on a weekend day. Some articles associated with weekends (e.g. articles associated with a variety of hobbies) will show relatively fewer page views on weekdays.

Suppose I wanted to plot a heatmap with colors corresponding to the strength of the weekly periodicity of the pageviews of articles shown in different geographic locations.

(1) Has anyone done anything like this before?

(2) Is sufficient information available from the current logging regime?

Finally, I would also like to ask for review of this summarization, please:
https://www.mediawiki.org/w/index.php?title=Wikimedia_Technology%2FAnnual_Plans%2FFY2019%2FCDP3%3A_Knowledge_Integrity=revision=2762601=2762351

Best regards,
Jim
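To make "strength of the weekly periodicity" concrete, one candidate measure is the lag-7 autocorrelation of daily counts. The sketch below assumes a per-location list of daily pageview counts is available, which is exactly the open question (2) above; the function name `weekly_strength` is hypothetical, and other measures (e.g. spectral power at a 7-day period) would work as well.

```python
from statistics import mean

def weekly_strength(daily_views):
    """Lag-7 autocorrelation of a daily series.

    Values near 1 indicate a strong weekly cycle; values near 0 indicate
    none. A constant series is reported as 0.0.
    """
    n = len(daily_views)
    if n < 14:
        raise ValueError("need at least two weeks of daily counts")
    mu = mean(daily_views)
    dev = [v - mu for v in daily_views]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0
    return sum(dev[i] * dev[i + 7] for i in range(n - 7)) / denom

# Eight weeks of a weekday-heavy article: 100 views Mon-Fri, 60 on weekends.
series = ([100] * 5 + [60] * 2) * 8
print(weekly_strength(series))
```

A heatmap would then color each location by the value returned for that location's series; note that this measure does not distinguish weekday-heavy from weekend-heavy articles, only how regular the cycle is.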
Re: [Analytics] Ingoing and outgoing internal links enquiry
Hi Nick,

I made a Quarry query to do this for you: https://quarry.wmflabs.org/query/25400

You will have to fork it and remove the "LIMIT 10" to get it to run on all the English Wikipedia articles. It may take too long or produce too much data, in which case please ask on this list for someone who can run it for you.

    USE enwiki_p;
    SELECT page_title AS article,
           COUNT(DISTINCT pli.pl_from) AS inlinks,
           COUNT(DISTINCT plo.pl_title) AS outlinks
    FROM page
    JOIN pagelinks AS pli
      ON page.page_title = pli.pl_title
     AND pli.pl_namespace = 0
     AND page.page_namespace = 0
     AND page.page_is_redirect = 0
    JOIN pagelinks AS plo
      ON page.page_id = plo.pl_from
     AND plo.pl_namespace = 0
     AND page.page_namespace = 0
     AND page.page_is_redirect = 0
    GROUP BY article
    LIMIT 10;

Refs.:
https://www.mediawiki.org/wiki/Manual:Pagelinks_table
https://www.mediawiki.org/wiki/Manual:Page_table

> From: Nick Bell
> Subject: [Analytics] Ingoing and outgoing internal links enquiry
>
> Dear Analytics Team,
>
> I’m doing a project on Wikipedia for my Maths degree, and I was hoping you
> could help me acquire some data about Wikipedia.
>
> I would like to get the number of incoming internal links and outgoing
> internal links for every page, if possible. I could limit this if needs be,
> as I am aware this totals around 11 million values.
>
> I have minimal programming experience, so if this is unreasonable or
> impossible please let me know. I very much appreciate your time considering
> my request.
>
> Many thanks,
>
> Nicholas Bell
> Mathematics Undergraduate
> University of Bristol
Re: [Analytics] is there an hourly pageviews API?
>> Hourly pageviews are in
>> /public/dumps/pageviews/$year/$year-$month/pageviews-$year$month$day-[012][0-9].gz
>>
>> Is there an API faster than zgreping those?
>
> https://wikimedia.org/api/rest_v1/#/ :-)

Thanks, Thomas, but that has only daily and monthly granularity for articles.
[Analytics] is there an hourly pageviews API?
Hourly pageviews are in
/public/dumps/pageviews/$year/$year-$month/pageviews-$year$month$day-[012][0-9].gz

Is there an API faster than zgreping those?
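For reference, the zgrep approach looks roughly like this in Python. This is a sketch that assumes each line of an hourly file is space-separated as project code, page title, view count, and byte count; the function names are hypothetical and the caller supplies the file path.

```python
import gzip

def count_views(lines, project, title):
    """Return the view count for one title from an iterable of dump lines."""
    prefix = f"{project} {title} "
    for line in lines:
        if line.startswith(prefix):
            return int(line.split(" ")[2])  # third field is the count
    return 0  # titles with no views in that hour are simply absent

def hourly_views(path, project, title):
    """Scan one gzip-compressed hourly file for a single title."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return count_views(f, project, title)

sample = ["en Main_Page 42 0\n", "en Banana 7 0\n", "de Banana 3 0\n"]
print(count_views(sample, "en", "Banana"))
```

This is still a linear scan of every hourly file, so it is no faster than zgrep; it only makes the per-file lookup scriptable.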
Re: [Analytics] use cases for the raw IP data stored on the webrequest data
Dan,

I missed this reply in November to which you referred:

>> Do the advantages of keeping unanonymized IP reader logs for potential
>> debugging needs outweigh the privacy disadvantages?
>
> Judging from prior postings to this list the community members' interest
> in correctness of pageview data, pageview tools and pageview API far
> outweighs the concerns with a 60 day retention of raw IPs.

Is that the official position of the Foundation? It has been explicitly contradicted by the Executive Director, and is not considered an acceptable practice by the EFF:
https://www.eff.org/pages/eff-ad-wired

or the American Library Association:
http://www.ala.org/advocacy/intfreedom/librarybill/interpretations/privacy
http://www.ala.org/advocacy/library-privacy-guidelines-data-exchange-between-networked-devices-and-services
http://www.ala.org/advocacy/privacyconfidentiality

or this law review article:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006

or news media exposé articles such as:
https://www.washingtonpost.com/news/the-switch/wp/2016/10/11/facebook-twitter-and-instagram-sent-feeds-that-helped-police-track-minorities-in-ferguson-and-baltimore-aclu-says/

>> 3. Can Ops use access logs in which the article names have never been
>> stored on permanent, non-RAM media?
>>
>> 4. Can the users who require logs of article names use those in which
>> the IP address, proxy information, and geolocation has never been
>> stored on permanent media?
>
> The implicit assumption here is that reasonable means are not being taken
> to safeguard user data by Technical Operations

Such measures do not address the subpoena-related concerns of the EFF, the ALA, the law review article, or the news media exposé. Furthermore, page 20 of the following dissertation shows that the reader data leaves the custody of Technical Operations:
http://infolab.stanford.edu/~west1/pubs/West_Dissertation-2016.pdf

That says, "We have access to Wikimedia’s full server logs, containing all HTTP requests to Wikimedia projects." Page 19 indicates that this information includes the "IP address, proxy information, and user agent." See also:
https://youtu.be/jQ0NPhT-fsE=25m40s

> You have also made other technical assumptions, such
> as that one can only use volatile storage to safely store data.

On the contrary, the assumption is that it's safer to not store PII on nonvolatile storage if it can be associated with the names of articles being read.

If a GET web request comes in from a reader, and the article name is stored in one disk file with the time accurate to the hour, and the IP and proxy information with an exact timestamp is stored in another file, would that meet all of the Foundation's and research community's needs?
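To make the proposal in the last paragraph concrete, here is a minimal sketch of the split. The record layout and the name `split_request` are hypothetical; the only point is that the article record carries a timestamp truncated to the hour and no network fields, while the network record carries the exact timestamp and no article name.

```python
from datetime import datetime, timezone

def split_request(ts, ip, proxy, article):
    """Return (article_record, network_record) for one GET request.

    The two records are meant for separate files and share no exact key.
    """
    hour = ts.replace(minute=0, second=0, microsecond=0)
    article_record = {"hour": hour.isoformat(), "article": article}
    network_record = {"timestamp": ts.isoformat(), "ip": ip, "proxy": proxy}
    return article_record, network_record

# 192.0.2.1 is a documentation-only address, used here as a placeholder.
ts = datetime(2017, 1, 19, 16, 16, 42, tzinfo=timezone.utc)
article_rec, network_rec = split_request(ts, "192.0.2.1", "", "Banana")
print(article_rec)
print(network_rec)
```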
Re: [Analytics] use cases for the raw IP data stored on the webrequest data
I've added the following unanswered questions at
https://wikitech.wikimedia.org/wiki/Talk:Analytics/Data/Webrequest/RawIPUsage

1. Is the ability to rerun metrics more important than protecting reader privacy?

2. On what basis is the decision on the previous question made, or if there is no decision on the question yet, who has the authority to establish that basis?

3. Can Ops use access logs in which the article names have never been stored on permanent, non-RAM media?

4. Can the users who require logs of article names use those in which the IP address, proxy information, and geolocation has never been stored on permanent media?
Re: [Analytics] stats.grok.se used in study about Snowden and internet traffic
Here is a commercial malware-scanning proxy all but claiming outright that they can MITM-scan any browser protocol not using QUIC:
http://www.bitdefender.com/support/how-to-disable-quic-protocol-in-google-chrome-1669.html

Security is such a mess these days that I hope you all understand why I keep saying you shouldn't be storing readers' article names associated with any of their IP, proxy, or geolocation information, and should instead separate them as soon as they hit RAM on the ingress proxies.

On Thu, Jan 19, 2017 at 4:16 PM, James Salsman <jsals...@gmail.com> wrote:
>> But we are https-only now, am I missing something?
>
> These authors say that TLS 1.2/ECDHE_RSA/P-256 as used by enwiki
> currently is still within the capability of hobbyists to crack in a
> few days on less than $10,000 of hardware, if I'm reading it right:
> https://hal.inria.fr/hal-01244855/document
>
> QUIC would be a lot better, with X25519 at least. That's what Google
> moved to after that paper was published.
>
>> how do you have that screenshot?
>
> It's linked from the footnote on page 33 of this lawsuit by the
> Foundation and ACLU asking the government to stop monitoring Wikipedia
> traffic:
>
> https://www.aclu.org/sites/default/files/field_document/23._aclu_appeal_brief_2.17.2016.pdf
Re: [Analytics] stats.grok.se used in study about Snowden and internet traffic
> But we are https-only now, am I missing something?

These authors say that TLS 1.2/ECDHE_RSA/P-256 as used by enwiki currently is still within the capability of hobbyists to crack in a few days on less than $10,000 of hardware, if I'm reading it right:
https://hal.inria.fr/hal-01244855/document

QUIC would be a lot better, with X25519 at least. That's what Google moved to after that paper was published.

> how do you have that screenshot?

It's linked from the footnote on page 33 of this lawsuit by the Foundation and ACLU asking the government to stop monitoring Wikipedia traffic:
https://www.aclu.org/sites/default/files/field_document/23._aclu_appeal_brief_2.17.2016.pdf
Re: [Analytics] stats.grok.se used in study about Snowden and internet traffic
> I hope comms figures out a way to counter-act the public
> opinion that Wikipedia traffic is monitored by the government.

Wikipedia is the very first example given by NSA training materials for how to add sites to the XKEYSCORE GUI:
https://assets.documentcloud.org/documents/2116354/pages/xks-for-counter-cne-p9-large.gif
Re: [Analytics] ensuring reader anonymity
Nuria Ruiz wrote:
>
> You can bring that up with ops team, I doubt we can operate a website
> for hundreds of millions of devices (almost a billion) and troubleshoot
> networking issues, DOS and others without having access to raw IPs for a
> short period of time. Ops work doesn't need to have access to IP data long
> term, just near term.

First, I don't know who or where to ask such questions of the Ops team.

Second, would the following be a viable solution to this contingency? Discard IP addresses before storing as the default behavior, with a manual switch that Ops can turn on to store temporary raw logs with IPs for debugging, if and when needed, for a limited time (say an hour or two), with automatic zeroing deletion at the end of that time period.
Re: [Analytics] ensuring reader anonymity
Pine wrote:
>
> I tend to think that checkusers will need the plain IP addresses

I am not suggesting removing the IP addresses or proxy information from POST requests, which checkuser requires.

We need to anonymize both IP addresses and proxy information with a secure hash if we want to keep each GET request's geolocation, to be compliant with the Privacy Policy. The Privacy Policy is the most prominent policy on the far left on the footer of every page served by every editable project, and says explicitly that consent is required for the use of geolocation.

The Privacy and other policies make it clear that POST requests and Visual Editor submissions aren't going to be anonymized. However, geolocations for POST edit and Visual Editor submissions still require explicit consent, which we have no way to obtain at present. Editors' geolocations as they edit are very useful for research, but by the same token raise the most serious privacy concerns. Obtaining consent to store geolocation seems like it would interfere with, complicate, and disrupt editing.

If geolocation is stored with anonymized IP addresses for GETs but not POSTs or Visual Editor submissions, both could easily be recovered, because simultaneously interleaved GET and POST requests for the same article are unavoidable.

Do we have any privacy experts on staff who can give these issues a thorough analysis in light of all the issues raised in https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006 ?

If Ops needs IP addresses, they should be able to use synthetic POST requests, as far as I can tell. If they anticipate a need for non-anonymous GET requests, then perhaps some kind of debugging switch would solve any anticipated need: one that could be used on a short-term basis, where an IP range or mask could be entered to allow matching addresses to be logged non-anonymously before expiring in an hour.
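The debugging switch suggested in the last paragraph could look something like the following. This is only a sketch of the idea, not a description of anything Ops runs: the class name `DebugSwitch`, the one-hour default, and the in-memory state are all assumptions.

```python
import ipaddress
import time

class DebugSwitch:
    """Allow raw-IP logging only for one network, and only until it expires."""

    def __init__(self):
        self._network = None
        self._expires = 0.0

    def enable(self, cidr, ttl_seconds=3600, now=None):
        """Turn on raw logging for addresses inside `cidr` for `ttl_seconds`."""
        now = time.time() if now is None else now
        self._network = ipaddress.ip_network(cidr)
        self._expires = now + ttl_seconds

    def should_log_raw(self, ip, now=None):
        """True only while the switch is live and `ip` matches the mask."""
        now = time.time() if now is None else now
        if self._network is None or now >= self._expires:
            self._network = None  # expired: fall back to the anonymous default
            return False
        return ipaddress.ip_address(ip) in self._network

# 198.51.100.0/24 and 203.0.113.5 are documentation-only addresses.
switch = DebugSwitch()
switch.enable("198.51.100.0/24", ttl_seconds=3600, now=0)
print(switch.should_log_raw("198.51.100.7", now=10))    # inside mask, still live
print(switch.should_log_raw("203.0.113.5", now=10))     # outside mask
print(switch.should_log_raw("198.51.100.7", now=3600))  # expired
```

The `now` parameter exists only so the expiry can be exercised without waiting an hour; every request that does not match falls through to the anonymized path.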
[Analytics] ensuring reader anonymity
Are there any reasons to not replace HTTP GET request IP addresses and proxy information with their SHA-512 secure hash prior to writing them to permanent media?
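For concreteness, the replacement being asked about amounts to something like this sketch, using Python's standard `hashlib`. The function name `anonymize` and the string inputs are hypothetical.

```python
import hashlib

def anonymize(value):
    """Return the hex SHA-512 digest of an IP address or proxy header string."""
    return hashlib.sha512(value.encode("utf-8")).hexdigest()

# 192.0.2.1 is a documentation-only address, used here as a placeholder.
digest = anonymize("192.0.2.1")
print(len(digest), digest[:16])
```

The same input always maps to the same digest, so per-address counting still works on the hashed logs. One caveat worth weighing: the IPv4 space is small enough to enumerate, so a keyed variant (e.g. the standard `hmac` module with a secret that is never written to disk) may be needed for the hash to resist reversal.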
[Analytics] pageview API discrepancy
Why is there such a difference since January 10 on
http://i.imgur.com/rA1yUaH.png
compared to
https://analytics.wmflabs.org/demo/pageview-api/#articles=Hillary_Clinton,Bernie_Sanders=2015-11-01=2015-12-22=enwiki ?

Given the corresponding uptick at
http://traffic.alexa.com/graph?u=http%3A%2F%2Fberniesanders.com=http%3A%2F%2Fhillaryclinton.com=http%3A%2F%2Fdonaldjtrump.com=1=400=220=n=3m=e6f3fc
I am inclined to believe that the earlier version is correct. Has the data been adjusted?
Re: [Analytics] pageview API discrepancy
> Well, the first version looks at December to January, and the second
> at November to December, so it looks like an implementation error.

No, sorry, it was my mistake somehow. I must have reset the calendars back a month. Sorry for the false alarm.
Re: [Analytics] pageview API discrepancy
> do you mean you screenshotted it at a different date?

Yes, January 23. The URL is identical.