Re: [Analytics] Availability of hourly pagecounts files

2020-01-11 Thread James Salsman
That's fascinating, John; thank you. I'm copying this to wiki-research-l and
Fabian Suchanek, who gave the first part of the Research Showcase last month.

What do you like for coding stories? https://quanteda.io/reference/dfm.html ?
Sentiment is hard because errors are often 180 degrees away from correct.

How do you both feel about Soru et al (June 2018) "Neural Machine Translation
for Query Construction and Composition"
https://www.researchgate.net/publication/326030040 ?


On Sat, Jan 11, 2020 at 3:46 PM John Urbanik  wrote:
>
> Jim,
>
> I used to work as the chief data scientist at Collin's company.
>
> I'd suggest looking at things like relationships between the views / edits 
> for sets of pages as well as aggregating large sets of page views for 
> different pages in various ways. There isn't a lot of literature that is 
> directly applicable, and I can't disclose the precise methods being used due 
> to NDA.
>
> In general, much of the pageview data is Weibull or GEV distributed on top of
> being non-stationary, so I'd suggest looking into papers from extreme value 
> theory literature as well as literature around Hawkes/Queue-Hawkes processes. 
> Most traditional ML and signal processing is not very effective without doing 
> some pretty substantial pre-processing, and even then things are pretty 
> messy, depending on what you're trying to predict; most variables are 
> heteroskedastic w.r.t. pageviews and there are a lot of real-world events that
> can cause false positives.
>
> Further, concept drift is pretty rapid in this space and structural breaks 
> happen quite frequently, so the reliability of a given predictor can change 
> extremely rapidly. Understanding how much training data to use for a given 
> prediction problem is itself a super interesting problem since there may be 
> some horizon after which the predictor loses power, but decreasing the 
> horizon too much means overfitting and loss of statistical significance.
>
> Good luck!
>
> John
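
As a rough illustration of one model family John names, here is a minimal sketch of the conditional intensity of a Hawkes (self-exciting) process with an exponential kernel. The kernel choice and the parameter names (mu, alpha, beta) are assumptions made for illustration; nothing here reflects the undisclosed methods mentioned above.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Intensity at time t: baseline mu plus a decaying bump per past event.

    mu    -- baseline event rate (assumed constant)
    alpha -- jump in intensity caused by each past event
    beta  -- exponential decay rate of that jump
    """
    return mu + sum(
        alpha * math.exp(-beta * (t - s)) for s in event_times if s < t
    )


# One burst of three views shortly before t = 1.0 raises the expected rate.
rate = hawkes_intensity(1.0, [0.7, 0.8, 0.9], mu=0.5, alpha=0.8, beta=1.0)
```

Each past event raises the rate, so bursts feed on themselves, which is one way to model the sudden pageview spikes described above.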

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Availability of hourly pagecounts files

2020-01-11 Thread James Salsman
Colin,

Are hourly pageviews working for you again?

Where can we read more about how your company uses them for predictions?

I've been working on this problem for a long time, for scaling news
signals, along with Google Trends (which you have to de-normalize with
multiple overlapping queries because it scales results to always be in
[0,100]). Can you post a bibliography of your favorite two or three
resources from the last five years or so, please?

Best regards,
Jim
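
One possible reading of "de-normalize with multiple overlapping queries" is to chain query windows that share some time points, rescaling each new window so it agrees with the data already merged on the overlap. The sketch below assumes that reading; the function name and the {time: value} input format are hypothetical, and it makes no calls to any Google Trends API.

```python
def stitch(windows):
    """Chain overlapping series that were each rescaled independently.

    `windows` is a list of {time: value} dicts (hypothetical input format),
    ordered so that each one overlaps the span already merged.
    """
    merged = dict(windows[0])
    for win in windows[1:]:
        overlap = [t for t in win if t in merged and win[t] > 0]
        if not overlap:
            raise ValueError("each window must overlap the previous ones")
        # Ratio of totals over the shared points puts `win` on merged's scale.
        factor = sum(merged[t] for t in overlap) / sum(win[t] for t in overlap)
        for t, v in win.items():
            merged.setdefault(t, v * factor)
    return merged


# Two windows over the same underlying series, each scaled to peak at 100.
early = {0: 12.5, 1: 25.0, 2: 50.0, 3: 100.0}
late = {2: 12.5, 3: 25.0, 4: 50.0, 5: 100.0}
combined = stitch([early, late])
```

After stitching, values may exceed 100, but ratios between any two time points are preserved across windows.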


Re: [Analytics] [Wikimedia Research Showcase] June 26, 2019 at 11:30 AM PST, 19:30 UTC

2019-06-27 Thread James Salsman
> For those that couldn't make it, is there a summary of what was said?

Full recording: https://www.youtube.com/watch?v=WiUfpmeJG7E

Slides:

https://www.mediawiki.org/wiki/File:Trajectories_of_Blocked_Community_Members_-_Slides.pdf

https://meta.wikimedia.org/wiki/University_of_Virginia/Automatic_Detection_of_Online_Abuse



[Analytics] please hire a CTO who wants to protect reader privacy

2019-03-23 Thread James Salsman
I noticed just now that the Foundation is soliciting applications for a new CTO:
https://www.linkedin.com/feed/update/urn:li:activity:6515003866130505729

Can we please hire a CTO who would put protecting reader privacy
above the interests of any state or non-state actors, whether or not
they have infiltrated staff, contractor, or NDA signatory ranks, and
whether or not doing so interferes with reader statistics and analytics?

In particular, I would like to repeat my request that we should not be logging
personally identifiable information which might increase our subpoena
burden or result in privacy violation incidents. Fuzzing geolocation
is okay, but we should not be transmitting IP addresses into logs
across even a LAN, for example, and we certainly shouldn't be
purchasing hardware with backdoor coprocessors wasting electricity and
exposing us to government or similar intrusions:
https://lists.wikimedia.org/pipermail/analytics/2017-January/005696.html

Best regards,
Jim



Re: [Analytics] Wiktionary word page views?

2018-10-24 Thread James Salsman
Thank you everyone who helped answer my question. I have a related
question, about the pageviews reported for audio pronunciation files
on Commons by e.g.,
 
https://tools.wmflabs.org/pageviews/?project=commons.wikimedia.org=all-access=user=latest-20=File:En-us-banana.ogg
Are those numbers the authentic number of times that someone clicked
"play" on the audio widget to download the raw audio, or the number of
times they went to see the [[commons:File:En-us-banana.ogg]] page
without necessarily playing the sound?

It's really amazing how much more popular the English Wiktionary has
become over the past two years. After having analyzed a large sample
of the most frequently spoken multisyllabic words, I can say with
confidence that while 2016 saw about as much usage as 2015 on the
English Wiktionary, 2017 showed a 35% increase, and so far it looks
like 2018 will be a 25-30% increase over that. As the very frequent
words in question generally haven't been edited much at all in the
past five to ten years, I'm confused as to why this has happened. Does
anyone have any ideas?

On Tue, Oct 23, 2018 at 3:36 PM James Salsman  wrote:
>
> How can I get pageview statistics for individual words in the English
> Wiktionary?



[Analytics] Wiktionary word page views?

2018-10-23 Thread James Salsman
How can I get pageview statistics for individual words in the English
Wiktionary?



Re: [Analytics] Privacy with GDPR at Stanford May 31, 5:30-7pm

2018-05-09 Thread James Salsman
Note: Nicole A. Ozer, the Technology and Civil Liberties Director at
ACLU of California, has been added to this panel.

Please see also: https://teachprivacy.com/why-i-love-the-gdpr/


On Thu, May 3, 2018 at 10:58 AM, James Salsman <jsals...@gmail.com> wrote:
> [I recommend registering early:
>  http://web.stanford.edu/dept/law/forms/KELF18.fb
> because free Stanford CLE events occasionally fill up and get closed
> by the organizers (or with luck, moved into larger venues.) -Jim]
>
> The Data Privacy Shake-Up
>
> Thursday, May 31, 2018
> Stanford Rock Center for Corporate Governance
> Room 290, Stanford Law School
> 559 Nathan Abbott Way, Stanford, CA 94305
> 5:30-6pm: Reception, 6-7pm: Panel Discussion.
> Free and open to the public.
>
> http://enews.law.stanford.edu/t/r-3E98AC489E9B8A952540EF23F30FEDED
>
> As more data becomes digitized, and consumers increasingly share
> personal and sensitive information with devices and apps, concerns
> around data privacy are taking on greater importance. These concerns
> are fueled by recent high-profile data misuse scandals that have
> thrust the data gathering and privacy policies of companies into the
> spotlight. Join us for a discussion of data privacy on both sides of
> the pond, from the future of data privacy regulation in the US to the
> recent introduction of the GDPR with its extraterritorial effect and
> high maximum penalties, and the ripple effect of trends across the
> globe.
>
> This session, the Fifth Annual Kirkland & Ellis Law Forum, will bring
> together leading in-house attorneys, private practitioners and
> academics to discuss their views on the shifting sands of the data
> privacy landscape.
>
> Panelists:
>
> Michael Callahan
> Senior Vice President / General Counsel, LinkedIn (and incoming
> Executive Director of the Stanford Rock Center for Corporate
> Governance)
>
> Emma L. Flett
> Partner, Kirkland & Ellis
>
> John Lynn
> Partner, Kirkland & Ellis
>
> 1 hour California Continuing Legal Education credit



[Analytics] Privacy with GDPR at Stanford May 31, 5:30-7pm

2018-05-03 Thread James Salsman
[I recommend registering early:
 http://web.stanford.edu/dept/law/forms/KELF18.fb
because free Stanford CLE events occasionally fill up and get closed
by the organizers (or with luck, moved into larger venues.) -Jim]

The Data Privacy Shake-Up

Thursday, May 31, 2018
Stanford Rock Center for Corporate Governance
Room 290, Stanford Law School
559 Nathan Abbott Way, Stanford, CA 94305
5:30-6pm: Reception, 6-7pm: Panel Discussion.
Free and open to the public.

http://enews.law.stanford.edu/t/r-3E98AC489E9B8A952540EF23F30FEDED

As more data becomes digitized, and consumers increasingly share
personal and sensitive information with devices and apps, concerns
around data privacy are taking on greater importance. These concerns
are fueled by recent high-profile data misuse scandals that have
thrust the data gathering and privacy policies of companies into the
spotlight. Join us for a discussion of data privacy on both sides of
the pond, from the future of data privacy regulation in the US to the
recent introduction of the GDPR with its extraterritorial effect and
high maximum penalties, and the ripple effect of trends across the
globe.

This session, the Fifth Annual Kirkland & Ellis Law Forum, will bring
together leading in-house attorneys, private practitioners and
academics to discuss their views on the shifting sands of the data
privacy landscape.

Panelists:

Michael Callahan
Senior Vice President / General Counsel, LinkedIn (and incoming
Executive Director of the Stanford Rock Center for Corporate
Governance)

Emma L. Flett
Partner, Kirkland & Ellis

John Lynn
Partner, Kirkland & Ellis

1 hour California Continuing Legal Education credit



[Analytics] weekly periodicity mode

2018-04-23 Thread James Salsman
[Crossposting to Research and Analytics lists]

Most Wikipedia articles with a weekly periodicity show more pageviews
on a typical weekday than a weekend. Some articles associated with
weekends (e.g. articles associated with a variety of hobbies) will
show relatively fewer page views on weekdays.

Suppose I wanted to plot a heatmap with colors corresponding to the
strength of the weekly periodicity of the pageviews of articles shown
in different geographic locations.

(1) Has anyone done anything like this before?

(2) Is sufficient information available from the current logging regime?

Finally, I would also like to ask for review of this summarization, please:
 
https://www.mediawiki.org/w/index.php?title=Wikimedia_Technology%2FAnnual_Plans%2FFY2019%2FCDP3%3A_Knowledge_Integrity=revision=2762601=2762351

Best regards,
Jim
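
One simple way to put a number on "strength of the weekly periodicity" is the lag-7 autocorrelation of a daily pageview series. This is only a sketch of that one measure: the function name is hypothetical, the choice of autocorrelation (rather than, say, a weekday/weekend ratio or a spectral peak) is an assumption, and per-location input series would still have to come from somewhere.

```python
from statistics import fmean


def weekly_strength(daily_views):
    """Lag-7 autocorrelation of a daily pageview series, in [-1, 1]."""
    n = len(daily_views)
    if n < 14:
        raise ValueError("need at least two weeks of daily counts")
    mean = fmean(daily_views)
    dev = [v - mean for v in daily_views]
    var = sum(d * d for d in dev)
    if var == 0:
        return 0.0  # a flat series has no periodicity to measure
    return sum(dev[i] * dev[i + 7] for i in range(n - 7)) / var


# Four weeks of a weekday-heavy article: 100 views Mon-Fri, 50 on weekends.
strength = weekly_strength(([100] * 5 + [50] * 2) * 4)
```

Values near 1 indicate a strong repeating weekly pattern; values near 0 indicate none. A heatmap could then color each region by this score.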



Re: [Analytics] Ingoing and outgoing internal links enquiry

2018-03-10 Thread James Salsman
Hi Nick,

I made a Quarry query to do this for you: https://quarry.wmflabs.org/query/25400

You will have to fork it and remove the "LIMIT 10" to get it to run on
all the English Wikipedia articles. It may take too long or produce
too much data, in which case please ask on this list for someone who
can run it for you.

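USE enwiki_p;

-- For each article: distinct pages linking to it (inlinks) and distinct
-- article-namespace titles it links to (outlinks).
SELECT page_title AS article,
       COUNT(DISTINCT pli.pl_from) AS inlinks,
       COUNT(DISTINCT plo.pl_title) AS outlinks
FROM page

-- Incoming links: pagelinks rows whose target is this article.
JOIN pagelinks AS pli ON page.page_title = pli.pl_title AND pli.pl_namespace = 0

-- Outgoing links: pagelinks rows whose source is this article.
JOIN pagelinks AS plo ON page.page_id = plo.pl_from AND plo.pl_namespace = 0

-- Articles only (namespace 0), excluding redirects.
WHERE page.page_namespace = 0 AND page.page_is_redirect = 0

GROUP BY article
LIMIT 10;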

Refs.: https://www.mediawiki.org/wiki/Manual:Pagelinks_table
https://www.mediawiki.org/wiki/Manual:Page_table

> From: Nick Bell 
> Subject: [Analytics] Ingoing and outgoing internal links enquiry
>
>  Dear Analytics Team,
>
> I’m doing a project on Wikipedia for my Maths degree, and I was hoping you
> could help me acquire some data about Wikipedia.
>
> I would like to get the number of incoming internal links and outgoing
> internal links for every page, if possible. I could limit this if needs be,
> as I am aware this totals around 11 million values.
>
> I have minimal programming experience, so if this is unreasonable or
> impossible please let me know. I very much appreciate your time considering
> my request.
>
>
>
> Many thanks,
>
>
> Nicholas Bell
>
> Mathematics Undergraduate
>
> University of Bristol



Re: [Analytics] is there an hourly pageviews API?

2018-01-19 Thread James Salsman
>> Hourly pageviews are in
>> /public/dumps/pageviews/$year/$year-$month/pageviews-$year$month$day-[012][0-9].gz
>>
>> Is there an API faster than zgreping those?
>
> https://wikimedia.org/api/rest_v1/#/ :-)

Thanks, Thomas, but that has only daily and monthly granularity for articles.



[Analytics] is there an hourly pageviews API?

2018-01-19 Thread James Salsman
Hourly pageviews are in
/public/dumps/pageviews/$year/$year-$month/pageviews-$year$month$day-[012][0-9].gz

Is there an API faster than zgreping those?
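
For comparison with zgrep, here is a minimal sketch of scanning one hourly file in Python. It assumes each line holds space-separated fields in the order project code, page title, view count, response size; that layout, the function name, and the file name in the usage note are assumptions to check against the actual dumps. It is not an API and is unlikely to be faster than zgrep.

```python
import gzip


def hourly_views(path, project, title):
    """Return the view count for one page in one hourly dump file, or 0."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            fields = line.split(" ")
            # Assumed layout: project, title, view count, response size.
            if len(fields) >= 3 and fields[0] == project and fields[1] == title:
                return int(fields[2])
    return 0
```

A call might look like hourly_views("pageviews-hour.gz", "en", "Main_Page"), where the file name is a placeholder for one of the hourly files under the directory above.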



Re: [Analytics] use cases for the raw IP data stored on the webrequest data

2017-01-29 Thread James Salsman
Dan,

I missed this reply in November to which you referred:

>> Do the advantages of keeping unanonymized IP reader logs for potential
>> debugging needs outweigh the privacy disadvantages?
>
> Judging from prior postings to this list, the community members' interest
> in correctness of pageview data, pageview tools and pageview API far
> outweighs the concerns with a 60 day retention of raw IPs.

Is that the official position of the Foundation? It has been
explicitly contradicted by the Executive Director, and is not
considered an acceptable practice by the EFF:

https://www.eff.org/pages/eff-ad-wired

or the American Library Association:

http://www.ala.org/advocacy/intfreedom/librarybill/interpretations/privacy

http://www.ala.org/advocacy/library-privacy-guidelines-data-exchange-between-networked-devices-and-services

http://www.ala.org/advocacy/privacyconfidentiality

or this law review article:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006

or news media expose articles such as:

https://www.washingtonpost.com/news/the-switch/wp/2016/10/11/facebook-twitter-and-instagram-sent-feeds-that-helped-police-track-minorities-in-ferguson-and-baltimore-aclu-says/

>> 3. Can Ops use access logs in which the article names have never been
>> stored on permanent, non-RAM media?
>>
>> 4. Can the users who require logs of article names use those in which
>> the IP address, proxy information, and geolocation has never been
>> stored on permanent media?
>
> The implicit assumption here is that reasonable means are not being taken
> to safeguard user data by Technical Operations

Such measures do not address the subpoena-related concerns of the EFF,
the ALA, the law review article, or the news media expose.
Furthermore, page 20 of the following shows that reader data leaves the
custody of Technical Operations:

http://infolab.stanford.edu/~west1/pubs/West_Dissertation-2016.pdf

That says, "We have access to Wikimedia’s full server logs, containing
all HTTP requests to Wikimedia projects." Page 19 indicates that this
information includes the "IP address, proxy information, and user agent."
See also:

https://youtu.be/jQ0NPhT-fsE=25m40s

> You have also made other technical assumptions, such
> as that one can only use volatile storage to safely store data.

On the contrary, the assumption is that it's safer to not store PII on
nonvolatile storage if it can be associated with the names of articles
being read.

If a GET web request comes in from a reader, and the article name is
stored in one disk file with the time accurate to the hour, and the IP
and proxy information with an exact timestamp is stored in another
file, would that meet all of the Foundation's and research community's
needs?
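
The split proposed in that question could be sketched as follows. The record layouts, field names, and function name are hypothetical and not taken from any Wikimedia logging code; whether such a split is actually sufficient is the open question being asked.

```python
from datetime import datetime, timezone


def split_request(ts, ip, proxy, article):
    """Split one GET request into two records meant for separate files."""
    # Article record: time truncated to the hour, no network identifiers.
    hour = ts.replace(minute=0, second=0, microsecond=0)
    article_record = {"hour": hour.isoformat(), "article": article}
    # Network record: exact timestamp, no article name.
    network_record = {"timestamp": ts.isoformat(), "ip": ip, "proxy": proxy}
    return article_record, network_record


when = datetime(2017, 1, 29, 14, 37, 12, tzinfo=timezone.utc)
article_rec, network_rec = split_request(when, "192.0.2.7", "", "Banana")
```

With only hour-level time on the article side, a request can be matched to an IP only as far as traffic in that hour is sparse, so low-traffic articles remain the weak point of this design.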



Re: [Analytics] use cases for the raw IP data stored on the webrequest data

2017-01-28 Thread James Salsman
I've added the following unanswered questions at

https://wikitech.wikimedia.org/wiki/Talk:Analytics/Data/Webrequest/RawIPUsage

1. Is the ability to rerun metrics more important than protecting
reader privacy?

2. On what basis is the decision on the previous question made, or if
there is no decision on the question yet, who has the authority to
establish that basis?

3. Can Ops use access logs in which the article names have never been
stored on permanent, non-RAM media?

4. Can the users who require logs of article names use those in which
the IP address, proxy information, and geolocation has never been
stored on permanent media?



Re: [Analytics] stats.grok.se used in study about Snowden and internet traffic

2017-01-19 Thread James Salsman
Here is a commercial malware-scanning proxy all but claiming outright
that they can MITM-scan any browser protocol not using QUIC:

http://www.bitdefender.com/support/how-to-disable-quic-protocol-in-google-chrome-1669.html

Security is such a mess these days that I hope you all understand why
I keep saying you shouldn't be storing readers' article names
associated with any of their IP, proxy, or geolocation, separating
them as soon as they hit RAM on the ingress proxies.


On Thu, Jan 19, 2017 at 4:16 PM, James Salsman <jsals...@gmail.com> wrote:
>> But we are https-only now, am I missing something?
>
> These authors say that TLS 1.2/ECDHE_RSA/P-256 as used by enwiki
> currently is still within the capability of hobbyists to crack in a
> few days on less than $10,000 of hardware, if I'm reading it right:
>  https://hal.inria.fr/hal-01244855/document
>
> QUIC would be a lot better, with X25519 at least. That's what Google
> moved to after that paper was published.
>
>> how do you have that screenshot?
>
> It's linked from the footnote on page 33 of this lawsuit by the
> Foundation and ACLU asking the government to stop monitoring Wikipedia
> traffic:
>
> https://www.aclu.org/sites/default/files/field_document/23._aclu_appeal_brief_2.17.2016.pdf



Re: [Analytics] stats.grok.se used in study about Snowden and internet traffic

2017-01-19 Thread James Salsman
> But we are https-only now, am I missing something?

These authors say that TLS 1.2/ECDHE_RSA/P-256 as used by enwiki
currently is still within the capability of hobbyists to crack in a
few days on less than $10,000 of hardware, if I'm reading it right:
 https://hal.inria.fr/hal-01244855/document

QUIC would be a lot better, with X25519 at least. That's what Google
moved to after that paper was published.

> how do you have that screenshot?

It's linked from the footnote on page 33 of this lawsuit by the
Foundation and ACLU asking the government to stop monitoring Wikipedia
traffic:

https://www.aclu.org/sites/default/files/field_document/23._aclu_appeal_brief_2.17.2016.pdf



Re: [Analytics] stats.grok.se used in study about Snowden and internet traffic

2017-01-19 Thread James Salsman
> I hope comms figures out a way to counter-act the public
> opinion that Wikipedia traffic is monitored by the government.

Wikipedia is the very first example given by NSA training materials
for how to add sites to the XKEYSCORE GUI:

https://assets.documentcloud.org/documents/2116354/pages/xks-for-counter-cne-p9-large.gif



Re: [Analytics] ensuring reader anonymity

2016-11-11 Thread James Salsman
Nuria Ruiz wrote:
>
> You can bring that up with ops team, I doubt we can operate a website
> for hundreds of millions of devices (almost a billion) and troubleshoot
> networking issues, DOS and others without having access to raw IPs for a
> short period of time. Ops work doesn't need to have access to IP data long
> term, just near term.

First, I don't know who or where to ask such questions of the Ops team.

Second, would the following be a viable solution to this contingency:
discard IP addresses before storing by default, with a manual switch that
Ops can turn on to store temporary raw logs with IPs for debugging if and
when needed, for a limited time (say an hour or two), with automatic zeroing
deletion at the end of that period?



Re: [Analytics] ensuring reader anonymity

2016-11-11 Thread James Salsman
Pine wrote:
>
> I tend to think that checkusers will need the plain IP addresses

I am not suggesting removing the IP addresses or proxy information from
POST requests as checkuser requires.

We need to anonymize both IP addresses and proxy information with a secure
hash if we want to keep each GET request's geolocation, to be compliant
with the Privacy Policy. The Privacy Policy is the most prominent policy on
the far left on the footer of every page served by every editable project,
and says explicitly that consent is required for the use of geolocation.
The Privacy and other policies make it clear that POST requests and Visual
Editor submissions aren't going to be anonymized.

However, geolocations for POST edit and visual editor submissions still
require explicit consent which we have no way to obtain at present.
Editors' geolocations as they edit are very useful for research, but by the
same token have the most serious privacy concerns. Obtaining consent to
store geolocation seems like it would interfere with, complicate, and
disrupt editing. If geolocation is stored with anonymized IP addresses for
GETs but not POSTs or Visual Editor submissions, both could easily be
recovered, because simultaneously interleaved GET and POST requests for
the same article are unavoidable.

Do we have any privacy experts on staff who can give these issues a
thorough analysis in light of all the issues raised in
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006 ?

If Ops needs IP addresses, they should be able to use synthetic POST
requests, as far as I can tell. If they anticipate a need for non-anonymous
GET requests, then perhaps some kind of a debugging switch which could be
used on a short term basis where an IP range or mask could be entered to
allow matching addresses to log non-anonymously before expiring in an hour
would solve any anticipated need?
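
The debugging switch described above might look something like this sketch: an IP range is entered, matching addresses are flagged for non-anonymous logging, and the switch turns itself off after an hour. The class name, method name, and injectable clock are all hypothetical.

```python
import ipaddress
import time


class DebugSwitch:
    """Allow raw-IP logging for one address range until a deadline passes."""

    def __init__(self, cidr, ttl_seconds=3600, now=time.time):
        self._network = ipaddress.ip_network(cidr)
        self._now = now  # injectable clock, so expiry can be tested
        self._expires = now() + ttl_seconds

    def should_log_raw(self, ip):
        """True only while the switch is live and `ip` is in the range."""
        if self._now() >= self._expires:
            return False
        return ipaddress.ip_address(ip) in self._network
```

Once the deadline passes, should_log_raw returns False for every address with no further action by Ops, so forgetting to turn the switch off cannot extend the logging window.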


[Analytics] ensuring reader anonymity

2016-11-08 Thread James Salsman
Are there any reasons to not replace HTTP GET request IP addresses and
proxy information with their SHA-512 secure hash prior to writing them
to permanent media?
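
A minimal sketch of the idea follows, with one substitution stated plainly: it uses HMAC-SHA-512 with a secret key rather than a bare SHA-512 digest. The IPv4 address space is small enough that unkeyed hashes can be reversed by hashing every possible address, so a key held outside the logs is a common mitigation. The function name, the "|" separator, and the key handling are assumptions.

```python
import hashlib
import hmac


def anonymize(ip, proxy_info, key):
    """Return a keyed SHA-512 digest standing in for IP and proxy details."""
    msg = f"{ip}|{proxy_info}".encode("utf-8")
    return hmac.new(key, msg, hashlib.sha512).hexdigest()


# Hypothetical key; in practice it would be generated randomly and kept
# out of the logs.
token = anonymize("192.0.2.7", "via 198.51.100.1", key=b"example-secret")
```

The same input and key always give the same token, so per-reader counts still work. Rotating or discarding the key breaks the link between old tokens and addresses.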



[Analytics] pageview API discrepancy

2016-01-25 Thread James Salsman
Why is there such a difference since January 10 on
http://i.imgur.com/rA1yUaH.png
compared to 
https://analytics.wmflabs.org/demo/pageview-api/#articles=Hillary_Clinton,Bernie_Sanders=2015-11-01=2015-12-22=enwiki
?

Given the corresponding uptick at
http://traffic.alexa.com/graph?u=http%3A%2F%2Fberniesanders.com=http%3A%2F%2Fhillaryclinton.com=http%3A%2F%2Fdonaldjtrump.com=1=400=220=n=3m=e6f3fc
I am inclined to believe that the earlier version is correct.

Has the data been adjusted?



Re: [Analytics] pageview API discrepancy

2016-01-25 Thread James Salsman
> Well, the first version looks at December to January, and the second
> at November to December, so it looks like an implementation error.

No, sorry, it was my mistake somehow. I must have reset the calendars
back a month. Sorry for the false alarm.



Re: [Analytics] pageview API discrepancy

2016-01-25 Thread James Salsman
> do you mean you screenshotted it at a different date?

Yes, January 23. The URL is identical.
