Re: [Analytics] A new landing page for the Wikimedia Research team

2018-02-07 Thread Federico Leva (Nemo)

Will it be translatable with standard tools?

Federico

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Andrew Otto
Can we keep further discussion on the phablet thread?


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Nuria Ruiz
> Regarding the last few posts about the geolocation information, from the
> data analysis perspective, there is indeed another, more serious concern
> about using the GeoIP cookie: It will create significant discrepancies
> with the existing geolocation data we record for pageviews, where we have
> chosen to derive this information from the IP instead

How did you come to the conclusion that the data will differ?

The GeoIP cookie is inferred from your IP just the same, right?
https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/geoip.inc.vcl.erb#L10
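
For readers unfamiliar with the cookie discussed in this thread, here is a minimal sketch of pulling the country out of a GeoIP cookie before copying it into an event. The value layout (colon-separated fields with the country code first, e.g. `US:CA:San_Francisco:37.77:-122.42:v4`) and the function name `country_from_geoip_cookie` are assumptions made for illustration; check the VCL linked above for the actual format.

```python
from typing import Optional


def country_from_geoip_cookie(cookie_header: str) -> Optional[str]:
    """Return the country code from a Cookie header, or None if unavailable.

    Assumes (hypothetically) that the GeoIP cookie value is a colon-separated
    string whose first field is the country code.
    """
    for part in cookie_header.split(";"):
        name, _, value = part.strip().partition("=")
        if name == "GeoIP" and value:
            country = value.split(":", 1)[0]
            # An empty first field means the country could not be determined.
            return country or None
    # Clients that do not accept cookies end up here.
    return None
```

The `None` branch matters for the discussion: a client that sends no cookie yields no country at all, whereas an IP-based lookup does not depend on the client's cookie settings.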




On Wed, Feb 7, 2018 at 9:09 AM, Tilman Bayer  wrote:

> Thanks everyone! Separate from Sam's mapping out the frontend
> instrumentation work at https://phabricator.wikimedia.org/T184793 , I
> have created a task for the backend work at
> https://phabricator.wikimedia.org/T186728 based on this thread.
>
> Regarding the last few posts about the geolocation information, from the
> data analysis perspective, there is indeed another, more serious concern
> about using the GeoIP cookie: It will create significant discrepancies with
> the existing geolocation data we record for pageviews, where we have chosen
> to derive this information from the IP instead. (Remember the overarching
> goal here of measuring page previews the same way we measure page views
> currently; the basic principle is that if a reader visits a page and then
> uses the page preview feature on that page to read preview cards, all the
> metadata that is recorded for both should have identical values for both
> the preview and the pageview.) Therefore, we should go with the kind of
> solution Andrew outlined above (adapting/reusing GetGeoDataUDF or such).
>
> On Thu, Feb 1, 2018 at 7:36 AM, Andrew Otto  wrote:
>
>> Wow Sam, yeah, if this cookie works for you, it will make many things
>> much easier for us.  Check it out and let us know.  If it doesn’t work for
>> some reason, we can figure out the backend geocoding part.
>>
>>
>>
>> On Thu, Feb 1, 2018 at 2:43 AM, Sam Smith  wrote:
>>
>>> On Tue, Jan 30, 2018 at 8:02 AM, Andrew Otto  wrote:
>>>
>>>> > Using the GeoIP cookie will require reconfiguring the EventLogging
>>>> > varnishkafka instance [0]
>>>>
>>>> I’m not familiar with this cookie, but, if we used it, I thought it
>>>> would be sent back by the client in the event. E.g. event.country =
>>>> response.headers.country; EventLogging.emit(event);
>>>>
>>>> That way, there’s no additional special logic needed on the server side
>>>> to geocode or populate the country in the event.
>>>>
>>>
>>> Hah! I didn't think about accessing the GeoIP cookie on the client. As
>>> you say, the implementation is quite easy.
>>>
>>> My only concern with this approach is the duplication of the value
>>> between the cookie, which is sent in every HTTP request to the
>>> /beacon/event endpoint, and the event itself. This duplication seems
>>> reasonable when balanced against capturing either: the client IP and then
>>> doing similar geocoding further along in the pipeline; or the cookie for
>>> all requests to that endpoint and then discarding them further along in the
>>> pipeline. It also reflects a seemingly core principle of the EventLogging
>>> system: that it doesn't capture potentially PII by default.
>>>
>>> -Sam
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Andrew Otto
Gonna paste your reply on the ticket and respond there.



On Wed, Feb 7, 2018 at 1:29 PM, Tilman Bayer  wrote:

> On Wed, Feb 7, 2018 at 9:19 AM, Andrew Otto  wrote:
> >> It will create significant discrepancies with the existing geolocation
> >> data we record for pageviews
> > If you only need country (or whatever is in the cookie), then likely
> > whatever the output dataset is would only include country when selecting
> > from pageviews.  If you need more than country (it sounded like you
> > didn’t), then we can get into doing the IP Geocoding in EventLogging, but
> > there are a few technical complications here, and we’d prefer not to
> > have to do this if we don’t have to.
>
> As mentioned repeatedly in this thread (see e.g. Sam's Jan 29 email),
> the goal is to record metadata consistent with our existing
> content consumption measurement, concretely: the fields available in
> the pageview_hourly table. See
> https://phabricator.wikimedia.org/T186728 for details (also regarding
> other fields that are not in EL by default but are likewise generated
> in a standard fashion for webrequest/pageview data).
>
> I appreciate it will need a bit of engineering work to implement your
> proposal of reusing the existing UDF that underlies the pageview data
> for the new preview data. But it will serve to avoid a lot of data
> limitations and headaches for years to come. To highlight just one
> aspect: If we relied on the cookie, the data would be inconsistent
> from the start because not all clients accept cookies. When we want to
> know (say) the ratio of previews to pageviews in a particular country,
> we don't want to have to embark on a research project estimating the
> number of cookie-less pageviews in that country. And so on.
>
>
> >
> > On Wed, Feb 7, 2018 at 12:09 PM, Tilman Bayer  wrote:
> >>
> >> Thanks everyone! Separate from Sam's mapping out the frontend
> >> instrumentation work at https://phabricator.wikimedia.org/T184793 , I
> >> have created a task for the backend work at
> >> https://phabricator.wikimedia.org/T186728 based on this thread.
> >>
> >> Regarding the last few posts about the geolocation information, from the
> >> data analysis perspective, there is indeed another, more serious concern
> >> about using the GeoIP cookie: It will create significant discrepancies
> >> with the existing geolocation data we record for pageviews, where we have
> >> chosen to derive this information from the IP instead. (Remember the
> >> overarching goal here of measuring page previews the same way we measure
> >> page views currently; the basic principle is that if a reader visits a
> >> page and then uses the page preview feature on that page to read preview
> >> cards, all the metadata that is recorded for both should have identical
> >> values for both the preview and the pageview.) Therefore, we should go
> >> with the kind of solution Andrew outlined above (adapting/reusing
> >> GetGeoDataUDF or such).
> >>
> >> On Thu, Feb 1, 2018 at 7:36 AM, Andrew Otto  wrote:
> >>>
> >>> Wow Sam, yeah, if this cookie works for you, it will make many things
> >>> much easier for us.  Check it out and let us know.  If it doesn’t work
> >>> for some reason, we can figure out the backend geocoding part.
> >>>
> >>>
> >>>
> >>> On Thu, Feb 1, 2018 at 2:43 AM, Sam Smith  wrote:
> >>>>
> >>>> On Tue, Jan 30, 2018 at 8:02 AM, Andrew Otto  wrote:
> >>>>>
> >>>>> > Using the GeoIP cookie will require reconfiguring the EventLogging
> >>>>> > varnishkafka instance [0]
> >>>>>
> >>>>> I’m not familiar with this cookie, but, if we used it, I thought it
> >>>>> would be sent back by the client in the event. E.g. event.country =
> >>>>> response.headers.country; EventLogging.emit(event);
> >>>>>
> >>>>> That way, there’s no additional special logic needed on the server
> >>>>> side to geocode or populate the country in the event.
> >>>>
> >>>>
> >>>> Hah! I didn't think about accessing the GeoIP cookie on the client. As
> >>>> you say, the implementation is quite easy.
> >>>>
> >>>> My only concern with this approach is the duplication of the value
> >>>> between the cookie, which is sent in every HTTP request to the
> >>>> /beacon/event endpoint, and the event itself. This duplication seems
> >>>> reasonable when balanced against capturing either: the client IP and
> >>>> then doing similar geocoding further along in the pipeline; or the
> >>>> cookie for all requests to that endpoint and then discarding them
> >>>> further along in the pipeline. It also reflects a seemingly core
> >>>> principle of the EventLogging system: that it doesn't capture
> >>>> potentially PII by default.
> >>>>
> >>>> -Sam

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Tilman Bayer
On Wed, Feb 7, 2018 at 9:19 AM, Andrew Otto  wrote:
>> It will create significant discrepancies with the existing geolocation
>> data we record for pageviews
> If you only need country (or whatever is in the cookie), then likely
> whatever the output dataset is would only include country when selecting
> from pageviews.  If you need more than country (it sounded like you didn’t),
> then we can get into doing the IP Geocoding in EventLogging, but there are
> a few technical complications here, and we’d prefer not to have to do this if
> we don’t have to.

As mentioned repeatedly in this thread (see e.g. Sam's Jan 29 email),
the goal is to record metadata consistent with our existing
content consumption measurement, concretely: the fields available in
the pageview_hourly table. See
https://phabricator.wikimedia.org/T186728 for details (also regarding
other fields that are not in EL by default but are likewise generated
in a standard fashion for webrequest/pageview data).

I appreciate it will need a bit of engineering work to implement your
proposal of reusing the existing UDF that underlies the pageview data
for the new preview data. But it will serve to avoid a lot of data
limitations and headaches for years to come. To highlight just one
aspect: If we relied on the cookie, the data would be inconsistent
from the start because not all clients accept cookies. When we want to
know (say) the ratio of previews to pageviews in a particular country,
we don't want to have to embark on a research project estimating the
number of cookie-less pageviews in that country. And so on.
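
To make the consistency argument concrete, here is a small sketch with made-up numbers. The sample lists and the helper name `previews_per_pageview` are hypothetical and only illustrate how a cookie-derived country can skew a per-country ratio when some clients send no cookie.

```python
from collections import Counter


def previews_per_pageview(pageview_countries, preview_countries):
    """Ratio of previews to pageviews per country.

    A country of None stands for "unknown" and is left out of the result.
    """
    pageviews = Counter(pageview_countries)
    previews = Counter(preview_countries)
    return {
        country: previews[country] / count
        for country, count in pageviews.items()
        if country is not None
    }


# Hypothetical sample: four pageviews from DE, each followed by one preview.
pageviews = ["DE", "DE", "DE", "DE"]        # country derived from the IP
previews_ip = ["DE", "DE", "DE", "DE"]      # previews geocoded the same way
previews_cookie = ["DE", "DE", None, None]  # two clients sent no cookie

consistent = previews_per_pageview(pageviews, previews_ip)
skewed = previews_per_pageview(pageviews, previews_cookie)
```

With identical derivation on both sides the ratio for DE is 1.0; with the cookie-based previews it drops to 0.5, even though reader behavior is the same in both cases.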


>
> On Wed, Feb 7, 2018 at 12:09 PM, Tilman Bayer  wrote:
>>
>> Thanks everyone! Separate from Sam's mapping out the frontend
>> instrumentation work at https://phabricator.wikimedia.org/T184793 , I have
>> created a task for the backend work at
>> https://phabricator.wikimedia.org/T186728 based on this thread.
>>
>> Regarding the last few posts about the geolocation information, from the
>> data analysis perspective, there is indeed another, more serious concern
>> about using the GeoIP cookie: It will create significant discrepancies with
>> the existing geolocation data we record for pageviews, where we have chosen
>> to derive this information from the IP instead. (Remember the overarching
>> goal here of measuring page previews the same way we measure page views
>> currently; the basic principle is that if a reader visits a page and then
>> uses the page preview feature on that page to read preview cards, all the
>> metadata that is recorded for both should have identical values for both the
>> preview and the pageview.) Therefore, we should go with the kind of solution
>> Andrew outlined above (adapting/reusing GetGeoDataUDF or such).
>>
>> On Thu, Feb 1, 2018 at 7:36 AM, Andrew Otto  wrote:
>>>
>>> Wow Sam, yeah, if this cookie works for you, it will make many things
>>> much easier for us.  Check it out and let us know.  If it doesn’t work for
>>> some reason, we can figure out the backend geocoding part.
>>>
>>>
>>>
>>> On Thu, Feb 1, 2018 at 2:43 AM, Sam Smith  wrote:

>>>> On Tue, Jan 30, 2018 at 8:02 AM, Andrew Otto  wrote:
>>>>>
>>>>> > Using the GeoIP cookie will require reconfiguring the EventLogging
>>>>> > varnishkafka instance [0]
>>>>>
>>>>> I’m not familiar with this cookie, but, if we used it, I thought it
>>>>> would be sent back by the client in the event. E.g. event.country =
>>>>> response.headers.country; EventLogging.emit(event);
>>>>>
>>>>> That way, there’s no additional special logic needed on the server side
>>>>> to geocode or populate the country in the event.
>>>>
>>>>
>>>> Hah! I didn't think about accessing the GeoIP cookie on the client. As
>>>> you say, the implementation is quite easy.
>>>>
>>>> My only concern with this approach is the duplication of the value
>>>> between the cookie, which is sent in every HTTP request to the
>>>> /beacon/event endpoint, and the event itself. This duplication seems
>>>> reasonable when balanced against capturing either: the client IP and then
>>>> doing similar geocoding further along in the pipeline; or the cookie for
>>>> all requests to that endpoint and then discarding them further along in
>>>> the pipeline. It also reflects a seemingly core principle of the
>>>> EventLogging system: that it doesn't capture potentially PII by default.
>>>>
>>>> -Sam
>> --
>> Tilman Bayer
>> Senior Analyst
>> Wikimedia Foundation
>> IRC (Freenode): HaeB
>>
>> 

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Andrew Otto
> It will create significant discrepancies with the existing geolocation
> data we record for pageviews

If you only need country (or whatever is in the cookie), then likely
whatever the output dataset is would only include country when selecting
from pageviews.  If you need more than country (it sounded like you
didn’t), then we can get into doing the IP Geocoding in EventLogging, but
there are a few technical complications here, and we’d prefer not to have to
do this if we don’t have to.

On Wed, Feb 7, 2018 at 12:09 PM, Tilman Bayer  wrote:

> Thanks everyone! Separate from Sam's mapping out the frontend
> instrumentation work at https://phabricator.wikimedia.org/T184793 , I
> have created a task for the backend work at
> https://phabricator.wikimedia.org/T186728 based on this thread.
>
> Regarding the last few posts about the geolocation information, from the
> data analysis perspective, there is indeed another, more serious concern
> about using the GeoIP cookie: It will create significant discrepancies with
> the existing geolocation data we record for pageviews, where we have chosen
> to derive this information from the IP instead. (Remember the overarching
> goal here of measuring page previews the same way we measure page views
> currently; the basic principle is that if a reader visits a page and then
> uses the page preview feature on that page to read preview cards, all the
> metadata that is recorded for both should have identical values for both
> the preview and the pageview.) Therefore, we should go with the kind of
> solution Andrew outlined above (adapting/reusing GetGeoDataUDF or such).
>
> On Thu, Feb 1, 2018 at 7:36 AM, Andrew Otto  wrote:
>
>> Wow Sam, yeah, if this cookie works for you, it will make many things
>> much easier for us.  Check it out and let us know.  If it doesn’t work for
>> some reason, we can figure out the backend geocoding part.
>>
>>
>>
>> On Thu, Feb 1, 2018 at 2:43 AM, Sam Smith  wrote:
>>
>>> On Tue, Jan 30, 2018 at 8:02 AM, Andrew Otto  wrote:
>>>
>>>> > Using the GeoIP cookie will require reconfiguring the EventLogging
>>>> > varnishkafka instance [0]
>>>>
>>>> I’m not familiar with this cookie, but, if we used it, I thought it
>>>> would be sent back by the client in the event. E.g. event.country =
>>>> response.headers.country; EventLogging.emit(event);
>>>>
>>>> That way, there’s no additional special logic needed on the server side
>>>> to geocode or populate the country in the event.
>>>>
>>>
>>> Hah! I didn't think about accessing the GeoIP cookie on the client. As
>>> you say, the implementation is quite easy.
>>>
>>> My only concern with this approach is the duplication of the value
>>> between the cookie, which is sent in every HTTP request to the
>>> /beacon/event endpoint, and the event itself. This duplication seems
>>> reasonable when balanced against capturing either: the client IP and then
>>> doing similar geocoding further along in the pipeline; or the cookie for
>>> all requests to that endpoint and then discarding them further along in the
>>> pipeline. It also reflects a seemingly core principle of the EventLogging
>>> system: that it doesn't capture potentially PII by default.
>>>
>>> -Sam
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Tilman Bayer
Thanks everyone! Separate from Sam's mapping out the frontend
instrumentation work at https://phabricator.wikimedia.org/T184793 , I have
created a task for the backend work at
https://phabricator.wikimedia.org/T186728 based on this thread.

Regarding the last few posts about the geolocation information, from the
data analysis perspective, there is indeed another, more serious concern
about using the GeoIP cookie: It will create significant discrepancies with
the existing geolocation data we record for pageviews, where we have chosen
to derive this information from the IP instead. (Remember the overarching
goal here of measuring page previews the same way we measure page views
currently; the basic principle is that if a reader visits a page and then
uses the page preview feature on that page to read preview cards, all the
metadata that is recorded for both should have identical values for both
the preview and the pageview.) Therefore, we should go with the kind of
solution Andrew outlined above (adapting/reusing GetGeoDataUDF or such).

On Thu, Feb 1, 2018 at 7:36 AM, Andrew Otto  wrote:

> Wow Sam, yeah, if this cookie works for you, it will make many things much
> easier for us.  Check it out and let us know.  If it doesn’t work for some
> reason, we can figure out the backend geocoding part.
>
>
>
> On Thu, Feb 1, 2018 at 2:43 AM, Sam Smith  wrote:
>
>> On Tue, Jan 30, 2018 at 8:02 AM, Andrew Otto  wrote:
>>
>>> > Using the GeoIP cookie will require reconfiguring the EventLogging
>>> varnishkafka instance [0]
>>>
>>> I’m not familiar with this cookie, but, if we used it, I thought it
>>> would be sent back by the client in the event. E.g. event.country =
>>> response.headers.country; EventLogging.emit(event);
>>>
>>> That way, there’s no additional special logic needed on the server side
>>> to geocode or populate the country in the event.
>>>
>>
>> Hah! I didn't think about accessing the GeoIP cookie on the client. As
>> you say, the implementation is quite easy.
>>
>> My only concern with this approach is the duplication of the value
>> between the cookie, which is sent in every HTTP request to the
>> /beacon/event endpoint, and the event itself. This duplication seems
>> reasonable when balanced against capturing either: the client IP and then
>> doing similar geocoding further along in the pipeline; or the cookie for
>> all requests to that endpoint and then discarding them further along in the
>> pipeline. It also reflects a seemingly core principle of the EventLogging
>> system: that it doesn't capture potentially PII by default.
>>
>> -Sam


-- 
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB


Re: [Analytics] Wikimedia pageviews API slow to update

2018-02-07 Thread Dan Andreescu
Hi Collin,

Indeed usually the processing gets fresh data into the API within a few
hours.  However, sometimes, and especially at the beginning of a month, we
have lots of jobs running in parallel and that slows things down a bit.  Up
to 24 hours of delay would be unusual but nothing too concerning.  If the
data is delayed more than 24 hours, then definitely let us know, there
might be something broken.
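
As a rough sketch of the rule of thumb above, a data consumer could classify how late a given day's data is before writing to the list. The 3-hour usual lag comes from the original report and the 24-hour threshold from this reply; measuring the delay from the usual availability time, the function name `delay_status`, and the status strings are assumptions made for illustration.

```python
from datetime import date, datetime, timedelta, timezone

USUAL_LAG = timedelta(hours=3)      # data is normally available by 03:00 UTC
REPORT_AFTER = timedelta(hours=24)  # beyond this, something may be broken


def delay_status(data_day: date, now: datetime) -> str:
    """Classify how late the pageview data for `data_day` is at `now` (UTC)."""
    # Midnight UTC at the end of the day whose data we are waiting for.
    day_end = datetime(
        data_day.year, data_day.month, data_day.day, tzinfo=timezone.utc
    ) + timedelta(days=1)
    # Delay is measured past the usual availability time (an assumption).
    delay = now - day_end - USUAL_LAG
    if delay <= timedelta(0):
        return "on schedule"
    if delay <= REPORT_AFTER:
        return "delayed, not concerning"
    return "worth reporting"
```

For example, `delay_status(date(2018, 1, 31), datetime(2018, 2, 1, 16, 0, tzinfo=timezone.utc))` falls in the middle category: late, but within the 24-hour window.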

Thanks for the note, and being a good consumer of our data :)

On Thu, Feb 1, 2018 at 11:35 AM, Collin Stedman  wrote:

> Hello,
>
> The pageviews API seems to have been slow to write data for 1/31/2018. It
> looks like the data has become available in the past hour, but it's
> normally accessible within 3 hours after midnight UTC. Does anybody know
> what caused the slowdown, and if we should expect it to continue?
>
> Thank you very much,
>
> -CS


[Analytics] Wikimedia pageviews API slow to update

2018-02-07 Thread Collin Stedman
Hello,

The pageviews API seems to have been slow to write data for 1/31/2018. It
looks like the data has become available in the past hour, but it's
normally accessible within 3 hours after midnight UTC. Does anybody know
what caused the slowdown, and if we should expect it to continue?

Thank you very much,

-CS


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Sam Smith
Just a quick update: I've captured details from this discussion and the
background in https://phabricator.wikimedia.org/T184793. I'd sure
appreciate your feedback.

-Sam