> I’m not totally sure if this works for you all, but I had pictured
> generating aggregates from the page preview events, and then joining the
> page preview aggregates with the pageview aggregates into a new table with
> an extra dimension specifying which type of content view was made.
In my opinion, the aggregated data should stay in two different tables. I
can see a future where the preview data comes in different types (it might
include rich media that was or was not played; there are simple popups and
"richer" ones, whatever), and the dimensions along which you represent this
consumption are not going to match pageview_hourly, which, again, only
represents full page loads well.

On Tue, Jan 30, 2018 at 12:02 AM, Andrew Otto <[email protected]> wrote:

> CoOOOl :)
>
> > Using the GeoIP cookie will require reconfiguring the EventLogging
> > varnishkafka instance [0]
>
> I’m not familiar with this cookie, but, if we used it, I thought it would
> be sent back by the client in the event. E.g. event.country =
> response.headers.country; EventLogging.emit(event);
>
> That way, there’s no additional special logic needed on the server side to
> geocode or populate the country in the event.
>
> However, if y’all can’t or don’t want to use the country cookie, then
> yaaa, we gotta figure out what to do about IPs and geocoding in
> EventLogging. There are a few options here, but none of them are great.
> The options are basically variations on ‘treat this event schema as
> special and add special conditionals in the EventLogging processor code’,
> or ‘include the IP and/or geocode all events in all schemas’. We’re not
> sure which we want to do yet, but we did mention this at our offsite
> today. I think we’ll figure this out and make it happen in the next week
> or two. Whatever the implementation ends up being, we’ll get geocoded
> data into this dataset.
>
> > Is the geocoding code that we use on webrequest_raw available as a Hive
> > UDF or in PySpark?
>
> The IP is geocoded from wmf_raw.webrequest to wmf.webrequest using a Hive
> UDF
> <https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-hive/src/main/java/org/wikimedia/analytics/refinery/hive/GetGeoDataUDF.java>,
> which ultimately just calls this getGeocodedData
> <https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Geocode.java#L138>
> function, which itself is just a wrapper around the Maxmind API. We may
> end up doing geocoding in the EventLogging server codebase (again, really
> not sure about this yet…), but if we do, it will use the same Maxmind
> databases.
>
> > Aggregating the EventLogging data in the same way that we aggregate
> > webrequest data into pageviews data will require either: replicating
> > the process that does this and keeping the two processes in sync; or
> > abstracting away the source table from the aggregation process so that
> > it can work on both tables
>
> I’m not totally sure if this works for you all, but I had pictured
> generating aggregates from the page preview events, and then joining the
> page preview aggregates with the pageview aggregates into a new table
> with an extra dimension specifying which type of content view was made.
>
> > I’d appreciate it if someone could estimate how much work it will be to
> > implement GeoIP information and the other fields from Pageview hourly
> > for EventLogging events
>
> Ya we gotta figure this out still, but actual implementation shouldn’t be
> difficult, however we decide to do it.
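For concreteness, a minimal PySpark sketch of that "extra dimension" table
could look like the following. Every table and column name here is
hypothetical; none of these schemas exist yet:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("content-views").getOrCreate()

    # Hourly aggregates from the raw preview events (hypothetical table
    # and columns).
    previews = (
        spark.table("event.page_previews")
        .groupBy("project", "country_code", "year", "month", "day", "hour")
        .agg(F.count("*").alias("view_count"))
        .withColumn("view_type", F.lit("preview"))
    )

    # The existing pageview aggregates, reduced to the same dimensions.
    pageviews = (
        spark.table("wmf.pageview_hourly")
        .groupBy("project", "country_code", "year", "month", "day", "hour")
        .agg(F.sum("view_count").alias("view_count"))
        .withColumn("view_type", F.lit("pageview"))
    )

    # One table, with view_type as the extra dimension.
    content_views = previews.unionByName(pageviews)
    content_views.write.mode("overwrite").saveAsTable("wmf.content_view_hourly")

Note that the union only works for dimensions the two sources share;
anything preview-specific (say, whether rich media was played) has no
counterpart in pageview_hourly, which is exactly the mismatch described
above.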
> On Mon, Jan 29, 2018 at 10:30 PM, Sam Smith <[email protected]> wrote:
>
>> Hullo all,
>>
>> It seems like we've arrived at an implementation for the client-side
>> (JS) part of this problem: use EventLogging to track a page interaction
>> from within the Page Previews code. This'll give us the flexibility to
>> take advantage of a stream processing solution if/when it becomes
>> available, to push the definition of a "Page Previews page interaction"
>> to the client, and to rely on any events that we log in the immediate
>> future ending up in tables that we're already familiar with.
>>
>> In principle, I agree with Andrew's argument that adding additional
>> filtering logic to the webrequest refinement process will make it harder
>> to change existing definitions of views or add others in future. In
>> practice, though, we'll need to:
>>
>> - Ensure that the server-side EventLogging component records metadata
>>   consistent with our existing content consumption measurement,
>>   concretely: the fields available in the
>>   https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly
>>   table. In particular, that it either doesn't discard the client IP or
>>   utilizes the GeoIP cookie sent by the client for this schema.
>> - Aggregate the resulting table so that it can be combined with the
>>   pageviews table to generate reports.
>> - Ensure that the events aren't recorded in MySQL.
>>
>> Using the GeoIP cookie will require reconfiguring the EventLogging
>> varnishkafka instance [0], and raises questions about compatibility with
>> the corresponding field in the pageviews data. Retaining the client IP
>> will require a similar change but will also require that we share the
>> geocoding code with whatever process we use to refine the data that
>> we’re capturing via EventLogging. Is the geocoding code that we use on
>> webrequest_raw available as a Hive UDF or in PySpark?
>>
>> Aggregating the EventLogging data in the same way that we aggregate
>> webrequest data into pageviews data will require either: replicating the
>> process that does this and keeping the two processes in sync; or
>> abstracting away the source table from the aggregation process so that
>> it can work on both tables. We’ll have to maintain the chosen approach
>> until it’s superseded by a stream processing solution, the timeline of
>> which is currently measured in years.
>>
>> My next steps are to make sure that Audiences Product's requirements are
>> all visible and to work with Tilman Bayer to create a schema that's
>> suitable for our purposes but hopefully useful to others. Nuria has also
>> offered to give a technical overview of EventLogging, which I think
>> would be a great resource for everyone, so I'll look into setting up a
>> meeting.
>> I'd appreciate it if someone could estimate how much work it will be to
>> implement GeoIP information and the other fields from Pageview hourly
>> for EventLogging events on a per-schema basis.
>>
>> Thanks,
>>
>> -Sam
>>
>> [0] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/manifests/cache/kafka/eventlogging.pp;52da8d06c760cd4e31b068d1a0392e3b3889033c$37
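To make the two geocoding options above concrete (the client's GeoIP cookie
versus a server-side Maxmind lookup), here is a minimal Python sketch of
what such a step in the EventLogging processor could look like. The
geocode_event helper, the assumed cookie format, the database path, and the
event field names are all illustrative assumptions, not the actual
EventLogging API; the Hive UDF linked above wraps these same Maxmind
databases from Java.

    import geoip2.database
    import geoip2.errors

    # Path is an assumption; any Maxmind GeoIP2 City database works here.
    READER = geoip2.database.Reader("/usr/share/GeoIP/GeoIP2-City.mmdb")

    def geocode_event(event, client_ip, geoip_cookie=None):
        """Attach a country code to the event; never store the raw IP."""
        if geoip_cookie:
            # Assumed cookie shape: ISO country code as the first
            # colon-separated field, e.g. "US:CA:San Francisco:...".
            event["country"] = geoip_cookie.split(":", 1)[0]
            return event
        try:
            event["country"] = READER.city(client_ip).country.iso_code
        except geoip2.errors.AddressNotFoundError:
            event["country"] = "Unknown"
        return event

Either way, the raw client IP is used only for the lookup and is not
persisted with the event.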
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
