> I’m not totally sure if this works for you all, but I had pictured
> generating aggregates from the page preview events, and then joining the
> page preview aggregates with the pageview aggregates into a new table with
> an extra dimension specifying which type of content view was made.
In my opinion, the aggregated data should stay in two different tables. I
can see a future where the preview data comes in different types (it might
include rich media that was or was not played; there are simple popups and
"richer" ones, whatever), and the dimensions along which you represent this
consumption are not going to match pageview_hourly, which, again, only
represents full page loads well.

On Tue, Jan 30, 2018 at 12:02 AM, Andrew Otto <[email protected]> wrote:

> CoOOOl :)
>
> > Using the GeoIP cookie will require reconfiguring the EventLogging
> > varnishkafka instance [0]
>
> I’m not familiar with this cookie, but, if we used it, I thought it would
> be sent back by the client in the event. E.g. event.country =
> response.headers.country; EventLogging.emit(event);
>
> That way, there’s no additional special logic needed on the server side to
> geocode or populate the country in the event.
>
> However, if y’all can’t or don’t want to use the country cookie, then
> yaaa, we gotta figure out what to do about IPs and geocoding in
> EventLogging. There are a few options here, but none of them are great.
> The options are basically variations on ‘treat this event schema as
> special and add special conditionals in the EventLogging processor code’,
> or ‘include the IP and/or geocode all events in all schemas’. We’re not
> sure which we want to do yet, but we did mention this at our offsite
> today. I think we’ll figure this out and make it happen in the next week
> or two. Whatever the implementation ends up being, we’ll get geocoded
> data into this dataset.
>
> > Is the geocoding code that we use on webrequest_raw available as a Hive
> > UDF or in PySpark?
>
> The IP is geocoded from wmf_raw.webrequest to wmf.webrequest using a Hive
> UDF
> <https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-hive/src/main/java/org/wikimedia/analytics/refinery/hive/GetGeoDataUDF.java>,
> which ultimately just calls this getGeocodedData
> <https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Geocode.java#L138>
> function, which itself is just a wrapper around the Maxmind API. We may
> end up doing geocoding in the EventLogging server codebase (again, really
> not sure about this yet…), but if we do, it will use the same Maxmind
> databases.
>
> > Aggregating the EventLogging data in the same way that we aggregate
> > webrequest data into pageviews data will require either: replicating
> > the process that does this and keeping the two processes in sync; or
> > abstracting away the source table from the aggregation process so that
> > it can work on both tables
>
> I’m not totally sure if this works for you all, but I had pictured
> generating aggregates from the page preview events, and then joining the
> page preview aggregates with the pageview aggregates into a new table
> with an extra dimension specifying which type of content view was made.
>
> > I’d appreciate it if someone could estimate how much work it will be to
> > implement GeoIP information and the other fields from Pageview hourly
> > for EventLogging events
>
> Ya we gotta figure this out still, but actual implementation shouldn’t be
> difficult, however we decide to do it.
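For concreteness, a minimal PySpark sketch of that "extra dimension" table
could look like the following. Every table and column name here is
hypothetical; none of these schemas exist yet:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("content-views").getOrCreate()

    # Hourly aggregates from the raw preview events (hypothetical table
    # and columns).
    previews = (
        spark.table("event.page_previews")
        .groupBy("project", "country_code", "year", "month", "day", "hour")
        .agg(F.count("*").alias("view_count"))
        .withColumn("view_type", F.lit("preview"))
    )

    # The existing pageview aggregates, reduced to the same dimensions.
    pageviews = (
        spark.table("wmf.pageview_hourly")
        .groupBy("project", "country_code", "year", "month", "day", "hour")
        .agg(F.sum("view_count").alias("view_count"))
        .withColumn("view_type", F.lit("pageview"))
    )

    # One table, with view_type as the extra dimension.
    content_views = previews.unionByName(pageviews)
    content_views.write.mode("overwrite").saveAsTable("wmf.content_view_hourly")

Note that the union only works for dimensions the two sources share;
anything preview-specific (say, whether rich media was played) has no
counterpart in pageview_hourly, which is exactly the mismatch described
above.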
> On Mon, Jan 29, 2018 at 10:30 PM, Sam Smith <[email protected]> wrote:
>
>> Hullo all,
>>
>> It seems like we've arrived at an implementation for the client-side
>> (JS) part of this problem: use EventLogging to track a page interaction
>> from within the Page Previews code. This'll give us the flexibility to
>> take advantage of a stream processing solution if/when it becomes
>> available, to push the definition of a "Page Previews page interaction"
>> to the client, and to rely on any events that we log in the immediate
>> future ending up in tables that we're already familiar with.
>>
>> In principle, I agree with Andrew's argument that adding additional
>> filtering logic to the webrequest refinement process will make it harder
>> to change existing definitions of views or add others in future. In
>> practice, though, we'll need to:
>>
>> - Ensure that the server-side EventLogging component records metadata
>>   consistent with our existing content consumption measurement,
>>   concretely: the fields available in the
>>   https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly
>>   table. In particular, that it either doesn't discard the client IP or
>>   utilizes the GeoIP cookie sent by the client for this schema.
>> - Aggregate the resulting table so that it can be combined with the
>>   pageviews table to generate reports.
>> - Ensure that the events aren't recorded in MySQL.
>>
>> Using the GeoIP cookie will require reconfiguring the EventLogging
>> varnishkafka instance [0], and raises questions about compatibility with
>> the corresponding field in the pageviews data. Retaining the client IP
>> will require a similar change but will also require that we share the
>> geocoding code with whatever process we use to refine the data that
>> we’re capturing via EventLogging. Is the geocoding code that we use on
>> webrequest_raw available as a Hive UDF or in PySpark?
>>
>> Aggregating the EventLogging data in the same way that we aggregate
>> webrequest data into pageviews data will require either: replicating the
>> process that does this and keeping the two processes in sync; or
>> abstracting away the source table from the aggregation process so that
>> it can work on both tables. We’ll have to maintain the chosen approach
>> until it’s superseded by a stream processing solution, the timeline of
>> which is currently measured in years.
>>
>> My next steps are to make sure that Audiences Product's requirements are
>> all visible and to work with Tilman Bayer to create a schema that's
>> suitable for our purposes but hopefully useful to others. Nuria has also
>> offered to give a technical overview of EventLogging, which I think
>> would be a great resource for everyone, so I'll look into setting up a
>> meeting.
>> I'd appreciate it if someone could estimate how much work it will be to
>> implement GeoIP information and the other fields from Pageview hourly
>> for EventLogging events on a per-schema basis.
>>
>> Thanks,
>>
>> -Sam
>>
>> [0] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/manifests/cache/kafka/eventlogging.pp;52da8d06c760cd4e31b068d1a0392e3b3889033c$37
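To make the two geocoding options above concrete (the client's GeoIP cookie
versus a server-side Maxmind lookup), here is a minimal Python sketch of
what such a step in the EventLogging processor could look like. The
geocode_event helper, the assumed cookie format, the database path, and the
event field names are all illustrative assumptions, not the actual
EventLogging API; the Hive UDF linked above wraps these same Maxmind
databases from Java.

    import geoip2.database
    import geoip2.errors

    # Path is an assumption; any Maxmind GeoIP2 City database works here.
    READER = geoip2.database.Reader("/usr/share/GeoIP/GeoIP2-City.mmdb")

    def geocode_event(event, client_ip, geoip_cookie=None):
        """Attach a country code to the event; never store the raw IP."""
        if geoip_cookie:
            # Assumed cookie shape: ISO country code as the first
            # colon-separated field, e.g. "US:CA:San Francisco:...".
            event["country"] = geoip_cookie.split(":", 1)[0]
            return event
        try:
            event["country"] = READER.city(client_ip).country.iso_code
        except geoip2.errors.AddressNotFoundError:
            event["country"] = "Unknown"
        return event

Either way, the raw client IP is used only for the lookup and is not
persisted with the event.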
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
