I'd also like us to consider routing this dataset to Hadoop. I believe there is already an EL-Kafka pipeline, and this would make it easy to integrate page views with our regular processing.

Gilles -- are mobile page views included in your stream?

-Toby

On Wed, Jan 7, 2015 at 9:27 AM, Nuria Ruiz <[email protected]> wrote:

> > Great, then I guess it's a matter of only making the data go to files
> > and not to DB for the particular schema we'll create. Does that sound like
> > something feasible? How much work would be required to set it up?
>
> I do not think this is feasible in the near term w/o changes on our end. I
> also am not sure it is really needed. You are concerned about sending stuff
> to db due to "volume", correct? I do not understand why logging every
> single data point would be needed. Maybe you can explain that with a bit
> more detail for us to grasp the use case?
>
> If it is a matter of identifying distinct requests that can be done having
> sampled your dataset if it is large enough, we can help with that and leila
> just put together some docs in this regard, while this is for hive queries
> principles can apply elsewhere:
> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Counting_uniques
>
> On Wed, Jan 7, 2015 at 6:42 AM, Gilles Dubuc <[email protected]> wrote:
>
>>> Right -- couldn't we just tag the URL?
>>
>> The event of the user actually viewing the image is completely
>> disconnected from the URL hit in Media Viewer, which is why we need EL and
>> can't rely on existing server logs.
>>
>>> Eventlogging data currently does go to files, as well as to the DB.
>>
>> Great, then I guess it's a matter of only making the data go to files and
>> not to DB for the particular schema we'll create. Does that sound like
>> something feasible? How much work would be required to set it up?
>>
>> On Tue, Jan 6, 2015 at 7:45 PM, Andrew Otto <[email protected]> wrote:
>>
>>> Eventlogging data currently does go to files, as well as to the DB.
>>> Check it out on stat1003 at /srv/eventlogging/archive.
>>>
>>> If you need something with higher throughput than eventlogging itself
>>> supports…then let’s talk :D
>>>
>>> -Ao
>>>
>>> On Jan 6, 2015, at 13:28, Erik Zachte <[email protected]> wrote:
>>>
>>> You mean attach an X-analytics parameter, for extra images beyond the
>>> one the user initially requested.
>>>
>>> But then we would undercount, basically missing all image views from
>>> clicking right arrow in image viewer.
>>> I'm not sure how much we would miss then.
>>> iirc Gilles said this browsing feature was used quite a lot, but I'm
>>> not sure.
>>>
>>> *From:* [email protected] [mailto:[email protected]]
>>> *On Behalf Of* Toby Negrin
>>> *Sent:* Tuesday, January 06, 2015 19:16
>>> *To:* A mailing list for the Analytics Team at WMF and everybody who
>>> has an interest in Wikipedia and analytics.
>>> *Subject:* Re: [Analytics] Making EventLogging output to a log file
>>> instead of the DB
>>>
>>> Right -- couldn't we just tag the URL?
>>>
>>> On Tue, Jan 6, 2015 at 10:10 AM, Erik Zachte <[email protected]>
>>> wrote:
>>>
>>> Just to clarify, this is about prefetched images which have not been
>>> shown to the public.
>>>
>>> They were sent to the browser ahead of a possible request to speed
>>> things up but in many cases never actually requested.
>>>
>>> https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Prefetched_images
>>>
>>> - Erik
>>>
>>> *From:* [email protected] [mailto:[email protected]]
>>> *On Behalf Of* Toby Negrin
>>> *Sent:* Tuesday, January 06, 2015 18:49
>>> *To:* A mailing list for the Analytics Team at WMF and everybody who
>>> has an interest in Wikipedia and analytics.
>>> *Subject:* Re: [Analytics] Making EventLogging output to a log file
>>> instead of the DB
>>>
>>> Hi Gilles -- why won't the page view logs work by themselves for this
>>> purpose? EL can be configured to write into Hadoop which is probably the
>>> best way to get the throughput you need but it seems overcomplicated.
>>>
>>> -Toby
>>>
>>> On Tue, Jan 6, 2015 at 9:41 AM, Gilles Dubuc <[email protected]>
>>> wrote:
>>>
>>> This depends on [1] so we're not going to need that immediately, but in
>>> order to help Erik Zachte with his RfC [2] to track unique media views in
>>> Media Viewer, I'm going to need to use something almost exactly like
>>> EventLogging. The main difference being that it should skip writing to the
>>> database and write to a log file instead.
>>>
>>> That's because we'll be recording around 20-25M image views per day,
>>> which would needlessly overload EventLogging for little purpose since the
>>> data will be used for offline stats generation and doesn't need to be made
>>> available in a relational database. Of course if storage space and
>>> EventLogging capacity were no object, we could just use EL and keep the
>>> ever-growing table forever, but I have the impression that we want to be
>>> reasonable here and only write to a log, since that's what Erik needs.
>>>
>>> So here's the question: for a specific schema, can EventLogging work the
>>> way it does but only record hits to a log file (maybe it already does that
>>> before hitting the DB?) and not write to the DB? If not, how difficult
>>> would it be to make EL capable of doing that?
>>>
>>> [1] https://phabricator.wikimedia.org/T44815
>>> [2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
