> If I were to venture into writing a changeset for this (made into a task:
> https://phabricator.wikimedia.org/T87177 ), is everything self-contained
> in the EventLogging extension

For the proposed solution of sending events to Kafka/Hadoop, the answer is that there is work to do in the EL extension, in puppet, and likely in refinery, since you would need to create a partition where your data would go. I think a meeting is in order to get a concrete idea of what we want to do.

On Mon, Jan 19, 2015 at 5:44 AM, Gilles Dubuc <[email protected]> wrote:

> If I were to venture into writing a changeset for this (made into a task:
> https://phabricator.wikimedia.org/T87177 ), is everything self-contained
> in the EventLogging extension or are there external parts involved in the
> current pipeline sending events to the DB in production that I need to be
> aware of?
>
> On Fri, Jan 9, 2015 at 8:40 AM, Gilles Dubuc <[email protected]> wrote:
>
>>> I think Gilles and Erik want to calculate page views for GLAM mainly
>>> (although there are some other good reasons too) -- sampling would probably
>>> be ok but we'd miss the long tail of views.
>>
>> That's correct. We're looking to compile media view counts as accurate as
>> the ones we have for article views at the moment. Sampling would be fine to
>> identify the X most viewed media across a wiki, but it definitely wouldn't
>> help small GLAMs who want to get that information about their own
>> collection, if their media happen to be "low traffic" in the grand scheme
>> of things. I think that the latter is the main use case for doing this,
>> which is why I'm looking for a solution that wouldn't involve sampling.
>>
>> Compiling the top list has entertainment value, letting GLAM contributors
>> get accurate statistics about their content improves the chances that they
>> will keep contributing more. I think that's more valuable than the
>> entertainment factor of the top list.
>>
>> On Wed, Jan 7, 2015 at 8:02 PM, Toby Negrin <[email protected]>
>> wrote:
>>
>>> I think Gilles and Erik want to calculate page views for GLAM mainly
>>> (although there are some other good reasons too) -- sampling would probably
>>> be ok but we'd miss the long tail of views.
>>>
>>> On Wed, Jan 7, 2015 at 10:56 AM, Nuria Ruiz <[email protected]> wrote:
>>>
>>>> I see. My main point was that -regardless of collection method- we
>>>> might not need every single data point to calculate uniques.
>>>>
>>>> On Wed, Jan 7, 2015 at 10:38 AM, Toby Negrin <[email protected]>
>>>> wrote:
>>>>
>>>>> Yes -- we disabled it because there wasn't a use case. We have one now
>>>>> :)
>>>>>
>>>>> On Wed, Jan 7, 2015 at 10:32 AM, Nuria Ruiz <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> > I believe there is already an EL-Kafka pipeline and this would
>>>>>> > make it easy to integrate page views with our regular processing.
>>>>>>
>>>>>> Note that the pipeline was disabled 6 months ago and thus my comment
>>>>>> "in the near term"
>>>>>>
>>>>>> https://github.com/wikimedia/operations-puppet/commit/f85b1dbcd61bbb58684ff93704c1804e808a5d6e
>>>>>>
>>>>>> On Wed, Jan 7, 2015 at 9:39 AM, Toby Negrin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I'd also like us to consider routing this dataset to hadoop. I
>>>>>>> believe there is already an EL-Kafka pipeline and this would make it
>>>>>>> easy to integrate page views with our regular processing.
>>>>>>>
>>>>>>> Gilles -- are mobile page views included in your stream?
>>>>>>>
>>>>>>> -Toby
>>>>>>>
>>>>>>> On Wed, Jan 7, 2015 at 9:27 AM, Nuria Ruiz <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> > Great, then I guess it's a matter of only making the data go to
>>>>>>>> > files and not to DB for the particular schema we'll create. Does that
>>>>>>>> > sound like something feasible? How much work would be required to set
>>>>>>>> > it up?
>>>>>>>>
>>>>>>>> I do not think this is feasible on the near term w/o changes in our
>>>>>>>> end. I also am not sure it is really needed. You are concern about
>>>>>>>> sending stuff to db due to "volume", correct? I do not understand why
>>>>>>>> logging every single data point would be needed. Maybe you can explain
>>>>>>>> that with a bit more detail for us to grasp the use case?
>>>>>>>>
>>>>>>>> If it is a matter of identifying distinct requests that can be done
>>>>>>>> having sampled your dataset if it is large enough, we can help with
>>>>>>>> that and leila just put together some docs on this regard, while this
>>>>>>>> is for hive queries principles can apply elsewhere:
>>>>>>>> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Counting_uniques
>>>>>>>>
>>>>>>>> On Wed, Jan 7, 2015 at 6:42 AM, Gilles Dubuc <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>> Right -- couldn't we just tag the URL?
>>>>>>>>>
>>>>>>>>> The event of the user actually viewing the image is completely
>>>>>>>>> disconnected from the URL hit in Media Viewer, which is why we need
>>>>>>>>> EL and can't rely on existing server logs.
>>>>>>>>>
>>>>>>>>>> Eventlogging data currently does go to files, as well as to the
>>>>>>>>>> DB.
>>>>>>>>>
>>>>>>>>> Great, then I guess it's a matter of only making the data go to
>>>>>>>>> files and not to DB for the particular schema we'll create. Does that
>>>>>>>>> sound like something feasible? How much work would be required to set
>>>>>>>>> it up?
>>>>>>>>>
>>>>>>>>> On Tue, Jan 6, 2015 at 7:45 PM, Andrew Otto <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Eventlogging data currently does go to files, as well as to the
>>>>>>>>>> DB. Check it out on stat1003 at /srv/eventlogging/archive.
>>>>>>>>>>
>>>>>>>>>> If you need something with higher throughput then eventlogging
>>>>>>>>>> itself supports…then let’s talk :D
>>>>>>>>>>
>>>>>>>>>> -Ao
>>>>>>>>>>
>>>>>>>>>> On Jan 6, 2015, at 13:28, Erik Zachte <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> You mean attach an X-analytics parameter, for extra images beyond
>>>>>>>>>> the one the user initially requested.
>>>>>>>>>>
>>>>>>>>>> But then we would undercount, basically missing all image views
>>>>>>>>>> from clicking right arrow in image viewer.
>>>>>>>>>> I'm not sure how much we would miss then.
>>>>>>>>>> iirc Gilles said this browsing feature was used quite a long, but
>>>>>>>>>> I'm not sure.
>>>>>>>>>>
>>>>>>>>>> *From:* [email protected]
>>>>>>>>>> [mailto:[email protected] <[email protected]>]
>>>>>>>>>> *On Behalf Of *Toby Negrin
>>>>>>>>>> *Sent:* Tuesday, January 06, 2015 19:16
>>>>>>>>>> *To:* A mailing list for the Analytics Team at WMF and everybody
>>>>>>>>>> who has an interest in Wikipedia and analytics.
>>>>>>>>>> *Subject:* Re: [Analytics] Making EventLogging output to a log
>>>>>>>>>> file instead of the DB
>>>>>>>>>>
>>>>>>>>>> Right -- couldn't we just tag the URL?
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 6, 2015 at 10:10 AM, Erik Zachte <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Just to clarify, this is about prefetched images which have not
>>>>>>>>>> been shown to the public.
>>>>>>>>>>
>>>>>>>>>> They were sent to the browser ahead of a possible request to
>>>>>>>>>> speed things up but in many cases never actually requested.
>>>>>>>>>>
>>>>>>>>>> https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Prefetched_images
>>>>>>>>>>
>>>>>>>>>> - Erik
>>>>>>>>>>
>>>>>>>>>> *From:* [email protected]
>>>>>>>>>> [mailto:[email protected]] *On Behalf Of *Toby Negrin
>>>>>>>>>> *Sent:* Tuesday, January 06, 2015 18:49
>>>>>>>>>> *To:* A mailing list for the Analytics Team at WMF and everybody
>>>>>>>>>> who has an interest in Wikipedia and analytics.
>>>>>>>>>> *Subject:* Re: [Analytics] Making EventLogging output to a log
>>>>>>>>>> file instead of the DB
>>>>>>>>>>
>>>>>>>>>> Hi Gilles -- why won't the page view logs work by themselves for
>>>>>>>>>> this purpose?
>>>>>>>>>> EL can be configured to write into Hadoop which is probably
>>>>>>>>>> the best way to get the throughput you need but it seems
>>>>>>>>>> overcomplicated.
>>>>>>>>>>
>>>>>>>>>> -Toby
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 6, 2015 at 9:41 AM, Gilles Dubuc <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> This depends on [1] so we're not going to need that immediately,
>>>>>>>>>> but in order to help Erik Zachte with his RfC [2] to track unique
>>>>>>>>>> media views in Media Viewer, I'm going to need to use something
>>>>>>>>>> almost exactly like EventLogging. The main difference being that it
>>>>>>>>>> should skip writing to the database and write to a log file instead.
>>>>>>>>>>
>>>>>>>>>> That's because we'll be recording around 20-25M image views per
>>>>>>>>>> day, which would needlessly overload EventLogging for little purpose
>>>>>>>>>> since the data will be used for offline stats generation and doesn't
>>>>>>>>>> need to be made available in a relational database. Of course if
>>>>>>>>>> storage space and EventLogging capacity were no object, we could just
>>>>>>>>>> use EL and keep the ever-growing table forever, but I have the
>>>>>>>>>> impression that we want to be reasonable here and only write to a
>>>>>>>>>> log, since that's what Erik needs.
>>>>>>>>>>
>>>>>>>>>> So here's the question: for a specific schema, can EventLogging
>>>>>>>>>> work the way it does but only record hits to a log file (maybe it
>>>>>>>>>> already does that before hitting the DB?) and not write to the DB?
>>>>>>>>>> If not, how difficult would it be to make EL capable of doing that?
>>>>>>>>>>
>>>>>>>>>> [1] https://phabricator.wikimedia.org/T44815
>>>>>>>>>> [2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
