I think Gilles and Erik want to calculate page views for GLAM mainly (although there are some other good reasons too) -- sampling would probably be ok but we'd miss the long tail of views.

On Wed, Jan 7, 2015 at 10:56 AM, Nuria Ruiz <[email protected]> wrote:

> I see. My main point was that -regardless of collection method- we might
> not need every single data point to calculate uniques.
>
> On Wed, Jan 7, 2015 at 10:38 AM, Toby Negrin <[email protected]>
> wrote:
>
>> Yes -- we disabled it because there wasn't a use case. We have one now :)
>>
>> On Wed, Jan 7, 2015 at 10:32 AM, Nuria Ruiz <[email protected]> wrote:
>>
>>> > I believe there is already an EL-Kafka pipeline and this would make
>>> > it easy to integrate page views with our regular processing.
>>>
>>> Note that the pipeline was disabled 6 months ago and thus my comment "in
>>> the near term"
>>>
>>> https://github.com/wikimedia/operations-puppet/commit/f85b1dbcd61bbb58684ff93704c1804e808a5d6e
>>>
>>> On Wed, Jan 7, 2015 at 9:39 AM, Toby Negrin <[email protected]>
>>> wrote:
>>>
>>>> I'd also like us to consider routing this dataset to hadoop. I believe
>>>> there is already an EL-Kafka pipeline and this would make it easy to
>>>> integrate page views with our regular processing.
>>>>
>>>> Gilles -- are mobile page views included in your stream?
>>>>
>>>> -Toby
>>>>
>>>> On Wed, Jan 7, 2015 at 9:27 AM, Nuria Ruiz <[email protected]> wrote:
>>>>
>>>>> > Great, then I guess it's a matter of only making the data go to
>>>>> > files and not to DB for the particular schema we'll create. Does that
>>>>> > sound like something feasible? How much work would be required to set
>>>>> > it up?
>>>>>
>>>>> I do not think this is feasible in the near term without changes on our
>>>>> end. I am also not sure it is really needed. You are concerned about
>>>>> sending stuff to the DB due to "volume", correct? I do not understand why
>>>>> logging every single data point would be needed. Maybe you can explain
>>>>> that with a bit more detail for us to grasp the use case?
>>>>>
>>>>> If it is a matter of identifying distinct requests, that can be done
>>>>> on a sample of your dataset if it is large enough. We can help with that,
>>>>> and Leila just put together some docs on this. They are written for Hive
>>>>> queries, but the principles can apply elsewhere:
>>>>> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Counting_uniques
>>>>>
>>>>> On Wed, Jan 7, 2015 at 6:42 AM, Gilles Dubuc <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>> Right -- couldn't we just tag the URL?
>>>>>>
>>>>>> The event of the user actually viewing the image is completely
>>>>>> disconnected from the URL hit in Media Viewer, which is why we need EL
>>>>>> and can't rely on existing server logs.
>>>>>>
>>>>>>> Eventlogging data currently does go to files, as well as to the DB.
>>>>>>
>>>>>> Great, then I guess it's a matter of only making the data go to files
>>>>>> and not to DB for the particular schema we'll create. Does that sound
>>>>>> like something feasible? How much work would be required to set it up?
>>>>>>
>>>>>> On Tue, Jan 6, 2015 at 7:45 PM, Andrew Otto <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Eventlogging data currently does go to files, as well as to the DB.
>>>>>>> Check it out on stat1003 at /srv/eventlogging/archive.
>>>>>>>
>>>>>>> If you need something with higher throughput than eventlogging
>>>>>>> itself supports…then let’s talk :D
>>>>>>>
>>>>>>> -Ao
>>>>>>>
>>>>>>> On Jan 6, 2015, at 13:28, Erik Zachte <[email protected]> wrote:
>>>>>>>
>>>>>>> You mean attach an X-analytics parameter, for extra images beyond
>>>>>>> the one the user initially requested.
>>>>>>>
>>>>>>> But then we would undercount, basically missing all image views from
>>>>>>> clicking right arrow in image viewer.
>>>>>>> I'm not sure how much we would miss then.
>>>>>>> iirc Gilles said this browsing feature was used quite a lot, but
>>>>>>> I'm not sure.
>>>>>>>
>>>>>>> *From:* [email protected]
>>>>>>> [mailto:[email protected] <[email protected]>]
>>>>>>> *On Behalf Of *Toby Negrin
>>>>>>> *Sent:* Tuesday, January 06, 2015 19:16
>>>>>>> *To:* A mailing list for the Analytics Team at WMF and everybody
>>>>>>> who has an interest in Wikipedia and analytics.
>>>>>>> *Subject:* Re: [Analytics] Making EventLogging output to a log file
>>>>>>> instead of the DB
>>>>>>>
>>>>>>> Right -- couldn't we just tag the URL?
>>>>>>>
>>>>>>> On Tue, Jan 6, 2015 at 10:10 AM, Erik Zachte <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Just to clarify, this is about prefetched images which have not been
>>>>>>> shown to the public.
>>>>>>>
>>>>>>> They were sent to the browser ahead of a possible request to speed
>>>>>>> things up but in many cases never actually requested.
>>>>>>>
>>>>>>> https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Prefetched_images
>>>>>>>
>>>>>>> - Erik
>>>>>>>
>>>>>>> *From:* [email protected]
>>>>>>> [mailto:[email protected]] *On Behalf Of *Toby Negrin
>>>>>>> *Sent:* Tuesday, January 06, 2015 18:49
>>>>>>> *To:* A mailing list for the Analytics Team at WMF and everybody
>>>>>>> who has an interest in Wikipedia and analytics.
>>>>>>> *Subject:* Re: [Analytics] Making EventLogging output to a log file
>>>>>>> instead of the DB
>>>>>>>
>>>>>>> Hi Gilles -- why won't the page view logs work by themselves for
>>>>>>> this purpose? EL can be configured to write into Hadoop which is
>>>>>>> probably the best way to get the throughput you need but it seems
>>>>>>> overcomplicated.
>>>>>>>
>>>>>>> -Toby
>>>>>>>
>>>>>>> On Tue, Jan 6, 2015 at 9:41 AM, Gilles Dubuc <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> This depends on [1] so we're not going to need that immediately, but
>>>>>>> in order to help Erik Zachte with his RfC [2] to track unique media
>>>>>>> views in Media Viewer, I'm going to need to use something almost exactly
>>>>>>> like EventLogging. The main difference being that it should skip writing
>>>>>>> to the database and write to a log file instead.
>>>>>>>
>>>>>>> That's because we'll be recording around 20-25M image views per day,
>>>>>>> which would needlessly overload EventLogging for little purpose since
>>>>>>> the data will be used for offline stats generation and doesn't need to
>>>>>>> be made available in a relational database. Of course if storage space
>>>>>>> and EventLogging capacity were no object, we could just use EL and keep
>>>>>>> the ever-growing table forever, but I have the impression that we want
>>>>>>> to be reasonable here and only write to a log, since that's what Erik
>>>>>>> needs.
>>>>>>>
>>>>>>> So here's the question: for a specific schema, can EventLogging work
>>>>>>> the way it does but only record hits to a log file (maybe it already
>>>>>>> does that before hitting the DB?) and not write to the DB? If not, how
>>>>>>> difficult would it be to make EL capable of doing that?
>>>>>>>
>>>>>>> [1] https://phabricator.wikimedia.org/T44815
>>>>>>> [2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Analytics mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
