I think Gilles and Erik want to calculate page views for GLAM mainly
(although there are some other good reasons too) -- sampling would probably
be ok but we'd miss the long tail of views.
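
To illustrate the long-tail concern, here is a minimal Python sketch. It assumes a Zipf-like popularity distribution over a hypothetical set of media files and a hypothetical 1:100 sampling rate; none of these figures come from this thread. It compares exact per-file view counts with what a sampled log would contain.

```python
import random
from collections import Counter


def simulate_sampling(num_files=10_000, total_views=200_000,
                      sample_rate=100, seed=0):
    """Compare exact per-file view counts with a 1-in-`sample_rate` sample.

    All parameters are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)

    # Assumed Zipf-like popularity: the file at rank k has weight 1 / (k + 1).
    weights = [1.0 / (rank + 1) for rank in range(num_files)]
    views = rng.choices(range(num_files), weights=weights, k=total_views)

    # Exact counts, as if every view had been logged.
    exact = Counter(views)

    # Sampled counts: keep each view independently with probability 1/sample_rate.
    keep_probability = 1.0 / sample_rate
    sampled = Counter(v for v in views if rng.random() < keep_probability)

    files_viewed = len(exact)
    files_seen_in_sample = len(sampled)
    return {
        "files_viewed": files_viewed,
        "files_seen_in_sample": files_seen_in_sample,
        # Files with at least one real view but no view in the sample.
        "files_missed": files_viewed - files_seen_in_sample,
        # Scaling the sample back up estimates the aggregate total.
        "estimated_total_views": sum(sampled.values()) * sample_rate,
    }


if __name__ == "__main__":
    for key, value in simulate_sampling().items():
        print(f"{key}: {value}")
```

Under these assumptions the scaled-up total stays close to the true total, while many rarely viewed files never appear in the sample at all. That is the trade-off described above: sampling is fine for aggregate counts but loses per-file counts in the long tail.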

On Wed, Jan 7, 2015 at 10:56 AM, Nuria Ruiz <[email protected]> wrote:

> I see. My main point was that -regardless of collection method- we might
> not need every single data point to calculate uniques.
>
> On Wed, Jan 7, 2015 at 10:38 AM, Toby Negrin <[email protected]>
> wrote:
>
>> Yes -- we disabled it because there wasn't a use case. We have one now :)
>>
>> On Wed, Jan 7, 2015 at 10:32 AM, Nuria Ruiz <[email protected]> wrote:
>>
>>> > I believe there is already an EL-Kafka pipeline and this would make
>>> it easy to integrate page views with our regular processing.
>>>
>>> Note that the pipeline was disabled 6 months ago (hence my comment "in
>>> the near term"):
>>>
>>> https://github.com/wikimedia/operations-puppet/commit/f85b1dbcd61bbb58684ff93704c1804e808a5d6e
>>>
>>> On Wed, Jan 7, 2015 at 9:39 AM, Toby Negrin <[email protected]>
>>> wrote:
>>>
>>>> I'd also like us to consider routing this dataset to hadoop. I believe
>>>> there is already an EL-Kafka pipeline and this would make it easy to
>>>> integrate page views with our regular processing.
>>>>
>>>> Gilles -- are mobile page views included in your stream?
>>>>
>>>> -Toby
>>>>
>>>> On Wed, Jan 7, 2015 at 9:27 AM, Nuria Ruiz <[email protected]> wrote:
>>>>
>>>>> > Great, then I guess it's a matter of only making the data go to
>>>>> > files and not to DB for the particular schema we'll create. Does that
>>>>> > sound like something feasible? How much work would be required to set
>>>>> > it up?
>>>>>
>>>>> I do not think this is feasible in the near term without changes on
>>>>> our end. I am also not sure it is really needed. You are concerned
>>>>> about sending data to the DB due to "volume", correct? I do not
>>>>> understand why logging every single data point would be needed. Could
>>>>> you explain the use case in a bit more detail so we can grasp it?
>>>>>
>>>>> If it is a matter of identifying distinct requests, that can be done
>>>>> on a sample of your dataset, provided it is large enough. We can help
>>>>> with that, and Leila just put together some docs on the subject.
>>>>> They are written for Hive queries, but the principles apply elsewhere:
>>>>> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Counting_uniques
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jan 7, 2015 at 6:42 AM, Gilles Dubuc <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>> Right -- couldn't we just tag the URL?
>>>>>>>
>>>>>>
>>>>>> The event of the user actually viewing the image is completely
>>>>>> disconnected from the URL hit in Media Viewer, which is why we need
>>>>>> EL and can't rely on existing server logs.
>>>>>>
>>>>>>
>>>>>>> Eventlogging data currently does go to files, as well as to the DB.
>>>>>>>
>>>>>>
>>>>>> Great, then I guess it's a matter of only making the data go to files
>>>>>> and not to DB for the particular schema we'll create. Does that sound
>>>>>> like something feasible? How much work would be required to set it up?
>>>>>>
>>>>>> On Tue, Jan 6, 2015 at 7:45 PM, Andrew Otto <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Eventlogging data currently does go to files, as well as to the DB.
>>>>>>> Check it out on stat1003 at /srv/eventlogging/archive.
>>>>>>>
>>>>>>> If you need something with higher throughput than eventlogging
>>>>>>> itself supports…then let’s talk :D
>>>>>>>
>>>>>>> -Ao
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Jan 6, 2015, at 13:28, Erik Zachte <[email protected]> wrote:
>>>>>>>
>>>>>>> You mean attach an X-analytics parameter for extra images beyond
>>>>>>> the one the user initially requested?
>>>>>>>
>>>>>>> But then we would undercount, basically missing all image views from
>>>>>>> clicking the right arrow in the image viewer.
>>>>>>> I'm not sure how much we would miss then.
>>>>>>> iirc Gilles said this browsing feature was used quite a lot, but
>>>>>>> I'm not sure.
>>>>>>>
>>>>>>>
>>>>>>> *From:* [email protected] [
>>>>>>> mailto:[email protected]
>>>>>>> <[email protected]>] *On Behalf Of *Toby Negrin
>>>>>>> *Sent:* Tuesday, January 06, 2015 19:16
>>>>>>> *To:* A mailing list for the Analytics Team at WMF and everybody
>>>>>>> who has an interest in Wikipedia and analytics.
>>>>>>> *Subject:* Re: [Analytics] Making EventLogging output to a log file
>>>>>>> instead of the DB
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Right -- couldn't we just tag the URL?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 6, 2015 at 10:10 AM, Erik Zachte <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Just to clarify, this is about prefetched images which have not been
>>>>>>> shown to the public.
>>>>>>>
>>>>>>> They were sent to the browser ahead of a possible request to speed
>>>>>>> things up but in many cases never actually requested.
>>>>>>>
>>>>>>>
>>>>>>> https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Prefetched_images
>>>>>>>
>>>>>>> - Erik
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* [email protected] [mailto:
>>>>>>> [email protected]] *On Behalf Of *Toby Negrin
>>>>>>> *Sent:* Tuesday, January 06, 2015 18:49
>>>>>>> *To:* A mailing list for the Analytics Team at WMF and everybody
>>>>>>> who has an interest in Wikipedia and analytics.
>>>>>>> *Subject:* Re: [Analytics] Making EventLogging output to a log file
>>>>>>> instead of the DB
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Gilles -- why won't the page view logs work by themselves for
>>>>>>> this purpose? EL can be configured to write into Hadoop, which is
>>>>>>> probably the best way to get the throughput you need, but it seems
>>>>>>> overcomplicated.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -Toby
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 6, 2015 at 9:41 AM, Gilles Dubuc <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> This depends on [1] so we're not going to need that immediately, but
>>>>>>> in order to help Erik Zachte with his RfC [2] to track unique media
>>>>>>> views in Media Viewer, I'm going to need to use something almost
>>>>>>> exactly like EventLogging. The main difference is that it should skip
>>>>>>> writing to the database and write to a log file instead.
>>>>>>>
>>>>>>> That's because we'll be recording around 20-25M image views per day,
>>>>>>> which would needlessly overload EventLogging for little purpose, since
>>>>>>> the data will be used for offline stats generation and doesn't need to
>>>>>>> be made available in a relational database. Of course, if storage space
>>>>>>> and EventLogging capacity were no object, we could just use EL and keep
>>>>>>> the ever-growing table forever, but I have the impression that we want
>>>>>>> to be reasonable here and only write to a log, since that's what Erik
>>>>>>> needs.
>>>>>>>
>>>>>>> So here's the question: for a specific schema, can EventLogging work
>>>>>>> the way it does but only record hits to a log file (maybe it already
>>>>>>> does that before hitting the DB?) and not write to the DB? If not, how
>>>>>>> difficult would it be to make EL capable of doing that?
>>>>>>>
>>>>>>>
>>>>>>> [1] https://phabricator.wikimedia.org/T44815
>>>>>>> [2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Analytics mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics