> If I were to venture into writing a changeset for this (made into a task:
> https://phabricator.wikimedia.org/T87177 ), is everything self-contained
> in the EventLogging extension
For the proposed solution of sending events to Kafka/Hadoop, the answer
will be that there is work to do in the EL extension, puppet, and likely
refinery, as you would need to create a partition where your data would go.
I think a meeting is in order to get a concrete idea of what we want to do.



On Mon, Jan 19, 2015 at 5:44 AM, Gilles Dubuc <[email protected]> wrote:

> If I were to venture into writing a changeset for this (made into a task:
> https://phabricator.wikimedia.org/T87177 ), is everything self-contained
> in the EventLogging extension or are there external parts involved in the
> current pipeline sending events to the DB in production that I need to be
> aware of?
>
> On Fri, Jan 9, 2015 at 8:40 AM, Gilles Dubuc <[email protected]> wrote:
>
>>> I think Gilles and Erik want to calculate page views for GLAM mainly
>>> (although there are some other good reasons too) -- sampling would probably
>>> be ok but we'd miss the long tail of views.
>>>
>>
>> That's correct. We're looking to compile media view counts as accurate as
>> the ones we have for article views at the moment. Sampling would be fine to
>> identify the X most viewed media across a wiki, but it definitely wouldn't
>> help small GLAMs who want to get that information about their own
>> collection, if their media happen to be "low traffic" in the grand scheme
>> of things. I think that the latter is the main use case for doing this,
>> which is why I'm looking for a solution that wouldn't involve sampling.
>>
>> Compiling the top list has entertainment value, but letting GLAM
>> contributors get accurate statistics about their content improves the
>> chances that they will keep contributing more. I think that's more
>> valuable than the entertainment factor of the top list.
>>
>> On Wed, Jan 7, 2015 at 8:02 PM, Toby Negrin <[email protected]>
>> wrote:
>>
>>> I think Gilles and Erik want to calculate page views for GLAM mainly
>>> (although there are some other good reasons too) -- sampling would probably
>>> be ok but we'd miss the long tail of views.
>>>
>>> On Wed, Jan 7, 2015 at 10:56 AM, Nuria Ruiz <[email protected]> wrote:
>>>
>>>> I see. My main point was that -regardless of collection method- we
>>>> might not need every single data point to calculate uniques.
>>>>
>>>> On Wed, Jan 7, 2015 at 10:38 AM, Toby Negrin <[email protected]>
>>>> wrote:
>>>>
>>>>> Yes -- we disabled it because there wasn't a use case. We have one now
>>>>> :)
>>>>>
>>>>> On Wed, Jan 7, 2015 at 10:32 AM, Nuria Ruiz <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> > I believe there is already an EL-Kafka pipeline and this would
>>>>>> make it easy to integrate page views with our regular processing.
>>>>>>
>>>>>> Note that the pipeline was disabled 6 months ago and thus my comment
>>>>>> "in the near term"
>>>>>>
>>>>>> https://github.com/wikimedia/operations-puppet/commit/f85b1dbcd61bbb58684ff93704c1804e808a5d6e
>>>>>>
>>>>>> On Wed, Jan 7, 2015 at 9:39 AM, Toby Negrin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I'd also like us to consider routing this dataset to hadoop. I
>>>>>>> believe there is already an EL-Kafka pipeline and this would make it
>>>>>>> easy to integrate page views with our regular processing.
>>>>>>>
>>>>>>> Gilles -- are mobile page views included in your stream?
>>>>>>>
>>>>>>> -Toby
>>>>>>>
>>>>>>> On Wed, Jan 7, 2015 at 9:27 AM, Nuria Ruiz <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> > Great, then I guess it's a matter of only making the data go to
>>>>>>>> > files and not to DB for the particular schema we'll create. Does that
>>>>>>>> > sound like something feasible? How much work would be required to set
>>>>>>>> > it up?
>>>>>>>>
>>>>>>>> I do not think this is feasible in the near term without changes on our
>>>>>>>> end. I am also not sure it is really needed. You are concerned about
>>>>>>>> sending stuff to the DB due to "volume", correct? I do not understand
>>>>>>>> why logging every single data point would be needed. Maybe you can
>>>>>>>> explain that in a bit more detail for us to grasp the use case?
>>>>>>>>
>>>>>>>> If it is a matter of identifying distinct requests, that can be done
>>>>>>>> on a sample of your dataset if it is large enough. We can help with
>>>>>>>> that, and Leila just put together some docs in this regard; while they
>>>>>>>> are written for Hive queries, the principles can apply elsewhere:
>>>>>>>> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Counting_uniques
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jan 7, 2015 at 6:42 AM, Gilles Dubuc <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>> Right -- couldn't we just tag the URL?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The event of the user actually viewing the image is completely
>>>>>>>>> disconnected from the URL hit in Media Viewer, which is why we need
>>>>>>>>> EL and can't rely on existing server logs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Eventlogging data currently does go to files, as well as to the
>>>>>>>>>> DB.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Great, then I guess it's a matter of only making the data go to
>>>>>>>>> files and not to DB for the particular schema we'll create. Does that
>>>>>>>>> sound like something feasible? How much work would be required to set
>>>>>>>>> it up?
>>>>>>>>>
>>>>>>>>> On Tue, Jan 6, 2015 at 7:45 PM, Andrew Otto <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Eventlogging data currently does go to files, as well as to the
>>>>>>>>>> DB.  Check it out on stat1003 at /srv/eventlogging/archive.
>>>>>>>>>>
>>>>>>>>>> If you need something with higher throughput than eventlogging
>>>>>>>>>> itself supports…then let’s talk :D
>>>>>>>>>>
>>>>>>>>>> -Ao
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Jan 6, 2015, at 13:28, Erik Zachte <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> You mean attach an X-analytics parameter, for extra images beyond
>>>>>>>>>> the one the user initially requested.
>>>>>>>>>>
>>>>>>>>>> But then we would undercount, basically missing all image views
>>>>>>>>>> from clicking right arrow in image viewer.
>>>>>>>>>> I'm not sure how much we would miss then.
>>>>>>>>>> iirc Gilles said this browsing feature was used quite a lot, but
>>>>>>>>>> I'm not sure.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *From:* [email protected] *On Behalf Of* Toby Negrin
>>>>>>>>>> *Sent:* Tuesday, January 06, 2015 19:16
>>>>>>>>>> *To:* A mailing list for the Analytics Team at WMF and everybody
>>>>>>>>>> who has an interest in Wikipedia and analytics.
>>>>>>>>>> *Subject:* Re: [Analytics] Making EventLogging output to a log
>>>>>>>>>> file instead of the DB
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Right -- couldn't we just tag the URL?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 6, 2015 at 10:10 AM, Erik Zachte <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Just to clarify, this is about prefetched images which have not
>>>>>>>>>> been shown to the public.
>>>>>>>>>>
>>>>>>>>>> They were sent to the browser ahead of a possible request to
>>>>>>>>>> speed things up but in many cases never actually requested.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Prefetched_images
>>>>>>>>>>
>>>>>>>>>> - Erik
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *From:* [email protected] *On Behalf Of* Toby Negrin
>>>>>>>>>> *Sent:* Tuesday, January 06, 2015 18:49
>>>>>>>>>> *To:* A mailing list for the Analytics Team at WMF and everybody
>>>>>>>>>> who has an interest in Wikipedia and analytics.
>>>>>>>>>> *Subject:* Re: [Analytics] Making EventLogging output to a log
>>>>>>>>>> file instead of the DB
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Gilles -- why won't the page view logs work by themselves for
>>>>>>>>>> this purpose? EL can be configured to write into Hadoop which is
>>>>>>>>>> probably the best way to get the throughput you need but it seems
>>>>>>>>>> overcomplicated.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -Toby
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 6, 2015 at 9:41 AM, Gilles Dubuc <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> This depends on [1] so we're not going to need that immediately,
>>>>>>>>>> but in order to help Erik Zachte with his RfC [2] to track unique
>>>>>>>>>> media views in Media Viewer, I'm going to need to use something almost
>>>>>>>>>> exactly like EventLogging. The main difference being that it should
>>>>>>>>>> skip writing to the database and write to a log file instead.
>>>>>>>>>>
>>>>>>>>>> That's because we'll be recording around 20-25M image views per
>>>>>>>>>> day, which would needlessly overload EventLogging for little purpose
>>>>>>>>>> since the data will be used for offline stats generation and doesn't
>>>>>>>>>> need to be made available in a relational database. Of course if
>>>>>>>>>> storage space and EventLogging capacity were no object, we could just
>>>>>>>>>> use EL and keep the ever-growing table forever, but I have the
>>>>>>>>>> impression that we want to be reasonable here and only write to a log,
>>>>>>>>>> since that's what Erik needs.
>>>>>>>>>>
>>>>>>>>>> So here's the question: for a specific schema, can EventLogging
>>>>>>>>>> work the way it does but only record hits to a log file (maybe it
>>>>>>>>>> already does that before hitting the DB?) and not write to the DB? If
>>>>>>>>>> not, how difficult would it be to make EL capable of doing that?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] https://phabricator.wikimedia.org/T44815
>>>>>>>>>> [2]
>>>>>>>>>> https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Analytics mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics