I'd also like us to consider routing this dataset to Hadoop. I believe
there is already an EL-Kafka pipeline, and this would make it easy to
integrate page views with our regular processing.

Gilles -- are mobile page views included in your stream?

-Toby

On Wed, Jan 7, 2015 at 9:27 AM, Nuria Ruiz <[email protected]> wrote:

> >Great, then I guess it's a matter of only making the data go to files
> >and not to DB for the particular schema we'll create. Does that sound like
> >something feasible? How much work would be required to set it up?
> I do not think this is feasible in the near term without changes on our
> end. I am also not sure it is really needed. You are concerned about
> sending stuff to the DB due to "volume", correct? I do not understand why
> logging every single data point would be needed. Maybe you can explain
> that in a bit more detail for us to grasp the use case?
>
> If it is a matter of identifying distinct requests, that can be done on a
> sample of your dataset if it is large enough. We can help with that, and
> Leila just put together some docs in this regard. While they are written
> for Hive queries, the principles can apply elsewhere:
> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Counting_uniques
>
>
>
> On Wed, Jan 7, 2015 at 6:42 AM, Gilles Dubuc <[email protected]> wrote:
>
>>> Right -- couldn't we just tag the URL?
>>>
>>
>> The event of the user actually viewing the image is completely
>> disconnected from the URL hit in Media Viewer, which is why we need EL and
>> can't rely on existing server logs.
>>
>>
>>> Eventlogging data currently does go to files, as well as to the DB.
>>>
>>
>> Great, then I guess it's a matter of only making the data go to files and
>> not to DB for the particular schema we'll create. Does that sound like
>> something feasible? How much work would be required to set it up?
>>
>> On Tue, Jan 6, 2015 at 7:45 PM, Andrew Otto <[email protected]> wrote:
>>
>>> Eventlogging data currently does go to files, as well as to the DB.
>>> Check it out on stat1003 at /srv/eventlogging/archive.
>>>
>>> If you need something with higher throughput than eventlogging itself
>>> supports… then let’s talk :D
>>>
>>> -Ao
>>>
>>>
>>>
>>>
>>> On Jan 6, 2015, at 13:28, Erik Zachte <[email protected]> wrote:
>>>
>>> You mean attach an X-analytics parameter, for extra images beyond the
>>> one the user initially requested.
>>>
>>> But then we would undercount, basically missing all image views from
>>> clicking the right arrow in the image viewer.
>>> I'm not sure how much we would miss then.
>>> iirc Gilles said this browsing feature was used quite a lot, but I'm
>>> not sure.
>>>
>>>
>>> *From:* [email protected] [mailto:
>>> [email protected]] *On Behalf Of *Toby Negrin
>>> *Sent:* Tuesday, January 06, 2015 19:16
>>> *To:* A mailing list for the Analytics Team at WMF and everybody who
>>> has an interest in Wikipedia and analytics.
>>> *Subject:* Re: [Analytics] Making EventLogging output to a log file
>>> instead of the DB
>>>
>>>
>>>
>>> Right -- couldn't we just tag the URL?
>>>
>>>
>>>
>>> On Tue, Jan 6, 2015 at 10:10 AM, Erik Zachte <[email protected]>
>>> wrote:
>>>
>>> Just to clarify, this is about prefetched images which have not been
>>> shown to the public.
>>>
>>> They were sent to the browser ahead of a possible request to speed
>>> things up but in many cases never actually requested.
>>>
>>>
>>> https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Prefetched_images
>>>
>>> - Erik
>>>
>>>
>>>
>>> *From:* [email protected] [mailto:
>>> [email protected]] *On Behalf Of *Toby Negrin
>>> *Sent:* Tuesday, January 06, 2015 18:49
>>> *To:* A mailing list for the Analytics Team at WMF and everybody who
>>> has an interest in Wikipedia and analytics.
>>> *Subject:* Re: [Analytics] Making EventLogging output to a log file
>>> instead of the DB
>>>
>>>
>>>
>>> Hi Gilles -- why won't the page view logs work by themselves for this
>>> purpose? EL can be configured to write into Hadoop which is probably the
>>> best way to get the throughput you need but it seems overcomplicated.
>>>
>>>
>>>
>>> -Toby
>>>
>>>
>>>
>>> On Tue, Jan 6, 2015 at 9:41 AM, Gilles Dubuc <[email protected]>
>>> wrote:
>>>
>>> This depends on [1] so we're not going to need that immediately, but in
>>> order to help Erik Zachte with his RfC [2] to track unique media views in
>>> Media Viewer, I'm going to need to use something almost exactly like
>>> EventLogging. The main difference being that it should skip writing to the
>>> database and write to a log file instead.
>>>
>>> That's because we'll be recording around 20-25M image views per day,
>>> which would needlessly overload EventLogging for little purpose since the
>>> data will be used for offline stats generation and doesn't need to be made
>>> available in a relational database. Of course if storage space and
>>> EventLogging capacity were no object, we could just use EL and keep the
>>> ever-growing table forever, but I have the impression that we want to be
>>> reasonable here and only write to a log, since that's what Erik needs.
>>>
>>> So here's the question: for a specific schema, can EventLogging work the
>>> way it does but only record hits to a log file (maybe it already does that
>>> before hitting the DB?) and not write to the DB? If not, how difficult
>>> would it be to make EL capable of doing that?
>>>
>>>
>>> [1] https://phabricator.wikimedia.org/T44815
>>> [2]
>>> https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
>>>
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
