I re-ran the sessions job including IP in the output. Several things:

- I'm happy to report that we are correctly filtering out the WMF public
IPs, though there are about 100k hits per day from 10.x.x.x IPs (about
0.5%, LVS health checks) that we missed. We'll update the filter to include
those.

- So, who is it? I ran the IPs of the top sessions through whois and tried
to extract the org name. The results (omitting IP for privacy reasons) are
here:

https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0Ai_u2wTiMldddHNrZVNVemF4MndaMTJLNnB6eGlQOHc#gid=0

A pretty interesting list.
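
For the 10.x.x.x filter update mentioned above, here is a minimal sketch of the check, assuming the health-check traffic is confined to the 10.0.0.0/8 private range and that each hit is a dict with an "ip" key. The names is_internal and drop_internal are hypothetical and are not taken from the actual job.

```python
import ipaddress

# Assumption: the LVS health checks all come from the 10.0.0.0/8 range.
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")


def is_internal(ip):
    """Return True if the address string falls inside 10.0.0.0/8."""
    try:
        return ipaddress.ip_address(ip) in INTERNAL_NET
    except ValueError:
        # Malformed address: do not treat it as internal.
        return False


def drop_internal(hits):
    """Keep only hits whose "ip" field is outside the internal range."""
    return [h for h in hits if not is_internal(h["ip"])]
```

A network-membership test avoids the false positives a plain string-prefix match could give on malformed or unusual address strings.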

--
David Schoonover
[email protected]


On Thu, Apr 25, 2013 at 10:38 AM, Haitham Shammaa <[email protected]> wrote:

> Maryana, that Wikipedia article is about a TV series that has been
> broadcast since 2006, but I don't think it's very popular.
>
> On the other hand, nobody seems to mention the crab Big Daddy in
> Japanese internet culture.
>
> *--*
> *Haitham Shammaa*
> *Contribution Research Manager*
> *Wikimedia Foundation*
>
> *Imagine a world in which every single human being can freely share in
> the sum of all knowledge. *
> *Click the "edit" button now, and help us make it a reality!*
>
>
> On Thu, Apr 25, 2013 at 10:19 AM, Maryana Pinchuk
> <[email protected]> wrote:
>
>> On Wed, Apr 24, 2013 at 9:17 PM, Dario Taraborelli <
>> [email protected]> wrote:
>>
>>> Dave,
>>>
>>> thanks for sharing this, the referral data is particularly fascinating.
>>> I mentioned during the quarterly review that I'd love to get a better sense
>>> of (1) the proportion of requests in the mobile request logs lacking a
>>> referral, (2) the possible causes of this gap and (3) to what extent these
>>> missing entries introduce a bias in the referral ranking.
>>>
>>> The 3rd most popular query (according to your dumps) is ビッグダディ (Japanese
>>> for "Big Daddy"), which presumably refers to this guy:
>>> http://metro.co.uk/2013/03/20/giant-japanese-spider-crab-big-daddy-arrives-at-blackpool-sea-life-centre-3550751/
>>> What's interesting is that there's no such entry on the Japanese
>>> Wikipedia, and I am baffled that people may have landed on the website via a
>>> search engine query for a nonexistent article.
>>> Do you have an explanation for this or am I misinterpreting what you
>>> mean by search query?
>>>
>>
>> There *is* an article on this on ja.wiki :) It may have been renamed
>> since then, but it's still the 2nd Google hit for ビッグダディ:
>> http://ja.wikipedia.org/wiki/%E7%97%9B%E5%BF%AB!%E3%83%93%E3%83%83%E3%82%B0%E3%83%80%E3%83%87%E3%82%A3
>>
>>
>>>
>>> Dario
>>>
>>> On Apr 24, 2013, at 8:40 PM, David Schoonover <[email protected]> wrote:
>>>
>>> Hiya all,
>>>
>>> As promised earlier today in the Analytics weekly showcase, I've got a
>>> few interesting bits of data to share from playing with the new Mobile Site
>>> Sessions dataset.
>>>
>>>
>>> # Visits to Mobile Site, 4/21/2013
>>>
>>> - Total Visits:                             51,624,103
>>> - Unique Visitors:                          37,736,120
>>> - Total Pageviews:                         104,972,033
>>> - Avg Pageviews per Session:                    2.0334
>>> - Max Pageviews in one Session:                141,882
>>>
>>> ## Standard Site
>>> - Visits:                                   51,603,221
>>> - Unique Visitors:                          37,723,188
>>> - Pageviews:                               104,910,382
>>> - Avg Pageviews per Session:                     2.033
>>>
>>> ## Alpha Site
>>> - Visits:                                          986
>>> - Unique Visitors:                                 822
>>> - Pageviews:                                     7,087
>>> - Avg Pageviews per Session:                     7.188
>>>
>>> ## Beta Site
>>> - Visits:                                       19,896
>>> - Unique Visitors:                              16,235
>>> - Pageviews:                                    54,564
>>> - Avg Pageviews per Session:                     2.742
>>>
>>>
>>> ## Notes
>>> - A session (or "visit") is defined as all activity with less than 30
>>> minutes between each hit. Intuitively speaking, a session ends when the
>>> user hasn't done anything in 30m.
>>> - As we do not set visitor_id cookies for all users, the "unique
>>> visitors" metric was calculated using hash(ip_address + users_agent) as
>>> visitor_id.
>>> - This job looked at all requests to the mobile site on 4/21/2013, which
>>> is 75.17 GB of request logs.
>>> - The job took ~17 minutes to process the day into 15.3 GB of sessions.
>>> - The summary above took maybe 10 minutes to set up/write in Hive, and
>>> the job took maybe 7 minutes.
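
The session and visitor_id definitions in the notes above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Hive job itself: SHA-1 stands in for the unspecified hash(), timestamps are plain seconds, and the names visitor_id, sessionize, and SESSION_GAP_SECONDS are hypothetical.

```python
import hashlib
from collections import defaultdict

# A session continues while consecutive hits are less than 30 minutes apart.
SESSION_GAP_SECONDS = 30 * 60


def visitor_id(ip_address, user_agent):
    """Stand-in for hash(ip_address + user agent); SHA-1 is an assumption."""
    return hashlib.sha1((ip_address + user_agent).encode("utf-8")).hexdigest()


def sessionize(hits):
    """Group (timestamp_seconds, ip, user_agent) hits into sessions.

    Returns {visitor_id: [session, ...]}, where each session is a list of
    timestamps in ascending order.
    """
    by_visitor = defaultdict(list)
    for ts, ip, ua in hits:
        by_visitor[visitor_id(ip, ua)].append(ts)

    sessions = {}
    for vid, stamps in by_visitor.items():
        stamps.sort()
        current = [stamps[0]]
        visitor_sessions = [current]
        for ts in stamps[1:]:
            if ts - current[-1] < SESSION_GAP_SECONDS:
                current.append(ts)          # same session
            else:
                current = [ts]              # gap of 30m or more: new session
                visitor_sessions.append(current)
        sessions[vid] = visitor_sessions
    return sessions
```

Average pageviews per session is then just total hits divided by total sessions, e.g. 104,972,033 / 51,624,103 ≈ 2.0334 for the day above.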
>>>
>>>
>>> In addition to that summary, I ran a few jobs on the entry_referer field
>>> -- the URL that referred the user to us when the session started. Obvious
>>> caveats: this is only one day of data, and it's only the mobile site. Draw
>>> conclusions with care.
>>>
>>> First, I pulled out the top referring domains. It's mostly as you'd
>>> expect -- search engines -- though you'll also note that several Wikipedia
>>> mobile sites show up. My working hypothesis is that people don't tend to
>>> close tabs on smartphones; when they later come back, it is often to an
>>> open Wikipedia tab, so clicking a link or performing a search means the
>>> referrer is still us.
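
As a rough illustration of the top-referring-domains step described above, here is a minimal sketch assuming entry_referer values are available as plain URL strings. The function name top_entry_domains is hypothetical, and the real job ran on the cluster rather than in a script like this.

```python
from collections import Counter
from urllib.parse import urlparse


def top_entry_domains(referers, n=10):
    """Count referring hostnames, skipping empty or unparseable referers."""
    counts = Counter()
    for ref in referers:
        host = urlparse(ref).hostname  # None for "", "-", and similar
        if host:
            counts[host] += 1
    return counts.most_common(n)
```

Sessions with a missing referer are simply skipped here, so this says nothing about how large that gap is.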
>>>
>>> Since -- as expected -- so much of the data pertained to search engines,
>>> I also calculated the top search queries and top keywords that sent people
>>> to us. (For keywords, I've filtered out common "stop words": de, of, in,
>>> is, la, and, el, es, to, en, di, los, le, da, se, las, les, il, du, a, i,
>>> o, y, e.) In both, you see the predictable: lots of searches for porn, for
>>> "facebook", for "wiki", etc. But you also see a few things that surprised
>>> me:
>>>
>>> - Tons of Japanese. Japan is the most mobile-enabled country in the
>>> world so I guess we should have expected to see many searches in Japanese
>>> show up in the top queries. I've left them URL-encoded in the results --
>>> you'll see them as weird lines with % in them.
>>>
>>> - Apparently people search for movies and TV so they can spoil their fun
>>> by reading about them on Wikipedia. Both "movies" and "film" show up in
>>> the top keywords; Iron Man 1, 2, AND 3 all show up in the top search
>>> queries. I didn't expect this to be a major use case, but -- wikigroaning
>>> aside -- it's an interesting fact.
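
The query and keyword extraction described above could look roughly like the sketch below. It assumes the search terms sit in a "q" URL parameter, which holds for some engines but not all, and the names entry_query and top_keywords are hypothetical. The stop-word list is the one given earlier in this mail.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse

STOP_WORDS = set(
    "de of in is la and el es to en di los le da se las les il du "
    "a i o y e".split()
)


def entry_query(referer, param="q"):
    """Return the decoded search query from a referer URL, or None.

    Assumption: the engine passes the query in the `param` parameter.
    """
    values = parse_qs(urlparse(referer).query).get(param)
    return values[0] if values else None


def top_keywords(referers, n=10):
    """Count whitespace-separated query words, ignoring stop words."""
    counts = Counter()
    for ref in referers:
        query = entry_query(ref)
        if not query:
            continue
        for word in query.lower().split():
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)
```

Note that parse_qs decodes the %-escapes, so Japanese queries come out as readable text rather than the URL-encoded form left in the published results.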
>>>
>>> I'm sure we're only scratching the surface here. This is an exciting
>>> dataset, and I'm sure there's lots more to learn!
>>>
>>> The full results:
>>> - Top Referring Entry Domains:
>>> http://stats.wikimedia.org/kraken-public/webrequest/mobile/views/sessions/mobile_sessions-2013-04-21-top_entry_domains.tsv
>>> - Top Referring Entry Search Queries:
>>> http://stats.wikimedia.org/kraken-public/webrequest/mobile/views/sessions/mobile_sessions-2013-04-21-top_entry_search_queries.tsv
>>> - Top Referring Entry Search Keywords:
>>> http://stats.wikimedia.org/kraken-public/webrequest/mobile/views/sessions/mobile_sessions-2013-04-21-top_entry_keywords.tsv
>>>
>>> Questions are welcome!
>>>
>>>
>>> --
>>> David Schoonover
>>> [email protected]
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>>
>>
>>
>> --
>> Maryana Pinchuk
>> Associate Product Manager, Wikimedia Foundation
>> wikimediafoundation.org
>>
>
_______________________________________________
Mobile-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mobile-l
