Re: [Analytics] Wikipedia internal search clickstream

Georg Sorst Thu, 15 Mar 2018 06:41:26 -0700

Hi Erik,

Erik Bernhardson <[email protected]> wrote on Wed, 14 Mar 2018 at 15:34:


> Sorry for the delayed response, I've been out the last week. Responses
> inline.
>
> On Mon, Mar 12, 2018 at 1:27 AM, Georg Sorst <[email protected]>
> wrote:
>
>> Erik,
>>
>> is there some documentation / further reading available on the machine
>> ranking used for Wikipedia? This sounds very interesting!
>>
> The code for managing all the data and training models is in
> https://github.com/wikimedia/search-mjolnir. This is a PySpark
> application that starts with the logged click data and transforms it into
> trained models. The models are currently trained using XGBoost, but we are
> considering LightGBM as a replacement. Collecting click data is done
> separately, with some processing of web request logs to match up search
> requests with their clicks.
>

Great stuff, thank you!


>
>> And can you elaborate on how the aggregated search queries are PII?
>
> The problem is that any aggregation of search queries that is meant to be
> used to learn a ranking function needs to include the original query
> string. That string is then not aggregated; it is passed straight through
> from the user's keyboard to the output data. We unfortunately don't have the
> kind of search volume, and don't keep records long enough (only 90 days),
> to place arbitrary limits on the minimum number of unique sessions issuing a
> query and still have data that is representative of the whole. For example, on
> English Wikipedia, which is by far the most popular, only 60% of search
> sessions involve a query that was issued more than 10 times in the last 90
> days. And 10 times is *way* too low for public release (I'm not sure where
> a reasonable cutoff might be, but it's certainly not 10).
>
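
To make the cutoff trade-off described above concrete, here is a minimal Python sketch of how the share of sessions covered by a minimum-frequency threshold could be measured. The `(session_id, query)` event shape, the `session_coverage` name, the lowercase/strip normalization, and counting distinct sessions rather than raw requests are all assumptions for illustration; this is not how the Wikimedia pipeline computes its numbers.

```python
from collections import defaultdict


def session_coverage(events, min_sessions):
    """Fraction of sessions that issued at least one "frequent" query.

    events: iterable of (session_id, query) tuples (hypothetical log shape).
    A query counts as frequent when MORE than min_sessions distinct
    sessions issued it (an assumed definition of the cutoff).
    """
    sessions_by_query = defaultdict(set)
    queries_by_session = defaultdict(set)
    for session_id, query in events:
        normalized = query.strip().lower()  # assumed normalization
        sessions_by_query[normalized].add(session_id)
        queries_by_session[session_id].add(normalized)

    if not queries_by_session:
        return 0.0

    frequent = {
        q for q, sessions in sessions_by_query.items()
        if len(sessions) > min_sessions
    }
    covered = sum(1 for qs in queries_by_session.values() if qs & frequent)
    return covered / len(queries_by_session)
```

Raising `min_sessions` shrinks the set of releasable queries, so the returned fraction can only stay the same or fall; that is the tension between privacy and representativeness described in the quoted paragraph.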

>> Thank you!
>>
>> Georg
>>
>>
>> Georg Sorst <[email protected]> wrote on Mon, 5 Mar 2018 at 20:31:
>>
>>> Hi all,
>>>
>>> sorry for this messy post - I forgot to subscribe to the list so I can't
>>> directly reply to your responses.
>>>
>>> Nuria:
>>>
>>> > Datasets do not include simple wiki; they are calculated for a few
>>> > wikis, some of which are not very large, so you might be able to use them.
>>>
>>> Is the raw data available? Can I compute the clickstream myself?
>>>
>>> Erik:
>>>
>>> > This is actually how our production search ranking is built for around the
>>> > top 20 sites by search volume that we host. Simple Wikipedia isn't one of
>>> > those we currently use machine ranking for though.
>>>
>>> Awesome! Is there more info available somewhere? Algorithms used etc.
>>> maybe even source code?
>>>
>>>
> We use a DBN click model (Chapelle, 2009) to transform click stream data into
> labeled search result data, and then LambdaMART for the final ranking model.
> mjolnir, which does the training, is linked above.
>
>
>>> > Because of that we do have the data you need, but the problem will be
>>> > that the actual search queries are considered PII (Personally
>>> > Identifiable Information) and not something I can release publicly. It
>>> > may be possible to release aggregated data sets that don't include the
>>> > actual search terms, but at that point I don't think the data will be
>>> > useful to you anymore.
>>>
>>> I think I'm fine with query-document pairs. Isn't that sufficiently
>>> aggregated to not be considered PII?
>>>
> As mentioned above, the query is the hard part. Query strings contain
> arbitrary information, and if you want to build a ranking function you have
> to have those original queries to do feature collection.
>

Just for my understanding (not a machine learning expert yet :) ): I would
need (query -> document) pairs such as ("machine learning" ->
https://en.wikipedia.org/wiki/Machine_learning) and how often each of these
pairs has occurred, right? Even if a pair has occurred only once, how is
this PII? Or do I need more than just (query -> document)?
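
To make the question concrete, here is a minimal Python sketch of the aggregation I have in mind. The `aggregate_pairs` name, the `(query, document_url)` input shape, and the sample clicks (including the made-up personal query and its target URL) are purely hypothetical; nothing here reflects real Wikipedia data.

```python
from collections import Counter


def aggregate_pairs(clicks, min_count=1):
    """Count how often each (query, document) pair was clicked.

    clicks: iterable of (query, document_url) tuples (hypothetical input).
    Only pairs seen at least min_count times are returned.
    """
    counts = Counter((query, url) for query, url in clicks)
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical sample data, invented for illustration only.
clicks = [
    ("machine learning", "https://en.wikipedia.org/wiki/Machine_learning"),
    ("machine learning", "https://en.wikipedia.org/wiki/Machine_learning"),
    ("jane doe 12 example street", "https://en.wikipedia.org/wiki/Example"),
]

pairs = aggregate_pairs(clicks)
for (query, url), n in pairs.items():
    print(f"{n}\t{query}\t{url}")
```

Note that the query string is the key of the aggregate, so it appears verbatim in the output even when its count is 1. If I read Erik's earlier point correctly, that pass-through of whatever the user typed is the concern, and a `min_count` cutoff is the only thing that would filter it out.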

Thank you so much, this is all very enlightening!
Georg


>
>
>>> Thank you!
>>> Georg
>>>
>>>
>>> Georg Sorst <[email protected]> wrote on Wed, 28 Feb 2018 at 12:17:
>>>
>>>> Hi list,
>>>>
>>>> as part of a lecture on Information Retrieval that I am giving, we work a
>>>> lot with Simple Wikipedia articles. It's a great data set because it's
>>>> comprehensive and not domain-specific, so when building search on top of it
>>>> humans can easily judge result quality, and it's still small enough to be
>>>> handled by a regular computer.
>>>>
>>>> This year I want to cover the topic of Machine Learning for search. The
>>>> idea is to look at result clicks from an internal search engine, feed
>>>> them into a machine learning model, and adjust search accordingly so that
>>>> the top-clicked results actually rank best. We will be using Solr LTR for
>>>> this purpose.
>>>>
>>>> I would love to base this on Simple Wikipedia data since it would fit
>>>> well into the rest of the lecture. Unfortunately, I could not find that
>>>> data. The closest I came is
>>>> https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but
>>>> this neither covers Simple Wikipedia nor includes internal search
>>>> queries.
>>>>
>>>> Did I miss something? Is this data available somewhere? Can I produce
>>>> it myself from raw data? Ideally I would need (query-document) pairs with
>>>> the number of occurrences.
>>>>
>>>> Thank you!
>>>> Georg
-- 
*Georg M. Sorst I CTO*
FINDOLOGIC GmbH


Jakob-Haringer-Str. 5a | 5020 Salzburg I T.: +43 662 456708
E.: [email protected]
www.findologic.com Follow us on: XING
<https://www.xing.com/profile/Georg_Sorst> facebook
<https://www.facebook.com/Findologic> Twitter
<https://twitter.com/findologic>


_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
