Hi Erik,

Erik Bernhardson <ebernhard...@wikimedia.org> wrote on Wed, 14 March 2018 at 15:34:
> Sorry for the delayed response, I've been out the last week. Responses
> inline.
>
> On Mon, Mar 12, 2018 at 1:27 AM, Georg Sorst <g.so...@findologic.com>
> wrote:
>
>> Erik,
>>
>> is there some documentation / further reading available on the machine
>> ranking used for Wikipedia? This sounds very interesting!
>>
> The code for managing all the data and training models is in
> https://github.com/wikimedia/search-mjolnir. This is a pyspark
> application that starts with the logged click data and transforms it into
> trained models. The models are currently trained using xgboost, but we are
> considering lightgbm as a replacement. Collecting click data is done
> separately with some processing of web request logs to match up search
> requests with their clicks.
>
Great stuff, thank you!

>> And can you elaborate on how the aggregated search queries are PII?
>>
> The problem is that any aggregation of search queries that is meant to be
> used to learn a ranking function needs to be provided the original query
> string. That string is then not aggregated, it is passed straight through
> from the user's keyboard to the output data. We unfortunately don't have the
> kind of search volume, and don't keep long enough records (only 90 days),
> to place arbitrary limits for minimum unique sessions issuing a query
> and still have data that is representative of the whole. For example on
> English Wikipedia, which is by far the most popular, only 60% of search
> sessions involve a query that was issued more than 10 times in the last 90
> days. And 10 times is *way* too low for public release (I'm not sure where
> a reasonable cutoff might be, but it's certainly not 10).
>
>> Thank you!
>> Georg
>>
>>
>> Georg Sorst <g.so...@findologic.com> wrote on Mon, 5 March 2018 at
>> 20:31:
>>
>>> Hi all,
>>>
>>> sorry for this messy post - I forgot to subscribe to the list so I can't
>>> directly reply to your responses.
>>>
>>> Nuria:
>>>
>>> > Datasets do not include simple wiki, they are calculated for a few
>>> > wikis, some of which are not very large so you might be able to use
>>> > them.
>>>
>>> Is the raw data available? Can I compute the clickstream myself?
>>>
>>> Erik:
>>>
>>> > This is actually how our production search ranking is built for around
>>> > the top 20 sites by search volume that we host. Simple wikipedia isn't
>>> > one of those we currently use machine ranking for though.
>>>
>>> Awesome! Is there more info available somewhere? Algorithms used etc.
>>> maybe even source code?
>>>
> We use a DBN (Chapelle, 2009) to transform click stream data into labeled
> search result data, and then LambdaMART for the final ranking model. Link
> to mjolnir which does the training linked above.
>
>>> > Because of that we do have the data you need, but the problem will be
>>> > that the actual search queries are considered PII (Personally
>>> > Identifiable Information) and not something I can release publicly.
>>> > It may be possible to release aggregated data sets that don't include
>>> > the actual search terms, but at that point I don't think the data will
>>> > be useful to you anymore.
>>>
>>> I think I'm fine with query-document pairs. Isn't that sufficiently
>>> aggregated to not be considered PII?
>>>
> As mentioned above, the query is the hard part. Query strings contain
> arbitrary information and if you want to build a ranking function you have
> to have those original queries to do feature collection.
>
Just for my understanding (not a Machine Learning expert yet :) ): I would
need (query -> document) pairs such as ("machine learning" ->
https://en.wikipedia.org/wiki/Machine_learning) and how often each of these
pairs has occurred, right? Even if this pair has only occurred once, how is
this PII? Or do I need more than just (query -> document)?

Thank you so much, this is all very enlightening!
Georg

>>> Thank you!
>>> Georg
>>>
>>>
>>> Georg Sorst <g.so...@findologic.com> wrote on Wed, 28 Feb. 2018 at
>>> 12:17:
>>>
>>>> Hi list,
>>>>
>>>> as part of a lecture on Information Retrieval I am giving we work a lot
>>>> with Simple Wikipedia articles. It's a great data set because it's
>>>> comprehensive and not domain specific so when building search on top of it
>>>> humans can easily judge result quality, and it's still small enough to be
>>>> handled by a regular computer.
>>>>
>>>> This year I want to cover the topic of Machine Learning for search. The
>>>> idea is to look at result clicks from an internal search engine,
>>>> feed that into the Machine Learning and adjust search accordingly so that
>>>> the top-clicked results actually rank best. We will be using Solr LTR for
>>>> this purpose.
>>>>
>>>> I would love to base this on Simple Wikipedia data since it would fit
>>>> well into the rest of the lecture. Unfortunately, I could not find that
>>>> data. The closest I came is
>>>> https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but
>>>> this neither covers Simple Wikipedia nor specifies internal search
>>>> queries.
>>>>
>>>> Did I miss something? Is this data available somewhere? Can I produce
>>>> it myself from raw data? Ideally I would need (query-document) pairs with
>>>> the number of occurrences.
>>>>
>>>> Thank you!
>>>> Georg
>>>> --
>>>> *Georg M. Sorst I CTO*
>>>>
>>>> Jakob-Haringer-Str. 5a | 5020 Salzburg I T.: +43 662 456708
>>>> E.: g.so...@findologic.com
>>>> www.findologic.com Follow us on: XING
>>>> <https://www.xing.com/profile/Georg_Sorst> facebook
>>>> <http://www.facebook.com/Findologic/> Twitter
>>>> <https://twitter.com/findologic>
>>>>
>>>> See you at *Internet World* on 06.03. & 07.03.2018 in *Hall A6, Stand
>>>> E130 in München*! Book an appointment here
>>>> <berat...@findologic.com?subject=Internet%20World%20M%C3%BCnchen>!
>>>> See you at *SHOPTALK* from 18 to 21 March in *Las Vegas*! Book an
>>>> appointment here
>>>> <berat...@findologic.com?subject=SHOPTALK%20Las%20Vegas>!
>>>> See you at *SOM* on 18.04. & 19.04.2018 in *Hall 7, Stand G.17 in
>>>> Zürich*! Book an appointment here
>>>> <berat...@findologic.com?subject=SOM%20Z%C3%BCrich>!
>>>> Visit our *homepage* here <http://www.findologic.com>!
>>>>

--
*Georg M. Sorst I CTO*
FINDOLOGIC GmbH

Jakob-Haringer-Str. 5a | 5020 Salzburg I T.: +43 662 456708
E.: g.so...@findologic.com
www.findologic.com Follow us on: XING
<https://www.xing.com/profile/Georg_Sorst> facebook
<https://www.facebook.com/Findologic> Twitter
<https://twitter.com/findologic>

See you at *SHOPTALK* from 18. to 21.03 in *Las Vegas*! Book an appointment
here <berat...@findologic.com?subject=SHOPTALK%20Las%20Vegas>!
See you at *SOM* on 18.04. & 19.04. in *Hall 7, Stand G.17 in Zürich*! Book
an appointment here <berat...@findologic.com?subject=SOM%20Z%C3%BCrich>!
See you at the *SHOPWARE Community Day* on 18.05. *in Duisburg*! Book an
appointment here
<berat...@findologic.com?subject=Shopware%20Community%20Day>!
See you at *OXID Commons* on 14.06. *in Freiburg*! Book an appointment here
<berat...@findologic.com?subject=OXID%20Commons>!
Visit our *homepage* here <http://www.findologic.com>!
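
To make the (query -> document) question above concrete, here is a minimal Python sketch of counting such pairs and of applying a minimum-sessions cutoff like the one Erik describes. The click rows, the names `CLICKS`, `pair_counts` and `session_coverage`, and the thresholds are all assumptions made up for illustration; this is not how mjolnir or the Wikimedia pipeline works, and the data is invented.

```python
from collections import Counter, defaultdict

# Hypothetical click log rows: (session_id, query, clicked_document).
# All values are invented for illustration; this is not Wikimedia data.
CLICKS = [
    ("s1", "machine learning", "Machine_learning"),
    ("s2", "machine learning", "Machine_learning"),
    ("s3", "machine learning", "Deep_learning"),
    ("s3", "solr", "Apache_Solr"),
    ("s4", "solr", "Apache_Solr"),
    # A one-off query can carry personal details straight into the output.
    ("s5", "jane doe 12 example street", "Example_Street"),
]


def sessions_per_query(clicks):
    """Map each query string to the set of sessions that issued it."""
    sessions = defaultdict(set)
    for session_id, query, _doc in clicks:
        sessions[query].add(session_id)
    return sessions


def pair_counts(clicks, min_sessions=1):
    """Count (query, document) pairs, keeping only queries that were
    issued by at least `min_sessions` distinct sessions."""
    sessions = sessions_per_query(clicks)
    return Counter(
        (query, doc)
        for _session_id, query, doc in clicks
        if len(sessions[query]) >= min_sessions
    )


def session_coverage(clicks, min_sessions):
    """Fraction of sessions with at least one query passing the cutoff."""
    sessions = sessions_per_query(clicks)
    all_sessions = {session_id for session_id, _query, _doc in clicks}
    covered = {
        session_id
        for session_id, query, _doc in clicks
        if len(sessions[query]) >= min_sessions
    }
    return len(covered) / len(all_sessions)


if __name__ == "__main__":
    for pair, count in sorted(pair_counts(CLICKS, min_sessions=2).items()):
        print(pair, count)
    print("coverage:", session_coverage(CLICKS, min_sessions=2))
```

With this toy data, a cutoff of 2 sessions drops the one-off query (which here happens to contain a name and an address) and still covers 4 of the 5 sessions; raising the cutoff to 3 covers only 3 of 5. That is the trade-off between privacy and representativeness discussed above, just at toy scale.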
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics