Short answer: no. The raw data is not publicly available, so you cannot compute the dataset yourself; it is private data.

Thanks,
Nuria

On Mon, Mar 5, 2018 at 11:31 AM, Georg Sorst <[email protected]> wrote:

> Hi all,
>
> sorry for this messy post - I forgot to subscribe to the list so I can't
> directly reply to your responses.
>
> Nuria:
>
> > Datasets do not include simple wiki; they are calculated for a few wikis,
> > some of which are not very large, so you might be able to use them.
>
> Is the raw data available? Can I compute the clickstream myself?
>
> Erik:
>
> > This is actually how our production search ranking is built for around
> > the top 20 sites by search volume that we host. Simple wikipedia isn't
> > one of those we currently use machine ranking for though.
>
> Awesome! Is there more info available somewhere? Algorithms used etc.,
> maybe even source code?
>
> > Because of that we do have the data you need, but the problem will be
> > that the actual search queries are considered PII (Personally
> > Identifiable Information) and not something I can release publicly. It
> > may be possible to release aggregated data sets that don't include the
> > actual search terms, but at that point I don't think the data will be
> > useful to you anymore.
>
> I think I'm fine with query-document pairs. Isn't that sufficiently
> aggregated to not be considered PII?
>
> Thank you!
> Georg
>
> Georg Sorst <[email protected]> wrote on Wed, Feb 28, 2018 at 12:17:
>
>> Hi list,
>>
>> as part of a lecture on Information Retrieval I am giving, we work a lot
>> with Simple Wikipedia articles. It's a great data set because it's
>> comprehensive and not domain specific, so when building search on top of
>> it humans can easily judge result quality, and it's still small enough to
>> be handled by a regular computer.
>>
>> This year I want to cover the topic of Machine Learning for search. The
>> idea is to look at result clicks from an internal search engine, feed
>> that into the Machine Learning and adjust search accordingly so that the
>> top-clicked results actually rank best. We will be using Solr LTR for
>> this purpose.
>>
>> I would love to base this on Simple Wikipedia data since it would fit
>> well into the rest of the lecture. Unfortunately, I could not find that
>> data. The closest I came is
>> https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
>> but this neither covers Simple Wikipedia nor specifies internal search
>> queries.
>>
>> Did I miss something? Is this data available somewhere? Can I produce it
>> myself from raw data? Ideally I would need (query-document) pairs with
>> the number of occurrences.
>>
>> Thank you!
>> Georg
>> --
>> Georg M. Sorst | CTO
>> FINDOLOGIC
>> Jakob-Haringer-Str. 5a | 5020 Salzburg
>> T.: +43 662 456708
>> E.: [email protected]
>> www.findologic.com

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
