Erik,

Is there any documentation or further reading available on the machine
ranking used for Wikipedia? It sounds very interesting!

And can you elaborate on why the aggregated search queries count as PII?

Thank you!
Georg

Georg Sorst <g.so...@findologic.com> wrote on Mon, 5 Mar 2018 at
20:31:

> Hi all,
>
> sorry for this messy post - I forgot to subscribe to the list so I can't
> directly reply to your responses.
>
> Nuria:
>
> > Datasets do not include Simple Wikipedia; they are calculated for a few
> > wikis, some of which are not very large, so you might be able to use them.
>
> Is the raw data available? Can I compute the clickstream myself?
>
> Erik:
>
> > This is actually how our production search ranking is built for around the
> > top 20 sites by search volume that we host. Simple Wikipedia isn't one of
> > those we currently use machine ranking for, though.
>
> Awesome! Is there more info available somewhere? Algorithms used etc.
> maybe even source code?
>
> > Because of that we do have the data you need, but the problem will be that
> > the actual search queries are considered PII (Personally Identifiable
> > Information) and not something I can release publicly. It may be possible
> > to release aggregated data sets that don't include the actual search terms,
> > but at that point I don't think the data will be useful to you anymore.
>
> I think I'm fine with query-document pairs. Isn't that sufficiently
> aggregated to not be considered PII?
>
> Thank you!
> Georg
>
>
> Georg Sorst <g.so...@findologic.com> wrote on Wed, 28 Feb 2018 at
> 12:17:
>
>> Hi list,
>>
>> As part of a lecture on Information Retrieval that I am giving, we work a
>> lot with Simple Wikipedia articles. It's a great data set because it's
>> comprehensive and not domain-specific, so when building search on top of it
>> humans can easily judge result quality, and it's still small enough to be
>> handled by a regular computer.
>>
>> This year I want to cover the topic of Machine Learning for search. The
>> idea is to look at result clicks from an internal search engine, feed
>> them into a machine-learned ranking model, and adjust search accordingly so
>> that the top-clicked results actually rank best. We will be using Solr LTR
>> for this purpose.
>>
>> I would love to base this on Simple Wikipedia data since it would fit
>> well into the rest of the lecture. Unfortunately, I could not find that
>> data. The closest I came is
>> https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but this
>> neither covers Simple Wikipedia nor specifies internal search queries.
>>
>> Did I miss something? Is this data available somewhere? Can I produce it
>> myself from raw data? Ideally I would need (query-document) pairs with the
>> number of occurrences.
>>
>> Thank you!
>> Georg
-- 
*Georg M. Sorst | CTO*
FINDOLOGIC GmbH

Jakob-Haringer-Str. 5a | 5020 Salzburg | T.: +43 662 456708
E.: g.so...@findologic.com
www.findologic.com | Follow us on: XING
<https://www.xing.com/profile/Georg_Sorst> facebook
<https://www.facebook.com/Findologic> Twitter
<https://twitter.com/findologic>

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
