Hi Jeff,

That's already on Figshare, too, thanks to Tiziano:

https://figshare.com/articles/WCNPruning_input_set/6157445
category_graph_sept17.tsv.gz --> the category network without cycles
article_categories_sept17.tsv.gz --> mapping article-category
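
A minimal sketch for reading these two files in Python, in case it helps. It assumes (this is not confirmed anywhere in this thread) that each row holds at least two tab-separated fields: child category and parent category in the graph file, article and category in the mapping file. The function names are illustrative only; please check the actual column layout on Figshare before relying on it.

```python
import csv
import gzip
from collections import defaultdict


def read_pairs(path):
    """Yield the first two tab-separated fields of each row in a gzipped TSV.

    ASSUMPTION: rows look like "<key>\t<value>"; rows with fewer than two
    fields are skipped.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 2:
                yield row[0], row[1]


def load_multimap(path):
    """Build a dict mapping each key to the set of values seen with it."""
    mapping = defaultdict(set)
    for key, value in read_pairs(path):
        mapping[key].add(value)
    return mapping


def ancestors(category, parents):
    """Return every category reachable by following parent links upward.

    The walk terminates on any finite graph because visited nodes are
    tracked; on a DAG this is simply the set of all ancestors.
    """
    seen = set()
    stack = [category]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

With `parents = load_multimap("category_graph_sept17.tsv.gz")` and `cats = load_multimap("article_categories_sept17.tsv.gz")`, the top-level categories of an article would be the members of `ancestors(c, parents)` for each `c` in `cats[article]` that also appear in your chosen list of main topics.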

Best,
Leila


On Wed, May 23, 2018 at 5:49 PM, Jeffrey Levesque <[email protected]> wrote:
> Hi Leila,
> Again thank you very much 😊 If you don't mind providing the DAG snapshot that 
> would be great! This way we can try to replicate the steps you have outlined, 
> and make sure we are doing things correctly.
>
> Thank you!
> Jeff Levesque
>
> -----Original Message-----
> From: Leila Zia <[email protected]>
> Sent: Wednesday, May 23, 2018 8:30 PM
> To: Jeffrey Levesque <[email protected]>
> Cc: Wikimedia Answers <[email protected]>; A mailing list for the 
> Analytics Team at WMF and everybody who has an interest in Wikipedia and 
> analytics. <[email protected]>; Corey Jackson Jr 
> <[email protected]>; Jesse Warren <[email protected]>
> Subject: Re: Jeff Levesque: List of Articles By Categories (College Project)
>
> Hi Jeff and team,
>
> On Wed, May 23, 2018 at 4:57 PM, Jeffrey Levesque <[email protected]> wrote:
>> Hi Leila,
>> I was hoping to try to predict what categories of articles viewers would read:
>>
>> •       https://en.wikipedia.org/wiki/Category:Main_topic_classifications
>>
>> But, I realized that Wikipedia categories don't have a well-defined 
>> structure. For example, I think it's possible that articles could have a 
>> recursive chaining of categories (a subcategory could have many parent 
>> categories, and may continue indefinitely). So, it seems impossible to 
>> derive the idea of a "main category".  I was previously hoping that if it 
>> was possible to derive a "main category", I could extend the findings, by 
>> relating it to current socio-political events. To meet my course 
>> requirements, I may have to adjust our project idea. However, if you have 
>> any possibly related insights or strategies, that would be very much 
>> appreciated.
>
> Ok. So we have some things for you:
>
> * Check section 4.3. of https://arxiv.org/pdf/1804.05995.pdf . There we 
> describe a way to clean the category network. What you will get there is a 
> series of DAGs where cycles are removed and the relations are is-a.
>
> * We have a research showcase presentation on the above, if that
> helps: First presentation, goes for ~30min 
> https://www.youtube.com/watch?v=ACevHs0sMMw
>
> * The code for removing cycles is at
> https://github.com/epfl-dlab/GraphCyclesRemoval
>
> * The code for the pruning method is at 
> https://github.com/epfl-dlab/WCNPruning
>
> * We have done a (silent;) release of the data-set of the paper at
> https://figshare.com/articles/Structuring_Wikipedia_Articles_with_Section_Recommendations/6157583
> .
>
> If you want the already cleaned category network in the form of DAGs based on 
> a snapshot in 2017 (and if it's not already in these links, I'm blanking 
> now), we should be able to send it your way. Just say it.
>
> If the category prediction becomes too hairy and if you have more than a week 
> left, ;) ping and I'd be happy to brainstorm about what other questions 
> you can consider. (One thing that comes to mind is:
> characterizing articles, let's say in English Wikipedia, that have not been 
> read often in the past six months, and if you have time, contrasting them with 
> those that have been read often.)
>
>> Also, thank you very much for taking the time to respond to me!
>
> No worries. :)
>
> Good luck! This class of yours sounds really exciting.
>
> Leila
>
>>
>> Thank you,
>> Jeff Levesque
>>
>> -----Original Message-----
>> From: Leila Zia <[email protected]>
>> Sent: Wednesday, May 23, 2018 7:34 PM
>> To: Jeffrey Levesque <[email protected]>
>> Cc: Wikimedia Answers <[email protected]>; A mailing list for the
>> Analytics Team at WMF and everybody who has an interest in Wikipedia
>> and analytics. <[email protected]>
>> Subject: Re: Jeff Levesque: List of Articles By Categories (College
>> Project)
>>
>> + Analytics, our public analytics related mailing list [1]
>>
>> Hi Jeff,
>>
>> Let me give it a try:
>>
>> * Re pageviews: a lot has changed since the Kaggle contest days you
>> refer to. :) I highly recommend you check out
>> https://dumps.wikimedia.org/other/pagecounts-ez/ where our hourly
>> pageviews per article live. In case you need it, abbreviations used in
>> the file names are documented. [2]
>>
>> * Can you expand more on what you are trying to do? The short answer for your 
>> category related question is that you have to parse XML dumps, but we may 
>> have some good pointers for you to save you from that. If you tell us more, 
>> we're more likely to be able to help.
>>
>> * And, if you decide to continue research on Wiki(m|p)edia data (which
>> I hope you do:), consider signing up in our public research list at
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>>
>> Best,
>> Leila
>>
>> [1] https://lists.wikimedia.org/mailman/listinfo/analytics
>> [2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews
>>
>> --
>> Leila Zia
>> Senior Research Scientist, Lead
>> Wikimedia Foundation
>>
>>
>> On Wed, May 23, 2018 at 3:22 PM, Wikimedia Answers <[email protected]> 
>> wrote:
>>> Forwarding for your evaluation :) Feel free to include the wider
>>> Research team.
>>>
>>> best,
>>> Joe
>>>
>>> ---------- Forwarded message ----------
>>> From: Jeffrey Levesque <[email protected]>
>>> Date: Tue, May 22, 2018 at 7:48 AM
>>> Subject: Re: Jeff Levesque: List of Articles By Categories (College
>>> Project)
>>> To: "[email protected]" <[email protected]>
>>> Cc: "[email protected]" <[email protected]>
>>>
>>>
>>> Hi,
>>> Is there a known API, where I can supply the article name, and obtain
>>> the corresponding "category" the article belongs to? I'm thinking I
>>> could write a python script and iterate the kaggle dataset, then send
>>> some POST request to hopefully some existing API, to determine the article's 
>>> "category".
>>>
>>> Thank you,
>>>
>>> Jeff Levesque
>>> https://github.com/jeff1evesque
>>>
>>> On May 22, 2018, at 10:37 AM, Jeffrey Levesque <[email protected]> wrote:
>>>
>>> Hi,
>>> Do you guys have a more recent time series of Wikipedia article
>>> traffic? I'm noticing that the kaggle dataset does not have a lot of
>>> articles that are on Wikipedia. Do you guys have a good idea of how I
>>> can categorize the dataset I have?
>>>
>>> Thank you,
>>>
>>> Jeff Levesque
>>> https://github.com/jeff1evesque
>>>
>>> On May 22, 2018, at 8:40 AM, Jeffrey Levesque <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I am a master's student at Syracuse University. For my data science
>>> class, I am doing a project trying to analyze traffic patterns for
>>> Wikipedia. I’ve obtained the Kaggle dataset for 2015-2016 data:
>>>
>>>
>>>
>>> https://www.kaggle.com/headsortails/wiki-traffic-forecast-exploration-wtf-eda/data
>>>
>>>
>>>
>>> However, the dataset only provides the frequency of visits to
>>> particular pages on a given day. Could I request a list of
>>> articles grouped by “Categories”? I’ve tried to use the API (i.e.
>>> https://en.wikipedia.org/wiki/Special:Export). But, that doesn’t seem
>>> to generate a full output. Additionally, the list it supplies includes 
>>> subcategories.
>>> So, I tried using the URL API (i.e.
>>> https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics&format=json).
>>> But, that also seems to return an even shorter result set:
>>>
>>>
>>>
>>> {"batchcomplete":"","continue":{"cmcontinue":"page|2d2941313f2b292d3d0447454f31434f39293f011701dc16|55503653","continue":"-||"},
>>> "query":{"categorymembers":[
>>> {"pageid":22939,"ns":0,"title":"Physics"},
>>> {"pageid":24489,"ns":0,"title":"Outline of physics"},
>>> {"pageid":3445246,"ns":0,"title":"Glossary of classical physics"},
>>> {"pageid":1653925,"ns":100,"title":"Portal:Physics"},
>>> {"pageid":50926902,"ns":0,"title":"Action angle coordinates"},
>>> {"pageid":9079863,"ns":0,"title":"Aerometer"},
>>> {"pageid":52657328,"ns":0,"title":"Bayesian model of computational anatomy"},
>>> {"pageid":49342572,"ns":0,"title":"Group actions in computational anatomy"},
>>> {"pageid":50724262,"ns":0,"title":"Blasius\u2013Chaplygin formula"},
>>> {"pageid":33327002,"ns":0,"title":"Cabbeling"}]}}
>>>
>>>
>>>
>>>
>>>
>>> Thank you,
>>>
>>> Jeff Levesque
>>>
>>> (603) 969-5363
>>>
>>>

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
