Hi Leila,
Again thank you very much 😊 If you don't mind providing the DAG snapshot that 
would be great! This way we can try to replicate the steps you have outlined, 
and make sure we are doing things correctly.

Thank you!
Jeff Levesque

-----Original Message-----
From: Leila Zia <le...@wikimedia.org> 
Sent: Wednesday, May 23, 2018 8:30 PM
To: Jeffrey Levesque <jleve...@syr.edu>
Cc: Wikimedia Answers <answ...@wikimedia.org>; A mailing list for the Analytics 
Team at WMF and everybody who has an interest in Wikipedia and analytics. 
<analytics@lists.wikimedia.org>; Corey Jackson Jr <cjack...@syr.edu>; Jesse 
Warren <jwarr...@syr.edu>
Subject: Re: Jeff Levesque: List of Articles By Categories (College Project)

Hi Jeff and team,

On Wed, May 23, 2018 at 4:57 PM, Jeffrey Levesque <jleve...@syr.edu> wrote:
> Hi Leila,
> I was hoping to try to predict which categories of articles viewers would read:
>
> •       https://en.wikipedia.org/wiki/Category:Main_topic_classifications
>
> But, I realized that Wikipedia categories don't have a well-defined 
> structure. For example, I think it's possible that articles could have a 
> recursive chaining of categories (a subcategory could have many parent 
> categories, and may continue indefinitely). So, it seems impossible to derive 
> the idea of a "main category".  I was previously hoping that if it was 
> possible to derive a "main category", I could extend the findings, by 
> relating it to current socio-political events. To meet my course 
> requirements, I may have to adjust our project idea. However, if you have 
> possible (maybe related) insights / strategies, that would be much 
> appreciated.

Ok. So we have some things for you:

* Check section 4.3. of https://arxiv.org/pdf/1804.05995.pdf . There we 
describe a way to clean the category network. What you will get there is a 
series of DAGs where cycles are removed and the relations are is-a.

* We have a research showcase presentation on the above, if that
helps: First presentation, goes for ~30min 
https://www.youtube.com/watch?v=ACevHs0sMMw

* The code for removing cycles is at
https://github.com/epfl-dlab/GraphCyclesRemoval

* The code for the pruning method is at https://github.com/epfl-dlab/WCNPruning

* We have done a (silent;) release of the data-set of the paper at
https://figshare.com/articles/Structuring_Wikipedia_Articles_with_Section_Recommendations/6157583
.
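
As a rough illustration of what "removing cycles" means for a category graph, here is a minimal Python sketch. It is not the algorithm from the GraphCyclesRemoval repository or the paper; it simply drops every back edge found by a depth-first search, which is one simple way to end up with a DAG. The function name `remove_cycles` and the toy category names are made up for this example.

```python
from collections import defaultdict


def remove_cycles(edges):
    """Return the edges that remain after dropping DFS back edges.

    `edges` is a list of (parent, child) pairs. Hypothetical helper for
    illustration only; not the method used in GraphCyclesRemoval.
    """
    graph = defaultdict(list)
    for parent, child in edges:
        graph[parent].append(child)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = defaultdict(int)
    kept = []

    for root in list(graph):
        if color[root] != WHITE:
            continue
        color[root] = GREY
        stack = [(root, iter(graph[root]))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if color[child] == GREY:
                    # Back edge: child is an ancestor of node, so it closes a cycle.
                    continue
                kept.append((node, child))
                if color[child] == WHITE:
                    color[child] = GREY
                    stack.append((child, iter(graph.get(child, []))))
                    break  # descend into the child before the remaining siblings
            else:
                # All children handled: this node is finished.
                color[node] = BLACK
                stack.pop()
    return kept


# Toy example (made-up categories): Science -> Physics -> Mechanics -> Science is a cycle.
toy_edges = [
    ("Science", "Physics"),
    ("Physics", "Mechanics"),
    ("Mechanics", "Science"),
    ("Science", "Chemistry"),
]
dag_edges = remove_cycles(toy_edges)
```

Which edge of a cycle gets dropped here depends on the traversal order, so a real cleanup would want a more principled choice, as the paper discusses.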

If you want the already cleaned category network in the form of DAGs based on a 
snapshot in 2017 (and if it's not already in these links, I'm blanking now), we 
should be able to send it your way. Just say so.

If the category prediction becomes too hairy and if you have more than a week's 
time left, ;) ping me and I'd be happy to brainstorm about what other questions 
you can consider. (One thing that comes to mind is:
characterizing articles, let's say in English Wikipedia, that have not been 
read often in the past six months, and if you have time, contrasting them with 
those that have been read often.)
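
A first step for that kind of comparison could look like the sketch below. It assumes you already have daily view counts per article in memory; the function name `split_by_readership`, the threshold, and the toy numbers are all made up for illustration, and what counts as "read often" is a choice you would have to justify.

```python
def split_by_readership(daily_views, threshold):
    """Split titles into (rarely_read, often_read) by total views.

    `daily_views` maps an article title to a list of daily view counts
    covering the period of interest (for example six months).
    `threshold` is an arbitrary cut-off chosen by the analyst.
    """
    totals = {title: sum(counts) for title, counts in daily_views.items()}
    rarely_read = sorted(t for t, total in totals.items() if total < threshold)
    often_read = sorted(t for t, total in totals.items() if total >= threshold)
    return rarely_read, often_read


# Toy data, not real pageviews.
toy_views = {
    "Article A": [0, 1, 0],
    "Article B": [100, 200, 150],
    "Article C": [2, 3, 1],
}
rarely, often = split_by_readership(toy_views, threshold=10)
```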

> Also, thank you very much for taking the time to respond to me!

No worries. :)

Good luck! This class of yours sounds really exciting.

Leila

>
> Thank you,
> Jeff Levesque
>
> -----Original Message-----
> From: Leila Zia <le...@wikimedia.org>
> Sent: Wednesday, May 23, 2018 7:34 PM
> To: Jeffrey Levesque <jleve...@syr.edu>
> Cc: Wikimedia Answers <answ...@wikimedia.org>; A mailing list for the 
> Analytics Team at WMF and everybody who has an interest in Wikipedia 
> and analytics. <analytics@lists.wikimedia.org>
> Subject: Re: Jeff Levesque: List of Articles By Categories (College 
> Project)
>
> + Analytics, our public analytics related mailing list [1]
>
> Hi Jeff,
>
> Let me give it a try:
>
> * Re pageviews: a lot has changed since the Kaggle contest days you 
> refer to. :) I highly recommend you check out 
> https://dumps.wikimedia.org/other/pagecounts-ez/ where our hourly 
> pageviews per article live. In case you need it, abbreviations used in 
> the file names are documented. [2]
>
> * Can you expand more on what you are trying to do? The short answer for your 
> category related question is that you have to parse XML dumps, but we may 
> have some good pointers for you to save you from that. If you tell us more, 
> we're more likely to be able to help.
>
> * And, if you decide to continue research on Wiki(m|p)edia data (which 
> I hope you do:), consider signing up in our public research list at 
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
>
> Best,
> Leila
>
> [1] https://lists.wikimedia.org/mailman/listinfo/analytics
> [2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews
>
> --
> Leila Zia
> Senior Research Scientist, Lead
> Wikimedia Foundation
>
>
> On Wed, May 23, 2018 at 3:22 PM, Wikimedia Answers <answ...@wikimedia.org> 
> wrote:
>> Forwarding for your evaluation :) Feel free to include the wider 
>> Research team.
>>
>> best,
>> Joe
>>
>> ---------- Forwarded message ----------
>> From: Jeffrey Levesque <jleve...@syr.edu>
>> Date: Tue, May 22, 2018 at 7:48 AM
>> Subject: Re: Jeff Levesque: List of Articles By Categories (College
>> Project)
>> To: "info...@wikimedia.org" <info...@wikimedia.org>
>> Cc: "answ...@wikimedia.org" <answ...@wikimedia.org>
>>
>>
>> Hi,
>> Is there a known API, where I can supply the article name, and attain 
>> the corresponding "category" the article belongs to? I'm thinking I 
>> could write a python script and iterate the kaggle dataset, then send 
>> some POST request to hopefully some existing API, to determine the article's 
>> "category".
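
For reference, the MediaWiki Action API can list the categories of a page with `action=query&prop=categories`, and a plain GET request is enough. The Python sketch below is one possible way to use it, not an official client: the helper names and the User-Agent string are placeholders invented for this example, and the parser assumes the default JSON layout in which `query.pages` is keyed by page id.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def categories_url(title):
    """Build the query URL for the categories of one article."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "cllimit": "max",      # ask for as many categories per request as allowed
        "clshow": "!hidden",   # skip hidden maintenance categories
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def parse_categories(response):
    """Extract category titles from a decoded JSON response."""
    names = []
    for page in response.get("query", {}).get("pages", {}).values():
        for category in page.get("categories", []):
            names.append(category["title"])
    return names


def fetch_categories(title):
    """Fetch and parse in one go (needs network access)."""
    # Placeholder User-Agent; replace with something that identifies your project.
    request = Request(categories_url(title), headers={"User-Agent": "category-demo/0.1"})
    with urlopen(request) as reply:
        return parse_categories(json.load(reply))
```

Calling `fetch_categories("Physics")` should return titles of the form `Category:...`. An article usually belongs to several categories, so the result is a list rather than a single "category".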
>>
>> Thank you,
>>
>> Jeff Levesque
>> https://github.com/jeff1evesque
>>
>> On May 22, 2018, at 10:37 AM, Jeffrey Levesque <jleve...@syr.edu> wrote:
>>
>> Hi,
>> Do you guys have a more recent time series of Wikipedia article 
>> traffic? I'm noticing that the kaggle dataset does not have a lot of 
>> articles that are on Wikipedia. Do you guys have a good idea of how I 
>> can categorize the dataset I have?
>>
>> Thank you,
>>
>> Jeff Levesque
>> https://github.com/jeff1evesque
>>
>> On May 22, 2018, at 8:40 AM, Jeffrey Levesque <jleve...@syr.edu> wrote:
>>
>> Hi,
>>
>> I am a master’s student at Syracuse University. For my data science 
>> class, I am doing a project trying to analyze traffic patterns for 
>> Wikipedia. I’ve obtained the Kaggle dataset for 2015-2016 data:
>>
>>
>>
>> https://www.kaggle.com/headsortails/wiki-traffic-forecast-exploration-wtf-eda/data
>>
>>
>>
>> However, the dataset only provides the frequency of visits to 
>> particular pages on a given day. Could I request a list of 
>> articles grouped by “Categories”? I’ve tried to use the API (i.e.
>> https://en.wikipedia.org/wiki/Special:Export). But, that doesn’t seem 
>> to generate a full output. Additionally, in the list it supplies 
>> subcategories.
>> So, I tried using the URL API (i.e.
>> https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics&format=json).
>> But, that also seems to return an even shorter result set:
>>
>>
>>
>> {"batchcomplete":"","continue":{"cmcontinue":"page|2d2941313f2b292d3d0447454f31434f39293f011701dc16|55503653","continue":"-||"},
>> "query":{"categorymembers":[
>> {"pageid":22939,"ns":0,"title":"Physics"},
>> {"pageid":24489,"ns":0,"title":"Outline of physics"},
>> {"pageid":3445246,"ns":0,"title":"Glossary of classical physics"},
>> {"pageid":1653925,"ns":100,"title":"Portal:Physics"},
>> {"pageid":50926902,"ns":0,"title":"Action angle coordinates"},
>> {"pageid":9079863,"ns":0,"title":"Aerometer"},
>> {"pageid":52657328,"ns":0,"title":"Bayesian model of computational anatomy"},
>> {"pageid":49342572,"ns":0,"title":"Group actions in computational anatomy"},
>> {"pageid":50724262,"ns":0,"title":"Blasius\u2013Chaplygin formula"},
>> {"pageid":33327002,"ns":0,"title":"Cabbeling"}]}}
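
The result set above is short because `list=categorymembers` returns 10 members per request by default and signals that more are available through the `continue` object. Passing those values back with the next request (and raising `cmlimit`) walks through the whole category. The Python sketch below shows that loop; `iter_category_members` and the `fetch` callable are names invented for this example, where `fetch` stands for any function that sends the given query parameters to the API and returns the decoded JSON.

```python
def iter_category_members(fetch, category, limit=500):
    """Yield category member dicts, following `continue` until exhausted.

    `fetch` is a caller-supplied function: it takes a dict of query
    parameters and returns the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    while True:
        response = fetch(params)
        yield from response.get("query", {}).get("categorymembers", [])
        if "continue" not in response:
            break  # no more batches
        # Merge the continuation values (e.g. cmcontinue) into the next request.
        params = {**params, **response["continue"]}


# Offline demonstration with two canned batches shaped like the API output.
canned = [
    {"continue": {"cmcontinue": "page|X|1", "continue": "-||"},
     "query": {"categorymembers": [{"ns": 0, "title": "Physics"}]}},
    {"query": {"categorymembers": [{"ns": 0, "title": "Aerometer"}]}},
]
seen_params = []


def fake_fetch(params):
    seen_params.append(dict(params))
    return canned[len(seen_params) - 1]


titles = [m["title"] for m in iter_category_members(fake_fetch, "Category:Physics")]
```

Members with `"ns":14` are subcategories, so collecting every article under a topic means repeating the same loop for each of them.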
>>
>>
>>
>>
>>
>> Thank you,
>>
>> Jeff Levesque
>>
>> (603) 969-5363
>>
>>
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
