Hi Jeff,

That's already on Figshare, too, thanks to Tiziano:
https://figshare.com/articles/WCNPruning_input_set/6157445

category_graph_sept17.tsv.gz --> the category network without cycles
article_categories_sept17.tsv.gz --> mapping article-category

Best,
Leila

On Wed, May 23, 2018 at 5:49 PM, Jeffrey Levesque <[email protected]> wrote:
> Hi Leila,
> Again thank you very much. If you don't mind providing the DAG snapshot, that
> would be great! This way we can try to replicate the steps you have outlined,
> and make sure we are doing things correctly.
>
> Thank you!
> Jeff Levesque
>
> -----Original Message-----
> From: Leila Zia <[email protected]>
> Sent: Wednesday, May 23, 2018 8:30 PM
> To: Jeffrey Levesque <[email protected]>
> Cc: Wikimedia Answers <[email protected]>; A mailing list for the
> Analytics Team at WMF and everybody who has an interest in Wikipedia and
> analytics. <[email protected]>; Corey Jackson Jr
> <[email protected]>; Jesse Warren <[email protected]>
> Subject: Re: Jeff Levesque: List of Articles By Categories (College Project)
>
> Hi Jeff and team,
>
> On Wed, May 23, 2018 at 4:57 PM, Jeffrey Levesque <[email protected]> wrote:
>> Hi Leila,
>> I was hoping to try to predict what categories of articles viewers would read:
>>
>> • https://en.wikipedia.org/wiki/Category:Main_topic_classifications
>>
>> But, I realized that Wikipedia categories don't have a well-defined
>> structure. For example, I think it's possible that articles could have a
>> recursive chaining of categories (a subcategory could have many parent
>> categories, and the chain may continue indefinitely). So, it seems impossible
>> to derive the idea of a "main category". I was previously hoping that if it
>> was possible to derive a "main category", I could extend the findings by
>> relating it to current socio-political events. To meet my course
>> requirements, I may have to adjust our project idea. However, if you have
>> possible (maybe related) insights / strategies, that would be very
>> appreciated.
>
> Ok.
> So we have some things for you:
>
> * Check section 4.3. of https://arxiv.org/pdf/1804.05995.pdf . There we
> describe a way to clean the category network. What you will get there is a
> series of DAGs where cycles are removed and the relations are is-a.
>
> * We have a research showcase presentation on the above, if that
> helps: First presentation, goes for ~30min
> https://www.youtube.com/watch?v=ACevHs0sMMw
>
> * The code for removing cycles is at
> https://github.com/epfl-dlab/GraphCyclesRemoval
>
> * The code for the pruning method is at
> https://github.com/epfl-dlab/WCNPruning
>
> * We have done a (silent;) release of the data-set of the paper at
> https://figshare.com/articles/Structuring_Wikipedia_Articles_with_Section_Recommendations/6157583 .
>
> If you want the already cleaned category network in the form of DAGs based on
> a snapshot in 2017 (and if it's not already in these links, I'm blanking
> now), we should be able to send it your way. Just say it.
>
> If the category prediction becomes too hairy and if you have more than a
> week's time left, ;) ping and I'd be happy to brainstorm about what other
> questions you can consider. (One thing that comes to mind is:
> characterizing articles, let's say in English Wikipedia, that have not been
> read often in the past six months, and if you have time, contrasting them
> with those that have been read often.)
>
>> Also, thank you very much for taking the time to respond to me!
>
> No worries. :)
>
> Good luck! This class of yours sounds really exciting.
>
> Leila
>
>>
>> Thank you,
>> Jeff Levesque
>>
>> -----Original Message-----
>> From: Leila Zia <[email protected]>
>> Sent: Wednesday, May 23, 2018 7:34 PM
>> To: Jeffrey Levesque <[email protected]>
>> Cc: Wikimedia Answers <[email protected]>; A mailing list for the
>> Analytics Team at WMF and everybody who has an interest in Wikipedia
>> and analytics.
>> <[email protected]>
>> Subject: Re: Jeff Levesque: List of Articles By Categories (College
>> Project)
>>
>> + Analytics, our public analytics related mailing list [1]
>>
>> Hi Jeff,
>>
>> Let me give it a try:
>>
>> * Re pageviews: a lot has changed since the Kaggle contest days you
>> refer to. :) I highly recommend you check out
>> https://dumps.wikimedia.org/other/pagecounts-ez/ where our hourly
>> pageviews per article live. In case you need it, abbreviations used in
>> the file names are documented. [2]
>>
>> * Can you expand more what you are trying to do? The short answer for your
>> category related question is that you have to parse XML dumps, but we may
>> have some good pointers for you to save you from that. If you tell us more,
>> we're more likely to be able to help.
>>
>> * And, if you decide to continue research on Wiki(m|p)edia data (which
>> I hope you do:), consider signing up in our public research list at
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>>
>> Best,
>> Leila
>>
>> [1] https://lists.wikimedia.org/mailman/listinfo/analytics
>> [2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews
>>
>> --
>> Leila Zia
>> Senior Research Scientist, Lead
>> Wikimedia Foundation
>>
>>
>> On Wed, May 23, 2018 at 3:22 PM, Wikimedia Answers <[email protected]>
>> wrote:
>>> Forwarding for your evaluation :) Feel free to include the wider
>>> Research team.
>>>
>>> best,
>>> Joe
>>>
>>> ---------- Forwarded message ----------
>>> From: Jeffrey Levesque <[email protected]>
>>> Date: Tue, May 22, 2018 at 7:48 AM
>>> Subject: Re: Jeff Levesque: List of Articles By Categories (College
>>> Project)
>>> To: "[email protected]" <[email protected]>
>>> Cc: "[email protected]" <[email protected]>
>>>
>>>
>>> Hi,
>>> Is there a known API, where I can supply the article name, and attain
>>> the corresponding "category" the article belongs to?
>>> I'm thinking I could write a Python script and iterate over the Kaggle
>>> dataset, then send some POST request to hopefully some existing API, to
>>> determine the article's "category".
>>>
>>> Thank you,
>>>
>>> Jeff Levesque
>>> https://github.com/jeff1evesque
>>>
>>> On May 22, 2018, at 10:37 AM, Jeffrey Levesque <[email protected]> wrote:
>>>
>>> Hi,
>>> Do you guys have a more recent time series of Wikipedia article
>>> traffic? I'm noticing that the Kaggle dataset does not have a lot of
>>> articles that are on Wikipedia. Do you guys have a good idea of how I
>>> can categorize the dataset I have?
>>>
>>> Thank you,
>>>
>>> Jeff Levesque
>>> https://github.com/jeff1evesque
>>>
>>> On May 22, 2018, at 8:40 AM, Jeffrey Levesque <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I am a master's student at Syracuse University. For my data science
>>> class, I am doing a project trying to analyze traffic patterns for
>>> Wikipedia. I've obtained the Kaggle dataset for 2015-2016 data:
>>>
>>> https://www.kaggle.com/headsortails/wiki-traffic-forecast-exploration-wtf-eda/data
>>>
>>> However, the dataset only provides the frequency of visits to
>>> particular pages on a given day. Could I request a list of
>>> articles grouped by "Categories"? I've tried to use the API (i.e.
>>> https://en.wikipedia.org/wiki/Special:Export). But, that doesn't seem
>>> to generate a full output. Additionally, in the list it supplies
>>> subcategories.
>>> So, I tried using the URL API (i.e.
>>> https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics&format=json).
>>> But, that also seems to return an even shorter result set:
>>>
>>> {"batchcomplete":"","continue":{"cmcontinue":"page|2d2941313f2b292d3d0447454f31434f39293f011701dc16|55503653","continue":"-||"},
>>> "query":{"categorymembers":[
>>> {"pageid":22939,"ns":0,"title":"Physics"},
>>> {"pageid":24489,"ns":0,"title":"Outline of physics"},
>>> {"pageid":3445246,"ns":0,"title":"Glossary of classical physics"},
>>> {"pageid":1653925,"ns":100,"title":"Portal:Physics"},
>>> {"pageid":50926902,"ns":0,"title":"Action angle coordinates"},
>>> {"pageid":9079863,"ns":0,"title":"Aerometer"},
>>> {"pageid":52657328,"ns":0,"title":"Bayesian model of computational anatomy"},
>>> {"pageid":49342572,"ns":0,"title":"Group actions in computational anatomy"},
>>> {"pageid":50724262,"ns":0,"title":"Blasius\u2013Chaplygin formula"},
>>> {"pageid":33327002,"ns":0,"title":"Cabbeling"}
>>> ]}}
>>>
>>> Thank you,
>>>
>>> Jeff Levesque
>>>
>>> (603) 969-5363
>>>
>>>

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
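
A note on the short categorymembers result quoted in Jeff's first message: the response contains ten members plus a "continue" object with a cmcontinue token, which is how the MediaWiki API signals that more results are available (ten is the default page size for this list). Below is a minimal Python sketch of following that token until the category is exhausted. It is an illustration under stated assumptions, not something anyone in the thread ran: the helper names fetch_json and iter_category_members, the placeholder User-Agent string, and the choice of cmlimit=500 are all made up for this example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the MediaWiki API and parse the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent (an assumption); replace with your own contact info.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-demo/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_category_members(category, fetch=fetch_json, limit=500):
    """Yield every member of `category`, following `continue` tokens.

    `fetch` is injectable so the paging logic can be tried without a network.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    while True:
        data = fetch(params)
        yield from data["query"]["categorymembers"]
        if "continue" not in data:
            break  # no continuation object: this was the last page
        # Copy the continuation tokens (cmcontinue, continue) into the next request.
        params = {**params, **data["continue"]}
```

Calling list(iter_category_members("Category:Physics")) would then return the direct members of that category only. Entries with "ns": 14 are subcategories, so collecting articles further down the tree means repeating the call for each of them, which is where the cycle-free category graph mentioned at the top of the thread becomes useful.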
