I've used the 2015-05 WPEN ZIM dump to get full text for experiments
with topic modeling – specifically the Doc2Vec ("Paragraph Vectors")
algorithm available in the open-source Python library 'gensim'.

I'll also likely use it to get test/seed text (mostly article
abstracts) for my "wiki reference in tiny chunks" project, Thunkpedia.
(For prior iterations, I've used either DBpedia long-abstracts or bulk
scraping of the HTTP APIs, but I expect grabbing the first sections of
ZIM-dump articles will dominate those options in every relevant
dimension.)

There are a lot of hackish scripts floating around for coercing text
from XML dumps, but since the ZIM dump already has
semantically-significant templates expanded, it could (and probably
should) be the preferred text source for many projects. My code to
iterate (or dump) article plain-text will be on GitHub at some point.

I see other (so far non-en) 2015-08 dumps starting to appear, so I'm
looking forward to the WPEN biggie whenever it arrives.

Thanks!

- Gordon

On Thu, Jul 30, 2015 at 10:55 AM, Emmanuel Engelhart <[email protected]> wrote:
> Dear Gordon
>
> On 25.07.2015 01:38, Gordon Mohr wrote:
>>
>> The 2015-05 enwiki nopic dump is a great resource for getting bulk
>> article text – much better in my experience than using scripts that
>> try to strip it out of XML dumps, or wrestling with a full MW+Parsoid
>> system.
>
>
> Thank you. You use it for a research purpose?
>
>> I see threads from earlier in the year that the goal is monthly ZIM dumps.
>>
>> Any projections for when that might be achieved, or perhaps just when
>> the process that succeeded in creating the 2015-05 dump(s) might be
>> repeated as another one-off?
>
>
> Fixing that problem is my top-priority and we are getting better and better.
> Something you can see by yourself if you look at
> http://download.kiwix.org/zim/. Unfortunately we deal with limited hardware
> resources and the software solution to do these snapshots (mwoffliner) is
> still a little bit buggy.
>
> WPEN being the "worse" snapshot to generate, it is also the one which
> suffers the most of these problems.
>
> That said, I think we will achieve full monthly updates in the next months
> and I plan a new snapshot of WPEN in August (anyway).
>
> Kind regards
> Emmanuel
>
> --
> Kiwix - Wikipedia Offline & more
> * Web: http://www.kiwix.org
> * Twitter: https://twitter.com/KiwixOffline
> * more: http://www.kiwix.org/wiki/Communication
>
> _______________________________________________
> Offline-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/offline-l
