I've used the 2015-05 English Wikipedia (WPEN) ZIM dump to get full
text for experiments with topic modeling – specifically the Doc2Vec
("Paragraph Vectors") algorithm available in the open-source Python
library 'gensim'.

I'll also likely use it to get test/seed text (mostly article
abstracts) for my "wiki reference in tiny chunks" project, Thunkpedia.
(For prior iterations, I've used either DBpedia long-abstracts or bulk
scraping of the HTTP APIs, but I expect grabbing the first sections of
zim-dump articles will dominate those options in every relevant
dimension.)

There are a lot of hackish scripts floating around for coercing text
from XML dumps, but since the zim dump already has
semantically-significant templates expanded, it could (and probably
should) be the preferred text source for many projects. My code to
iterate over (or dump) article plain text will be on GitHub at some
point.
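
For illustration only, here is a minimal sketch of what pulling the
first section's plain text out of one article might look like. It
assumes the article has already been read out of the ZIM file as an
HTML string (that step is not shown), and that the lead section ends
at the first <h2>. The names FirstSectionText and first_section_text
are hypothetical and this is not the code mentioned above.

```python
from html.parser import HTMLParser


class FirstSectionText(HTMLParser):
    """Collect visible text up to the first <h2> heading (hypothetical helper)."""

    # Assumption: these elements are noise for abstract-style text
    # (page title, infobox tables, footnote markers, scripts, styles).
    SKIP = {"h1", "script", "style", "table", "sup"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.done = False     # set once the first <h2> is seen

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.done = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and not self.done and self.skip_depth == 0:
            self.parts.append("\n")  # paragraph boundary

    def handle_data(self, data):
        if not self.done and self.skip_depth == 0:
            self.parts.append(data)


def first_section_text(html):
    """Return the lead-section text of one article, one paragraph per line."""
    parser = FirstSectionText()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each paragraph, drop empty ones.
    paragraphs = [" ".join(p.split()) for p in text.split("\n")]
    return "\n".join(p for p in paragraphs if p)
```

Calling first_section_text(article_html) on each article in turn would
give one abstract-like chunk of text per article; real dump HTML will
likely need more elements added to the skip list.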

I see other (so far non-en) 2015-08 dumps starting to appear, so I'm
looking forward to the WPEN biggie whenever it arrives. Thanks!

- Gordon

On Thu, Jul 30, 2015 at 10:55 AM, Emmanuel Engelhart <[email protected]> wrote:
> Dear Gordon
>
> On 25.07.2015 01:38, Gordon Mohr wrote:
>>
>> The 2015-05 enwiki nopic dump is a great resource for getting bulk
>> article text – much better in my experience than using scripts that
>> try to strip it out of XML dumps, or wrestling with a full MW+Parsoid
>> system.
>
>
> Thank you. You use it for a research purpose?
>
>> I see threads from earlier in the year that the goal is monthly ZIM dumps.
>>
>> Any projections for when that might be achieved, or perhaps just when
>> the process that succeeded in creating the 2015-05 dump(s) might be
>> repeated as another one-off?
>
>
> Fixing that problem is my top-priority and we are getting better and better.
> Something you can see by yourself if you look at
> http://download.kiwix.org/zim/. Unfortunately we deal with limited hardware
> resources and the software solution to do these snapshots (mwoffliner) is
> still a little bit buggy.
>
> WPEN being the "worst" snapshot to generate, it is also the one which
> suffers the most from these problems.
>
> That said, I think we will achieve full monthly updates in the next months
> and I plan a new snapshot of WPEN in August (anyway).
>
> Kind regards
> Emmanuel
>
> --
> Kiwix - Wikipedia Offline & more
> * Web: http://www.kiwix.org
> * Twitter: https://twitter.com/KiwixOffline
> * more: http://www.kiwix.org/wiki/Communication
>
> _______________________________________________
> Offline-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/offline-l
