We also have an experimental set of Parsoid HTML dumps available at http://dumps.wikimedia.org/htmldumps/dumps/. This is currently a one-off run, but I hope we will be able to run it once per week. Please see https://phabricator.wikimedia.org/T93396 for more information and to leave feedback.
Gabriel

On Thu, Jul 30, 2015 at 10:55 AM, Emmanuel Engelhart <[email protected]> wrote:

> Dear Gordon
>
> On 25.07.2015 01:38, Gordon Mohr wrote:
>
>> The 2015-05 enwiki nopic dump is a great resource for getting bulk
>> article text – much better in my experience than using scripts that
>> try to strip it out of XML dumps, or wrestling with a full MW+Parsoid
>> system.
>>
>
> Thank you. You use it for a research purpose?
>
>> I see threads from earlier in the year that the goal is monthly ZIM dumps.
>>
>> Any projections for when that might be achieved, or perhaps just when
>> the process that succeeded in creating the 2015-05 dump(s) might be
>> repeated as another one-off?
>>
>
> Fixing that problem is my top-priority and we are getting better and
> better. Something you can see by yourself if you look at
> http://download.kiwix.org/zim/. Unfortunately we deal with limited
> hardware resources and the software solution to do these snapshots
> (mwoffliner) is still a little bit buggy.
>
> WPEN being the "worse" snapshot to generate, it is also the one which
> suffers the most of these problems.
>
> That said, I think we will achieve full monthly updates in the next months
> and I plan a new snapshot of WPEN in August (anyway).
>
> Kind regards
> Emmanuel
>
> --
> Kiwix - Wikipedia Offline & more
> * Web: http://www.kiwix.org
> * Twitter: https://twitter.com/KiwixOffline
> * more: http://www.kiwix.org/wiki/Communication
>
> _______________________________________________
> Offline-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/offline-l
>

--
Gabriel Wicke
Principal Engineer, Wikimedia Foundation
