We also have an experimental set of Parsoid HTML dumps available at
http://dumps.wikimedia.org/htmldumps/dumps/. This is currently a one-off
run, but I do hope that we will be able to run this once per week. Please
see https://phabricator.wikimedia.org/T93396 for more information &
feedback.

Gabriel

On Thu, Jul 30, 2015 at 10:55 AM, Emmanuel Engelhart <[email protected]>
wrote:

> Dear Gordon
>
> On 25.07.2015 01:38, Gordon Mohr wrote:
>
>> The 2015-05 enwiki nopic dump is a great resource for getting bulk
>> article text – much better in my experience than using scripts that
>> try to strip it out of XML dumps, or wrestling with a full MW+Parsoid
>> system.
>>
>
> Thank you. Do you use it for research purposes?
>
>> I see threads from earlier in the year that the goal is monthly ZIM dumps.
>>
>> Any projections for when that might be achieved, or perhaps just when
>> the process that succeeded in creating the 2015-05 dump(s) might be
>> repeated as another one-off?
>>
>
> Fixing that problem is my top priority, and we are getting better and
> better, as you can see for yourself at http://download.kiwix.org/zim/.
> Unfortunately, we have limited hardware resources, and the software we
> use to generate these snapshots (mwoffliner) is still a little buggy.
>
> WPEN is the "worst" snapshot to generate, so it is also the one that
> suffers the most from these problems.
>
> That said, I think we will achieve full monthly updates in the coming
> months, and I plan a new snapshot of WPEN in August in any case.
>
> Kind regards
> Emmanuel
>
> --
> Kiwix - Wikipedia Offline & more
> * Web: http://www.kiwix.org
> * Twitter: https://twitter.com/KiwixOffline
> * more: http://www.kiwix.org/wiki/Communication
>
> _______________________________________________
> Offline-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/offline-l
>



-- 
Gabriel Wicke
Principal Engineer, Wikimedia Foundation
