On 02.03.2014 11:08, Emmanuel Engelhart wrote:
> On 02/03/2014 01:33, Samuel Klein wrote:
>> Brilliant. Congrats to everyone who is working on this!
>> What is needed to scrape categories?
>
> 0 - For all dumped pages (so at least NS_MAIN and NS_CATEGORY pages),
> download the list of categories they belong to (with the MW API).
> 1 - For each dumped page, render the HTML of the category list at the
> bottom of the page.
> 2 - For each category page, get the HTML content rendering from
> Parsoid, then compute and render sorted lists of member articles and
> sub-categories, in a fashion similar to the online version (with
> multiple pages if necessary).
>
> All of this must be integrated into the nodejs script, and the
> category graph must be stored in Redis.
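For illustration, step 0 plus the Redis storage could look roughly like
the following in a nodejs script. This is a minimal sketch, not the
actual script: it assumes Node 18+ (for the built-in fetch) and the
node-redis v4 client, and the key scheme and example titles are
placeholders. API continuation is omitted for brevity.

/* Step 0 sketch: for each dumped page, fetch the categories it belongs
 * to via the MW API, and store the category graph in Redis as sets. */
const { createClient } = require('redis');

const API = 'https://en.wikipedia.org/w/api.php';

async function fetchCategories(title) {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'categories',
    titles: title,
    clshow: '!hidden',   // skip hidden maintenance categories
    cllimit: 'max',
    format: 'json',
  });
  const res = await fetch(`${API}?${params}`);
  const data = await res.json();
  const page = Object.values(data.query.pages)[0];
  return (page.categories || []).map(c => c.title);
}

async function main() {
  const redis = createClient();
  await redis.connect();

  // Placeholder titles; the real script would iterate over all dumped pages.
  const dumpedPages = ['Zürich', 'Category:Cities in Switzerland'];
  for (const title of dumpedPages) {
    for (const cat of await fetchCategories(title)) {
      // Forward edge: page -> its categories (for the list at the bottom).
      await redis.sAdd(`cat:of:${title}`, cat);
      // Reverse edge: category -> its members (to render category pages).
      await redis.sAdd(`cat:members:${cat}`, title);
    }
  }
  await redis.quit();
}

main().catch(console.error);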
What about the internal structure inside the ZIM file, which uses
category pages (as on the wiki) for the text, plus a list of pointers
to the member pages inside the ZIM file, to implement each category?
http://openzim.org/wiki/Category_Handling

/Manuel

--
Wikimedia CH - Verein zur Förderung Freien Wissens
Lausanne, +41 (21) 34066-22 - www.wikimedia.ch
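The idea behind that page, roughly: a category entry carries its own
rendered text (as on the wiki) plus a list of pointers, i.e. entry
indices, to its member articles in the ZIM file. A minimal sketch of
that shape, illustrative only and not the actual ZIM binary layout:

/* Conceptual shape of a category entry per the Category_Handling page:
 * the page's own text plus pointers into the ZIM entry table. */
const categoryEntry = {
  title: 'Category:Cities in Switzerland',
  html: '<p>Cities located in Switzerland.</p>', // the category page text
  memberPointers: [42, 107, 3981], // indices of member entries in the ZIM file
};

// A reader resolves the pointers against the entry table to list members.
function listMembers(zimEntries, category) {
  return category.memberPointers.map(i => zimEntries[i].title);
}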
