On 02.03.2014 11:08, Emmanuel Engelhart wrote:
> On 02/03/2014 01:33, Samuel Klein wrote:
>> Brilliant. Congrats to everyone who is working on this!
>> What is needed to scrape categories?
>
> 0 - For all dumped pages (so at least NS_MAIN and NS_CATEGORY pages),
> download the list of categories they belong to (with the MW API).
> 1 - For each dumped page, render the HTML of the category list at the
> bottom of the page.
> 2 - For each category page, get the HTML content rendering from
> Parsoid, then compute and render sorted lists of member articles and
> sub-categories, in a fashion similar to the online version (with
> multiple pages if necessary).
>
> All of this must be integrated into the nodejs script, and the
> category graph must be stored in Redis.
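For illustration, step 0 plus the Redis storage could look roughly like
the following in a nodejs script. This is a minimal sketch, not the
actual script: it assumes Node 18+ (for the built-in fetch) and the
node-redis v4 client, and the key scheme and example titles are
placeholders. API continuation is omitted for brevity.

/* Step 0 sketch: for each dumped page, fetch the categories it belongs
 * to via the MW API, and store the category graph in Redis as sets. */
const { createClient } = require('redis');

const API = 'https://en.wikipedia.org/w/api.php';

async function fetchCategories(title) {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'categories',
    titles: title,
    clshow: '!hidden',   // skip hidden maintenance categories
    cllimit: 'max',
    format: 'json',
  });
  const res = await fetch(`${API}?${params}`);
  const data = await res.json();
  const page = Object.values(data.query.pages)[0];
  return (page.categories || []).map(c => c.title);
}

async function main() {
  const redis = createClient();
  await redis.connect();

  // Placeholder titles; the real script would iterate over all dumped pages.
  const dumpedPages = ['Zürich', 'Category:Cities in Switzerland'];
  for (const title of dumpedPages) {
    for (const cat of await fetchCategories(title)) {
      // Forward edge: page -> its categories (for the list at the bottom).
      await redis.sAdd(`cat:of:${title}`, cat);
      // Reverse edge: category -> its members (to render category pages).
      await redis.sAdd(`cat:members:${cat}`, title);
    }
  }
  await redis.quit();
}

main().catch(console.error);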
What about the internal structure inside the ZIM file, which uses
category pages (as on the wiki) for the text, plus a list of pointers
to the member pages inside the ZIM file, to implement each category?
http://openzim.org/wiki/Category_Handling

/Manuel

--
Wikimedia CH - Verein zur Förderung Freien Wissens
Lausanne, +41 (21) 34066-22 - www.wikimedia.ch
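The idea behind that page, roughly: a category entry carries its own
rendered text (as on the wiki) plus a list of pointers, i.e. entry
indices, to its member articles in the ZIM file. A minimal sketch of
that shape, illustrative only and not the actual ZIM binary layout:

/* Conceptual shape of a category entry per the Category_Handling page:
 * the page's own text plus pointers into the ZIM entry table. */
const categoryEntry = {
  title: 'Category:Cities in Switzerland',
  html: '<p>Cities located in Switzerland.</p>', // the category page text
  memberPointers: [42, 107, 3981], // indices of member entries in the ZIM file
};

// A reader resolves the pointers against the entry table to list members.
function listMembers(zimEntries, category) {
  return category.memberPointers.map(i => zimEntries[i].title);
}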
