On 04/03/2014 00:01, Manuel Schneider wrote:
> On 02.03.2014 11:08, Emmanuel Engelhart wrote:
>> On 02/03/2014 01:33, Samuel Klein wrote:
>>> Brilliant. Congrats to everyone who is working on this!
>>> What is needed to scrape categories?
>>
>> 0 - For all dumped pages (so at least NS_MAIN and NS_CATEGORY pages),
>> download the list of categories they belong to (with the MW API).
>> 1 - For each dumped page, implement the HTML rendering of the category
>> list at the bottom.
>> 2 - For each category page, get the HTML rendering of the content from
>> Parsoid, then compute and render sorted lists of articles and
>> sub-categories in a similar fashion to the online version (with
>> multiple pages if necessary).
>>
>> All of this must be integrated in the Node.js script, and the category
>> graph must be stored in Redis.
>
> What about the internal structure inside ZIM, which uses category pages
> (like in the wiki) for the text and a list of pointers to the pages
> inside the ZIM file to implement the category?
>
> http://openzim.org/wiki/Category_Handling
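[Editor's note: step 0 of the list above could be sketched as follows in Node.js. This is a minimal illustration of querying the MediaWiki API with prop=categories and turning the response into page-to-category edges; the function names and the exact script structure are assumptions, not the actual Kiwix code.]

```javascript
// Sketch of step 0: for a batch of dumped pages, build the MediaWiki API
// query that lists the categories they belong to, and extract the
// page -> categories edges from the JSON response. Illustrative only.

// Build the API query URL for a batch of page titles.
function buildCategoryQueryUrl(apiBase, titles) {
    var params = [
        'action=query',
        'prop=categories',
        'clshow=!hidden',   // skip hidden maintenance categories
        'cllimit=max',
        'format=json',
        'titles=' + encodeURIComponent(titles.join('|'))
    ];
    return apiBase + '?' + params.join('&');
}

// Extract a { pageTitle: [categoryTitles] } map from the API response,
// i.e. one slice of the category graph's edge list.
function extractCategoryEdges(apiResponse) {
    var edges = {};
    var pages = (apiResponse.query && apiResponse.query.pages) || {};
    Object.keys(pages).forEach(function (pageId) {
        var page = pages[pageId];
        edges[page.title] = (page.categories || []).map(function (cat) {
            return cat.title;
        });
    });
    return edges;
}
```

Batching titles with `|` keeps the number of API round-trips low, which matters when every dumped page needs its category list.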
I'm not sure I understand your question 100%, but it is necessary to store the category graph as a hash table before compiling everything into a ZIM file. That's why I mentioned Redis. In addition (though this is not required to enjoy the categories), it would be great to do the normalisation & implementation work to store the category graph in a structured manner and avoid storing the lists as HTML pages. This is still on our roadmap.

Emmanuel

-- 
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication
_______________________________________________
Offline-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/offline-l
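[Editor's note: the hash table mentioned above could look like the following sketch: the page-to-category edges are inverted into a category-to-members map (the structure that would live in Redis, e.g. one SADD per edge, before the ZIM compilation pass), and each category page's member list is then split and sorted like the online version. All names here are illustrative assumptions, not the actual Kiwix implementation.]

```javascript
// Invert { pageTitle: [categoryTitles] } edges into the category graph:
// a hash table mapping each category to the list of its members.
function buildCategoryGraph(pageCategories) {
    var graph = {};
    Object.keys(pageCategories).forEach(function (pageTitle) {
        pageCategories[pageTitle].forEach(function (catTitle) {
            if (!graph[catTitle]) graph[catTitle] = [];
            graph[catTitle].push(pageTitle);
        });
    });
    return graph;
}

// For one category page, produce the sorted lists of sub-categories and
// articles, in a similar fashion to the online category pages.
function categoryPageLists(graph, catTitle) {
    var members = (graph[catTitle] || []).slice().sort();
    return {
        subCategories: members.filter(function (t) {
            return t.indexOf('Category:') === 0;
        }),
        articles: members.filter(function (t) {
            return t.indexOf('Category:') !== 0;
        })
    };
}
```

Keeping the whole graph in a key-value store rather than in rendered HTML is exactly the normalisation step the reply alludes to: the sorted lists can then be regenerated (and paginated) at ZIM build time.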
