On 04/03/2014 00:01, Manuel Schneider wrote:
> On 02.03.2014 11:08, Emmanuel Engelhart wrote:
>> Le 02/03/2014 01:33, Samuel Klein wrote:
>>> Brilliant.  Congrats to everyone who is working on this!
>>> What is needed to scrape categories?
>>
>> 0 - For all dumped pages (so at least the NS_MAIN and NS_CATEGORY
>> pages), download the list of categories each one belongs to (via the
>> MediaWiki API).
>> 1 - For each dumped page, implement the HTML rendering of the category
>> list at the bottom.
>> 2 - For each category page, get the content HTML rendering from Parsoid,
>> then compute and render sorted lists of articles and sub-categories in a
>> fashion similar to the online version (with multiple pages if necessary).
>>
>> All of this must be integrated into the nodejs script, and the category
>> graph must be stored in Redis.
> 
> what about the internal structure inside ZIM which uses category pages
> (like in the wiki) for the text and a list of pointers to the pages
> inside the ZIM file to implement the category?
> 
> http://openzim.org/wiki/Category_Handling

I'm not sure I fully understand your question, but the category graph has
to be stored as a hash table before everything is compiled into a ZIM
file. That's why I mention Redis.
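To make the idea concrete, here is a minimal Node.js sketch of step 0 and the hash-table storage. It is illustrative only: the endpoint URL, the page titles, and the `buildCategoryQueryUrl`/`storeCategories` helpers are assumptions for the example (the real scraper uses Redis, for which a plain in-memory Map stands in here).

```javascript
// Build a MediaWiki API query URL that returns, for the given pages,
// the list of categories each page belongs to (prop=categories).
// Assumption: the English Wikipedia endpoint; the real script targets
// whichever wiki is being dumped.
function buildCategoryQueryUrl(titles) {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'categories',
    cllimit: 'max',            // as many categories per page as allowed
    titles: titles.join('|'),  // the API accepts pipe-separated titles
    format: 'json'
  });
  return 'https://en.wikipedia.org/w/api.php?' + params.toString();
}

// Category graph as a hash table: page title -> list of its categories.
// An in-memory Map stands in for the Redis hash used by the real script.
const categoryGraph = new Map();

function storeCategories(pageTitle, categories) {
  categoryGraph.set(pageTitle, categories);
}
```

With the graph keyed by page title, rendering the category list at the bottom of a page (step 1) is a single lookup, and inverting the map gives the article lists needed for the category pages (step 2).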

In addition (though this is not required to make the categories usable),
it would be great to do the normalisation and implementation work needed
to store the category graph in a structured manner, rather than keeping
the lists in HTML pages. This is still on our roadmap.

Emmanuel


-- 
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication

_______________________________________________
Offline-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/offline-l
