Brilliant.  Congrats to everyone who is working on this!
What is needed to scrape categories?
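
(For concreteness: the raw category data is exposed by the standard Mediawiki
API, e.g. action=query&prop=categories. Below is a minimal nodejs sketch of
such a fetch; the function name is purely illustrative, and wiring it into
mwoffliner is of course the actual work.)

  // Minimal sketch: fetch the categories of one article through the standard
  // Mediawiki API (action=query&prop=categories). The API call is real; the
  // function name and error handling are only illustrative.
  var https = require('https');

  function fetchCategories(title, callback) {
    var url = 'https://en.wikipedia.org/w/api.php' +
              '?action=query&prop=categories&cllimit=max&format=json' +
              '&titles=' + encodeURIComponent(title);
    https.get(url, function (res) {
      var body = '';
      res.setEncoding('utf8');
      res.on('data', function (chunk) { body += chunk; });
      res.on('end', function () {
        var pages = JSON.parse(body).query.pages;
        var categories = [];
        Object.keys(pages).forEach(function (id) {
          (pages[id].categories || []).forEach(function (c) {
            categories.push(c.title);
          });
        });
        callback(null, categories);
      });
    }).on('error', callback);
  }

  // Example: print the categories of one article.
  fetchCategories('Raspberry Pi', function (err, categories) {
    if (err) throw err;
    console.log(categories.join('\n'));
  });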

On Sat, Mar 1, 2014 at 12:01 PM, Emmanuel Engelhart <[email protected]> wrote:
> Hi
>
> For the first time, we have managed to release a complete dump of all
> encyclopedic articles of the English Wikipedia, *with thumbnails*.
>
> This ZIM file is 40 GB in size and contains the current 4.5 million
> articles with their 3.5 million pictures:
> http://download.kiwix.org/zim/wikipedia_en_all.zim.torrent
>
> This ZIM file is directly and easily usable on many types of devices
> like Android smartphones and Win/OSX/Linux PCs with Kiwix, or Symbian
> with Wikionboard.
>
> You don't need a modern computer with a big CPU. For example, you can
> create a (read-only) Wikipedia mirror on a Raspberry Pi for ~100 USD using
> our dedicated ZIM web server, kiwix-serve. A demo is
> available here: http://library.kiwix.org/wikipedia_en_all/
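
(For anyone trying this: serving the file is essentially a single command,
roughly "kiwix-serve --port=8000 wikipedia_en_all.zim", though the exact
option names may vary between kiwix-serve releases.)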
>
> As always, we also provide a packaged version (for the main PC
> platforms) which includes the fulltext search index + ZIM file + binaries:
> http://download.kiwix.org/portable/wikipedia_en_all.zip.torrent
>
> Also interesting: this file was generated in less than 2 weeks thanks to
> multiple recent innovations:
> * The Parsoid (cluster), which gives us HTML output with additional
> semantic RDF tags
> * mwoffliner, a nodejs script able to dump pages based on the Mediawiki
> API (and the Parsoid API)
> * zimwriterfs, a tool able to compile any local HTML directory into a
> ZIM file
>
> We now have an efficient way to generate new ZIM files. Consequently, we
> will work to industrialize and automate the ZIM file generation process,
> which is probably the oldest and most important problem we still face at
> Kiwix.
>
> All this would not have been possible without the support of:
> * Wikimedia CH and the "ZIM autobuild" project
> * Wikimedia France and the Afripedia project
> * Gwicke from the WMF Parsoid dev team.
>
> BTW, we need additional developer help (javascript/nodejs skills) to fix
> a few issues in mwoffliner:
> * Recreate the "table of contents" based on the HTML DOM (*)
> * Scrape the Mediawiki ResourceLoader in a way that it continues to work
> offline (***)
> * Scrape categories (**)
> * Localize the script (*)
> * Improve overall performance by introducing workers (**)
> * Create nodezim, a nodejs binding for libzim, and use it (***, also
> needs compilation and C++ skills)
> * Evaluate the work necessary to merge mwoffliner and the new WMF PDF
> Renderer (***)
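
(On the first item above, rebuilding the table of contents from the page
headings: a very rough nodejs sketch of the idea, assuming the jsdom package
for DOM parsing. Names and structure are purely illustrative, not mwoffliner
code.)

  // Rough sketch only: rebuild a flat table of contents from the headings of
  // a Parsoid HTML page. Assumes the "jsdom" npm package; any other DOM
  // implementation would do just as well.
  var JSDOM = require('jsdom').JSDOM;
  var fs = require('fs');

  function buildToc(html) {
    var document = new JSDOM(html).window.document;
    var headings = document.querySelectorAll('h2, h3, h4');
    var toc = document.createElement('ul');
    Array.prototype.forEach.call(headings, function (h) {
      var li = document.createElement('li');
      var a = document.createElement('a');
      a.textContent = h.textContent;
      a.href = '#' + (h.id || encodeURIComponent(h.textContent));
      li.className = 'toc-level-' + h.tagName.charAt(1); // 2, 3 or 4
      li.appendChild(a);
      toc.appendChild(li);
    });
    return toc.outerHTML;
  }

  // Example: print the generated TOC markup for a locally dumped HTML file.
  console.log(buildToc(fs.readFileSync(process.argv[2], 'utf8')));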
>
> Emmanuel
> --
> Kiwix - Wikipedia Offline & more
> * Web: http://www.kiwix.org
> * Twitter: https://twitter.com/KiwixOffline
> * more: http://www.kiwix.org/wiki/Communication
>
> _______________________________________________
> Offline-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/offline-l



-- 
Samuel Klein          @metasj           w:user:sj          +1 617 529 4266

