Re: [openZIM dev-l] zeno, zim formats

emmanuel Fri, 20 Nov 2009 03:21:09 -0800

 On Fri 20/11/09 11:11, "Andy Rabagliati" [email protected] wrote:
> On Mon, 16 Nov 2009, Emmanuel Engelhart wrote:
> 
> > > These indexes http://ai.cs.utsa.edu/wikipedia0.7/ seem to have been
> > > built using categories.
> > 
> > This dump is one I built (maybe extracted from the ZIM), but slightly
> > modified. This is a pretty interesting URL; it would be great to know
> > exactly how the developer behind it did this. Maybe you would be able
> > to do the same.
> 
> This is his explanation :-
> 
> 
> This collection of articles is called "beta2" because they were
> extracted from wikipedia_en_wp1_0.7_30000+_05_2009_beta2.zim at
> tmp.kiwix.org.  My distribution has different file names for all the
> articles based on their titles and has a title search capability that
> only depends on Javascript.  Below is some explanation of the steps I
> performed, not necessarily done in this order.
> 
> 1. I extracted the articles from the zim file by downloading,
> compiling and running zimDump from openzim.org.  Compiling zimDump is
> nontrivial because it involves downloading and compiling other
> packages in versions that will work together.
> 
> 2. I created three lists with Perl scripts and manual cleaning
> afterwards.  
> 
> A. The first list was a list of all articles: zim file name, UTF8
> title, and ASCII title.
> 
> B. The second list was a list of all zim files (articles, images and
> other files, Javascript and CSS): zim file name and target file name
> in my distribution.
> 
> C. The third list was a list for redirecting one zim file name to
> another.  The zim dump creates a lot of empty files in the A
> subdirectory (A contains all the articles).  It turns out that each
> of them needs to be redirected to another article.  The redirects
> can be determined by downloading and running the zimReader program
> for Linux, which can be found at openzim.org.
> 
> There appear to be a few duplicate articles (none were deleted), which
> I list below (in ASCII) for anyone who is interested:
> 
> Abu Rayhan Biruni
> 'Alawi
> Battle of Mohacs
> Beer-Lambert law
> Charismatic movement
> Elian Gonzalez affair
> Ismail Enver
> Ismet Inonu
> Istiklal Marsi
> Izmir Province
> Wikipedia:0.7/0.7geo/Leopold
> Macapa or Macapai
> Maceia or Maceio
> Nicole Vaidisova
> PRIDE Fighting Championships
> War in Afghanistan (2001-present)
> 
> 3. I used a Perl script to copy all the files from the zim dump to a
> staging area, modifying the links along the way.  There are many, many
> dead image links (26314 in my count); I changed those links to empty
> strings.  There are also some dead article links; most of them
> correspond to dead image links, but a few of them should have been
> redirected, and those got added to my third list above.  Here are all the
> dead article links and any appropriate redirect for anyone who is
> interested.
> 
> A/5ISM  ignore
> A/35A   A/D6N
> A/53Z   A/CWO
> A/5J03  ignore
> A/5J55  ignore
> A/9XO   A/HQO
> A/APD   A/A35
> A/D07   A/9PW
> A/F5G   A/163K
> A/PRL   ignore
> A/TKV   A/S4X
> A/TR4   ignore
> A/VBB   ignore
> A/ZM2   ignore
> A/2QE6  ignore
> A/T3B   A/NQR
> A/11B0  ignore
> A/1NN2  ignore
> A/5ISU  A/4O
> A/5JAZ  ignore
> A/5IV3  ignore
> A/5IXU  ignore
> A/102Z  ignore
> A/1QTM  ignore
> A/5IOB  ignore
> A/5J51  ignore
> A/5IP6  ignore
> A/5JBO  ignore
> A/Y91   ignore
> 
> 4. In addition to changing links, I made a few other changes.  Each
> article now has a search box for title search.  I took some existing
> GPLed Javascript (JSE search engine) and made extensive modifications
> for this application.  It only searches the titles; there is no
> keyword index, and there is no text search.  The motto of the code is
> "Linear Search FTW".  It is surprisingly snappy, though in hindsight,
> searching 30000 titles is not a lot for a computer to do.  The results
> page is functional, but otherwise not too exciting.
> 
> I changed the titles of the index pages to something less geeky, e.g.
> "Topical Index: Wikipedia" for the topic index page on Wikipedia.  I
> also fixed a number of links that incorrectly pointed to the topical
> index page so that they point to the alphabetical index page.
> 
> Enjoy,
> 
> Tom Bylander
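
To make step 3 above concrete for myself, here is a minimal sketch in Python (not Tom's Perl script, which I have not seen). It rewrites href/src attributes using a redirect table and blanks out dead links. The function name, the sample image name, and the choice to blank "ignore" entries the same way as dead image links are my assumptions, and the regular expression only handles double-quoted attributes.

```python
import re

# Excerpt of the redirect table quoted above: dead link -> replacement.
REDIRECTS = {"A/35A": "A/D6N", "A/5ISU": "A/4O"}

# Links with no replacement. "A/PRL" is an "ignore" entry from the table;
# "I/missing.png" is a made-up image name used only for illustration.
DEAD = {"A/PRL", "I/missing.png"}

# Simplification: only double-quoted href/src attributes are matched.
LINK_ATTR = re.compile(r'\b(href|src)="([^"]*)"')


def rewrite_links(html, redirects, dead):
    """Return html with redirected links replaced and dead links emptied."""
    def fix(match):
        attr, target = match.group(1), match.group(2)
        if target in redirects:
            target = redirects[target]
        elif target in dead:
            # Assumption: dead links become empty strings, as Tom
            # describes for dead image links.
            target = ""
        return '%s="%s"' % (attr, target)

    return LINK_ATTR.sub(fix, html)


page = '<a href="A/35A">x</a> <img src="I/missing.png"> <a href="A/PRL">y</a>'
print(rewrite_links(page, REDIRECTS, DEAD))
```

Links that appear in neither table are left untouched, so the same pass can be run over every file copied to the staging area.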

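The title search in step 4 is, as I understand it, a plain linear scan over the list of titles. Tom's implementation is modified JSE Javascript; the sketch below is only an illustration in Python. Matching every query word as a case-insensitive substring is my assumption, not something his message specifies.

```python
def search_titles(titles, query):
    """Return the titles that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    # No keyword index: every title is examined once per search.
    return [t for t in titles if all(w in t.lower() for w in words)]


# A few titles taken from the duplicate list quoted above.
titles = [
    "Battle of Mohacs",
    "Beer-Lambert law",
    "Charismatic movement",
    "Izmir Province",
]
print(search_titles(titles, "law beer"))
```

With about 30000 titles, each search is a single pass over a short list of strings, which fits Tom's remark that it feels snappy.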

Thank you very much, Andy, for forwarding this email; it is really 
interesting.

That confirms a few things:
* wikipedia_en_wp1_0.7_30000+_05_2009_beta2 is only a beta and should be 
improved; I will do that soon.
* I need to write a Perl script to check many things in an HTML directory 
(https://sourceforge.net/tracker/?func=detail&aid=2901059&group_id=175508&atid=873518)
 before building the ZIM.
* I think we need a similar tool (zim-check?), written in C++, to do the 
same with ZIM files. I see it as pretty necessary if we want to set up a ZIM 
library: nobody wants us to spread bad-quality ZIM files. 
http://bugs.openzim.org/show_bug.cgi?id=14
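
As a rough idea of what the HTML directory check could look like, here is a sketch in Python rather than Perl. It only looks for dead local links (a href and img src); the function and class names are hypothetical, links are assumed to be relative to the page that contains them, and the other checks mentioned in the tracker item are not covered.

```python
import os
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class LinkCollector(HTMLParser):
    """Collect the targets of <a href> and <img src> in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            is_link = (tag == "a" and name == "href") or (
                tag == "img" and name == "src"
            )
            if is_link and value:
                self.links.append(value)


def find_dead_links(root):
    """Return (page, link) pairs whose local target file does not exist."""
    dead = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith((".html", ".htm")):
                continue
            page = os.path.join(dirpath, filename)
            collector = LinkCollector()
            with open(page, encoding="utf-8", errors="replace") as handle:
                collector.feed(handle.read())
            for link in collector.links:
                parts = urlsplit(link)
                # Skip external URLs and fragment-only links.
                if parts.scheme or parts.netloc or not parts.path:
                    continue
                # Assumption: links are relative to the containing page.
                target = os.path.normpath(
                    os.path.join(dirpath, unquote(parts.path))
                )
                if not os.path.exists(target):
                    dead.append((page, link))
    return dead
```

Calling find_dead_links on the directory to be packed returns a list that can be reviewed, or fed into a redirect table, before the ZIM is built.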

As soon as I have finished with that, I will publish a new version of 
the ZIM file and contact Tom.

Regards
Emmanuel

_______________________________________________
dev-l mailing list
[email protected]
https://intern.openzim.org/mailman/listinfo/dev-l
