On Fri 20/11/09 11:11, "Andy Rabagliati" [email protected] wrote:
> On Mon, 16 Nov 2009, Emmanuel Engelhart wrote:
>
> > > These indexes http://ai.cs.utsa.edu/wikipedia0.7/ seem to have been
> > > built using categories.
> >
> > This dump is one I built (maybe extracted from the ZIM)... but
> > slightly modified. This is a pretty interesting URL; it would be
> > great to know exactly how the developer behind it did this... maybe
> > you would be able to do the same.
>
> This is his explanation :-
>
> This collection of articles is called "beta2" because they were
> extracted from wikipedia_en_wp1_0.7_30000+_05_2009_beta2.zim at
> tmp.kiwix.org. My distribution has different file names for all the
> articles based on their titles and has a title search capability that
> only depends on Javascript. Below is some explanation of the steps I
> performed, not necessarily done in this order.
>
> 1. I extracted the articles from the zim file by downloading,
> compiling and running zimDump from openzim.org. Compiling zimDump is
> nontrivial because it involves downloading and compiling other
> packages in versions that will work together.
>
> 2. I created three lists with Perl scripts and manual cleaning
> afterwards.
>
> A. The first list was a list of all articles: zim file name, UTF8
> title, and ASCII title.
>
> B. The second list was a list of all zim files (articles, images and
> other files, Javascript and CSS): zim file name and target file name
> in my distribution.
>
> C. The third list was a list for redirecting one zim file name to
> another. The zim dump creates a lot of empty files in the A
> subdirectory (A contains all the articles). It turns out that each
> of them needs to be redirected to another article. The redirects
> can be determined by downloading and running the zimReader program
> for Linux, which can be found at openzim.org.
>
> There appear to be a few duplicate articles (none were deleted), which
> I list below (in ASCII) for anyone who is interested:
>
> Abu Rayhan Biruni
> 'Alawi
> Battle of Mohacs
> Beer-Lambert law
> Charismatic movement
> Elian Gonzalez affair
> Ismail Enver
> Ismet Inonu
> Istiklal Marsi
> Izmir Province
> Wikipedia:0.7/0.7geo/Leopold
> Macapa or Macapai
> Maceia or Maceio
> Nicole Vaidisova
> PRIDE Fighting Championships
> War in Afghanistan (2001-present)
>
> 3. I used a Perl script to copy all the files from the zim dump to a
> staging area, modifying the links along the way. There are many, many
> dead image links (26314 in my count); I changed those links to empty
> strings. There are also some dead article links, most of them
> correspond to dead image links, but a few of them should have been
> redirected; they got added to my third list above. Here are all the
> dead article links and any appropriate redirect for anyone who is
> interested.
>
> A/5ISM  ignore
> A/35A   A/D6N
> A/53Z   A/CWO
> A/5J03  ignore
> A/5J55  ignore
> A/9XO   A/HQO
> A/APD   A/A35
> A/D07   A/9PW
> A/F5G   A/163K
> A/PRL   ignore
> A/TKV   A/S4X
> A/TR4   ignore
> A/VBB   ignore
> A/ZM2   ignore
> A/2QE6  ignore
> A/T3B   A/NQR
> A/11B0  ignore
> A/1NN2  ignore
> A/5ISU  A/4O
> A/5JAZ  ignore
> A/5IV3  ignore
> A/5IXU  ignore
> A/102Z  ignore
> A/1QTM  ignore
> A/5IOB  ignore
> A/5J51  ignore
> A/5IP6  ignore
> A/5JBO  ignore
> A/Y91   ignore
>
> 4. In addition to changing links, I made a few other changes. Each
> article now has a search box for title search. I took some existing
> GPLed Javascript (JSE search engine) and made extensive modifications
> for this application. It only searches the titles; there is no
> keyword index, and there is no text search. The motto of the code is
> "Linear Search FTW". It is surprisingly snappy, though in hindsight,
> searching 30000 titles is not a lot for a computer to do. The results
> page is functional, but otherwise not too exciting.
>
> I changed the titles of the index pages to something less geeky, e.g.,
> "Topical Index: Wikipedia" for the topic index page on Wikipedia. I
> also changed a number of incorrect links to the topical index page
> into links to the alphabetical index page.
>
> Enjoy,
>
> Tom Bylander
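
The title search in step 4 comes down to a linear scan over the list of titles. The sketch below illustrates that idea in Python; it is not Tom's modified JSE Javascript, which is not reproduced in this thread. The function name `search_titles` and the matching rule (every query word must occur in the title, case-insensitively) are assumptions made for illustration, and the sample titles are taken from the duplicate list above.

```python
def search_titles(titles, query):
    """Linear scan: return titles that contain every word of the query.

    Matching is case-insensitive substring matching, which is an
    assumption of this sketch and not necessarily what JSE does.
    """
    words = query.lower().split()
    if not words:
        return []  # an empty query matches nothing
    return [t for t in titles if all(w in t.lower() for w in words)]


# A few titles from the duplicate list, used only as sample data.
TITLES = [
    "Battle of Mohacs",
    "Beer-Lambert law",
    "Charismatic movement",
    "Izmir Province",
]

if __name__ == "__main__":
    for title in search_titles(TITLES, "law"):
        print(title)
```

With no index to build or ship, the only data needed is the title list itself, and each query costs one pass over about 30000 short strings.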
Thank you very much, Andy, for forwarding this email; it is really
interesting. It confirms a few things:

* wikipedia_en_wp1_0.7_30000+_05_2009_beta2 is only a beta and should be
  improved; I will do that soon.
* I need to write a Perl script that checks many things in an HTML
  directory before building the ZIM
  (https://sourceforge.net/tracker/?func=detail&aid=2901059&group_id=175508&atid=873518).
* I think we need a similar tool (zim-check?), written in C++, to run the
  same checks on ZIM files. I see this as quite necessary if we want to
  set up a ZIM library: nobody wants us to distribute poor-quality ZIM
  files. http://bugs.openzim.org/show_bug.cgi?id=14

As soon as I have finished with that, I will publish a new version of
the ZIM file and contact Tom.

Regards
Emmanuel
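
One of the checks such a script could run on an HTML directory is the dead-link check that Tom did by hand. The following is a minimal Python sketch of that single check, not the Perl script from the tracker item. The names `LinkCollector` and `find_dead_links` are hypothetical, and the sketch assumes that only local `href` and `src` targets matter and that a link is dead when no file exists at the resolved path.

```python
import sys
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlsplit


class LinkCollector(HTMLParser):
    """Collect href and src attribute values from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def find_dead_links(root):
    """Return sorted (page, link) pairs whose local target file is missing."""
    root = Path(root)
    dead = []
    for page in root.rglob("*.html"):
        collector = LinkCollector()
        collector.feed(page.read_text(encoding="utf-8", errors="replace"))
        for link in collector.links:
            parts = urlsplit(link)
            # Skip external URLs (http:, mailto:, ...) and in-page anchors.
            if parts.scheme or parts.netloc or not parts.path:
                continue
            path = unquote(parts.path)
            if path.startswith("/"):
                # Assumption: absolute paths are relative to the directory root.
                target = root / path.lstrip("/")
            else:
                target = page.parent / path
            if not target.is_file():
                dead.append((page.relative_to(root).as_posix(), link))
    return sorted(dead)


if __name__ == "__main__":
    # Usage: python check_links.py <html-directory>
    for page, link in find_dead_links(sys.argv[1]):
        print(f"{page}: dead link {link}")
```

A report like this would make the counts Tom mentions (dead image links, dead article links) visible before the ZIM file is built; redirects and empty article files would need separate checks.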
