On Wed, Jun 10, 2009 at 1:54 AM, Daniel Cheng<j16sdiz+free...@gmail.com> wrote:
> On Wed, Jun 10, 2009 at 12:02 PM, Evan Daniel<eva...@gmail.com> wrote:
>> On my (incomplete) spider index, the index file for the word "the" (it
>> indexes no other words) is 17MB.  This seems rather large.  It might
>> make sense to have the spider not even bother creating an index on a
>> handful of very common words (the, be, to, of, and, a, in, I, etc).
>> Of course, this presents the occasional difficulty:
>> http://bash.org/?514353  I think I'm in favor of not indexing common
>> words even so.
>
> Yes, it should ignore common words.
> These are called "stopwords" in search engine terminology.
>
>>
>> Also, on a related note, the index splitting policy should be a bit
>> more sophisticated: in an attempt to fit within the max index size as
>> configured, it split all the way down to index_8fc42.xml.  As a
>> result, the file index_8fc4b.xml sits all by itself at 3KiB.  It
>> contains the two words "vergessene" and "txjmnsm".  I suspect it would
>> have reliability issues should anyone actually want to search either
>> of those.  It would make more sense to have all of index_8fc4 in one
>> file, since it would be only trivially larger.  (I have a patch that I
>> thought did that, but it has a bug; I'll test once my indexwriter is
>> finished writing, since I don't want to interrupt it by reloading the
>> plugin.)
>
> "trivially larger" ...
> ugh... how trivial is trivial?
>
> the xmllibrarian can handle index_8fc42.xml on its own, with all the
> other 8fc4 words in index_8fc4.xml.
> however, as i have stated in irc, that makes index generation even slower.

8fc42 is 17382 KiB.  All the other 8fc4 sub-indexes are 79 KiB
combined, so a single index_8fc4.xml would be less than 0.5% larger
than index_8fc42.xml is on its own.

Also, not splitting would make index generation faster, not slower.
The spider first does all the work of creating 8fc4, then discards it
to recreate the sub-indexes.  The vast majority of this work is in
8fc42, which gets created twice.  Not splitting the index would nearly
halve the time to create the 8fc4 set of indexes.

Of course, a more efficient algorithm for creating the indexes in the
first place would both make it far faster and make the two approaches
take approximately the same time.

Evan Daniel
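
For illustration only, here is a minimal Python sketch of the arithmetic above and of one possible "don't bother splitting" rule. It is not the spider's actual code and not the patch mentioned earlier. The function name should_split, the 4096 KiB limit, and the assumption that build cost is proportional to bytes written are all hypothetical; only the 17382 KiB and 79 KiB figures come from the message.

```python
# Sizes reported in the message, in KiB.
THE_KIB = 17382     # index_8fc42.xml (the word "the")
OTHERS_KIB = 79     # all other 8fc4 sub-indexes combined

# Hypothetical configured maximum index size; the real value is not
# given in the thread.
MAX_INDEX_KIB = 4096


def should_split(total_kib, largest_word_kib, max_index_kib):
    """Return True if splitting a bucket one hash digit deeper can help.

    Hypothetical rule: one word's entries cannot be spread over several
    sub-indexes, so if everything except the largest word already fits
    within the limit, splitting only yields one oversized file plus a
    few tiny ones.
    """
    if total_kib <= max_index_kib:
        return False  # the bucket already fits
    return total_kib - largest_word_kib > max_index_kib


merged_kib = THE_KIB + OTHERS_KIB           # size of an unsplit index_8fc4
overhead = OTHERS_KIB / THE_KIB             # growth relative to 8fc42 alone

# Assumed cost model: work is proportional to KiB written. With
# splitting, 8fc4 is built once, discarded, then rebuilt as sub-indexes.
split_work = merged_kib + merged_kib
unsplit_work = merged_kib

print("unsplit 8fc4 size (KiB):", merged_kib)
print("overhead vs 8fc42 alone: {:.2%}".format(overhead))
print("unsplit work / split work:", unsplit_work / split_work)
print("split 8fc4?", should_split(merged_kib, THE_KIB, MAX_INDEX_KIB))
```

Under this simplified cost model the unsplit build does exactly half the work. Real runs carry other overhead, which is consistent with "nearly halve" rather than exactly halve. The rule refuses to split 8fc4 because the 79 KiB left over after setting aside "the" fits easily within any plausible limit.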