On 10/6/2009 18:04, Matthew Toseland wrote:
> On Wednesday 10 June 2009 06:54:03 Daniel Cheng wrote:
>> On Wed, Jun 10, 2009 at 12:02 PM, Evan Daniel<evanbd at gmail.com> wrote:
>>> On my (incomplete) spider index, the index file for the word "the" (it
>>> indexes no other words) is 17MB. This seems rather large. It might
>>> make sense to have the spider not even bother creating an index on a
>>> handful of very common words (the, be, to, of, and, a, in, I, etc).
>>> Of course, this presents the occasional difficulty:
>>> http://bash.org/?514353 I think I'm in favor of not indexing common
>>> words even so.
>> Yes, it should ignore common words.
>> This is called "stopword" in search engine termology.
>>
>>> Also, on a related note, the index splitting policy should be a bit
>>> more sophisticated: in an attempt to fit within the max index size as
>>> configured, it split all the way down to index_8fc42.xml. As a
>>> result, the file index_8fc4b.xml sits all by itself at 3KiB. It
>>> contains the two words "vergessene" and "txjmnsm". I suspect it would
>>> have reliability issues should anyone actually want to search either
>>> of those. It would make more sense to have all of index_8fc4 in one
>>> file, since it would be only trivially larger. (I have a patch that I
>>> thought did that, but it has a bug; I'll test once my indexwriter is
>>> finished writing, since I don't want to interrupt it by reloading the
>>> plugin.)
>> "trivially larger" ...
>> ugh... how trivial is trivial?
>>
>> the xmllibrarian can handle index_8fc42.xml on its own but all other
>> 8fc4 on index_8fc4.xml.
>> however, as i have stated in irc, that make index generation even slower.
>
> Why do the indexes have to have non-overlapping names? Can't we have both
> index_8f and index_8fc42 ? And then when we fetch a term, use the appropriate
> index by going for the one with the longest prefix?
>
We can. In fact, XMLLibrarian already handles this correctly. It is just the spider part that is tricky.
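
To make the longest-prefix idea concrete, here is a minimal sketch of the lookup side only. It is not XMLLibrarian's actual code. It assumes the hex string in an index name (e.g. 8fc42) is a prefix of the MD5 hex digest of the search term, and the function names are made up for illustration.

```python
import hashlib


def term_digest(term):
    # Assumption: index names are derived from the MD5 hex digest of the term.
    return hashlib.md5(term.encode("utf-8")).hexdigest()


def pick_index_for_digest(digest, prefixes):
    """Return the index file name with the longest prefix matching digest.

    prefixes is the collection of hex prefixes for which an index file
    exists, e.g. {"8f", "8fc42"}. Overlapping prefixes are allowed.
    Returns None when no prefix matches.
    """
    best = None
    for prefix in prefixes:
        if digest.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    if best is None:
        return None
    return "index_" + best + ".xml"


def pick_index(term, prefixes):
    # Hypothetical convenience wrapper: hash the term, then do the lookup.
    return pick_index_for_digest(term_digest(term), prefixes)


if __name__ == "__main__":
    available = {"8f", "8fc42"}
    # A digest under 8fc42 goes to the more specific file.
    print(pick_index_for_digest("8fc42aa0", available))
    # Any other digest under 8f falls back to the shorter prefix.
    print(pick_index_for_digest("8fc4b770", available))
```

With both index_8f and index_8fc42 present, a term hashing under 8fc4b lands in index_8f.xml instead of needing a tiny file of its own. The sketch does not address the spider side, i.e. deciding which prefixes to split and writing the files.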