On Wednesday 10 June 2009 06:54:03 Daniel Cheng wrote:
> On Wed, Jun 10, 2009 at 12:02 PM, Evan Daniel<evanbd at gmail.com> wrote:
> > On my (incomplete) spider index, the index file for the word "the" (it
> > indexes no other words) is 17MB. This seems rather large. It might
> > make sense to have the spider not even bother creating an index on a
> > handful of very common words (the, be, to, of, and, a, in, I, etc.).
> > Of course, this presents the occasional difficulty:
> > http://bash.org/?514353 I think I'm in favor of not indexing common
> > words even so.
> 
> Yes, it should ignore common words.
> This is called a "stopword" in search engine terminology.
> 
> >
> > Also, on a related note, the index splitting policy should be a bit
> > more sophisticated: in an attempt to fit within the max index size as
> > configured, it split all the way down to index_8fc42.xml. As a
> > result, the file index_8fc4b.xml sits all by itself at 3KiB. It
> > contains the two words "vergessene" and "txjmnsm". I suspect it would
> > have reliability issues should anyone actually want to search either
> > of those. It would make more sense to have all of index_8fc4 in one
> > file, since it would be only trivially larger. (I have a patch that I
> > thought did that, but it has a bug; I'll test once my indexwriter is
> > finished writing, since I don't want to interrupt it by reloading the
> > plugin.)
> 
> "trivially larger" ...
> ugh... how trivial is trivial?
> 
> The XMLLibrarian can handle index_8fc42.xml on its own, with all other
> 8fc4 entries in index_8fc4.xml.
> However, as I have stated on IRC, that makes index generation even slower.

Why do the indexes have to have non-overlapping names? Can't we have both
index_8f and index_8fc42? Then, when we fetch a term, we use the appropriate
index by going for the one with the longest matching prefix.
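
To make the longest-prefix idea concrete, here is a minimal sketch in Python. It is not the XMLLibrarian code. It assumes that each term is bucketed by the hex MD5 digest of the word and that the client already knows which index prefixes exist; the names term_hash and pick_index are hypothetical and only serve the illustration.

```python
import hashlib
from typing import Optional, Set


def term_hash(term: str) -> str:
    # Assumption: terms are bucketed by the hex MD5 digest of the word.
    return hashlib.md5(term.encode("utf-8")).hexdigest()


def pick_index(digest: str, prefixes: Set[str]) -> Optional[str]:
    """Return the index file whose prefix is the longest match for digest.

    prefixes holds the hex prefixes for which an index file exists,
    e.g. {"8f", "8fc42"}. Overlapping prefixes are allowed.
    """
    # Try the full digest first, then shorten it one character at a time.
    for length in range(len(digest), 0, -1):
        candidate = digest[:length]
        if candidate in prefixes:
            return "index_" + candidate + ".xml"
    # No index file covers this digest.
    return None
```

With prefixes {"8f", "8fc42"}, a digest starting with 8fc42 resolves to index_8fc42.xml, while one starting with 8fc4b falls back to index_8f.xml, so the two rare words would not need a 3KiB file of their own. The lookup does one set probe per digest character, so it costs at most 32 probes for an MD5 digest.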