Re: [freenet-dev] Should the spider ignore common words?

Daniel Cheng Tue, 09 Jun 2009 22:54:26 -0700

On Wed, Jun 10, 2009 at 12:02 PM, Evan Daniel<[email protected]> wrote:
> On my (incomplete) spider index, the index file for the word "the" (it
> indexes no other words) is 17MB.  This seems rather large.  It might
> make sense to have the spider not even bother creating an index on a
> handful of very common words (the, be, to, of, and, a, in, I, etc).
> Of course, this presents the occasional difficulty:
> http://bash.org/?514353  I think I'm in favor of not indexing common
> words even so.


Yes, it should ignore common words.
These are called "stopwords" in search engine terminology.
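
A minimal sketch of what such a filter could look like, in Python for
brevity rather than in the spider's own code. The names STOPWORDS,
is_stopword and words_to_index are made up for illustration, and the
list only contains the words mentioned in the quoted mail; a real list
would be longer and probably per-language.

```python
# Hypothetical stopword list, seeded with the words from the quoted mail.
STOPWORDS = frozenset({"the", "be", "to", "of", "and", "a", "in", "i"})


def is_stopword(word):
    """Case-insensitive membership test, so "The" and "I" are caught too."""
    return word.lower() in STOPWORDS


def words_to_index(words):
    """Return only the words that should still get an index entry."""
    return [w for w in words if not is_stopword(w)]
```

With this, words_to_index(["The", "spider", "of", "Freenet"]) gives
["spider", "Freenet"]. The check has to run before anything is written
for the word, otherwise the 17MB file still gets generated.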

>
> Also, on a related note, the index splitting policy should be a bit
> more sophisticated: in an attempt to fit within the max index size as
> configured, it split all the way down to index_8fc42.xml.  As a
> result, the file index_8fc4b.xml sits all by itself at 3KiB.  It
> contains the two words "vergessene" and "txjmnsm".  I suspect it would
> have reliability issues should anyone actually want to search either
> of those.  It would make more sense to have all of index_8fc4 in one
> file, since it would be only trivially larger.  (I have a patch that I
> thought did that, but it has a bug; I'll test once my indexwriter is
> finished writing, since I don't want to interrupt it by reloading the
> plugin.)

"trivially larger" ...
ugh... how trivial is trivial?

XMLLibrarian can handle index_8fc42.xml on its own, with all the other
8fc4 entries in index_8fc4.xml.
However, as I said on IRC, that makes index generation even slower.
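
A rough sketch of that layout, in Python for brevity: only the
oversized child prefix gets its own file, and everything else under the
prefix stays in the parent file. This is not the spider's actual
splitting code. plan_files, its parameters and the sizes in the example
are all invented for illustration, and the sketch does not recurse into
a child that is itself still too large.

```python
def plan_files(prefix, child_sizes, max_size):
    """Map file name -> size for one hash prefix.

    child_sizes maps the next hex digit to the size of the entries under
    prefix + digit. Children are peeled off into their own file, largest
    first, until what is left fits in the parent file.
    """
    remaining = dict(child_sizes)
    files = {}
    while remaining and sum(remaining.values()) > max_size:
        # Move the largest remaining child out into its own file.
        digit = max(remaining, key=remaining.get)
        files["index_%s%s.xml" % (prefix, digit)] = remaining.pop(digit)
    if remaining:
        # Everything that was not peeled off shares the parent file.
        files["index_%s.xml" % prefix] = sum(remaining.values())
    return files
```

For example, plan_files("8fc4", {"2": 900, "b": 3, "7": 40}, 500)
returns {"index_8fc42.xml": 900, "index_8fc4.xml": 43}, so the tiny "b"
bucket no longer sits in a file by itself.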

> Evan Daniel
_______________________________________________
Devl mailing list
[email protected]
http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl
