[freenet-dev] Should the spider ignore common words?

Daniel Cheng Wed, 10 Jun 2009 23:02:01 +0800

On 10/6/2009 18:04, Matthew Toseland wrote:
> On Wednesday 10 June 2009 06:54:03 Daniel Cheng wrote:
>> On Wed, Jun 10, 2009 at 12:02 PM, Evan Daniel<evanbd at gmail.com>  wrote:
>>> On my (incomplete) spider index, the index file for the word "the" (it
>>> indexes no other words) is 17MB.  This seems rather large.  It might
>>> make sense to have the spider not even bother creating an index on a
>>> handful of very common words (the, be, to, of, and, a, in, I, etc).
>>> Of course, this presents the occasional difficulty:
>>> http://bash.org/?514353  I think I'm in favor of not indexing common
>>> words even so.
>> Yes, it should ignore common words.
>> These are called "stopwords" in search engine terminology.
>>
>>> Also, on a related note, the index splitting policy should be a bit
>>> more sophisticated: in an attempt to fit within the max index size as
>>> configured, it split all the way down to index_8fc42.xml.  As a
>>> result, the file index_8fc4b.xml sits all by itself at 3KiB.  It
>>> contains the two words "vergessene" and "txjmnsm".  I suspect it would
>>> have reliability issues should anyone actually want to search either
>>> of those.  It would make more sense to have all of index_8fc4 in one
>>> file, since it would be only trivially larger.  (I have a patch that I
>>> thought did that, but it has a bug; I'll test once my indexwriter is
>>> finished writing, since I don't want to interrupt it by reloading the
>>> plugin.)
>> "trivially larger" ...
>> ugh... how trivial is trivial?
>>
>> XMLLibrarian can handle index_8fc42.xml on its own, with all the other
>> 8fc4 words in index_8fc4.xml.
>> However, as I said on IRC, that makes index generation even slower.
>
> Why do the indexes have to have non-overlapping names? Can't we have both 
> index_8f and index_8fc42 ? And then when we fetch a term, use the appropriate 
> index by going for the one with the longest prefix?
>


We can.
In fact, XMLLibrarian handles this correctly.

It is just the spider part that is tricky.
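
For illustration, the longest-prefix lookup on the fetch side could look
roughly like the sketch below. This is not the actual XMLLibrarian code.
The function names pick_index and index_file_for are made up for this
example, and it assumes the hex suffix in names like index_8fc42.xml is a
prefix of the MD5 hash of the search term.

```python
import hashlib


def pick_index(word_hash, available_prefixes):
    """Return the longest prefix in available_prefixes that word_hash
    starts with, or None if no sub-index covers this hash.

    Hypothetical helper, not part of XMLLibrarian.
    """
    # Try the most specific (longest) prefix first, then shorten.
    for length in range(len(word_hash), 0, -1):
        candidate = word_hash[:length]
        if candidate in available_prefixes:
            return candidate
    return None


def index_file_for(word, available_prefixes):
    """Map a search term to the sub-index file that should hold it.

    Assumption: sub-indexes are keyed by a prefix of the term's MD5 hex digest.
    """
    word_hash = hashlib.md5(word.encode("utf-8")).hexdigest()
    prefix = pick_index(word_hash, available_prefixes)
    return None if prefix is None else "index_%s.xml" % prefix


# Overlapping names: a broad index_8f plus a split-out index_8fc42.
prefixes = {"8f", "8fc42"}
specific = pick_index("8fc42d10", prefixes)   # falls in the split-out file
fallback = pick_index("8fc4b7aa", prefixes)   # falls back to the broad file
```

With overlapping names like these, a hash starting with 8fc42 goes to
index_8fc42.xml and everything else under 8f goes to index_8f.xml, so a
3KiB file such as index_8fc4b.xml never needs to exist on its own.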
