Re: [freenet-dev] Should the spider ignore common words?

Evan Daniel Tue, 09 Jun 2009 23:07:18 -0700

On Wed, Jun 10, 2009 at 1:54 AM, Daniel Cheng<j16sdiz+free...@gmail.com> wrote:
> On Wed, Jun 10, 2009 at 12:02 PM, Evan Daniel<eva...@gmail.com> wrote:
>> On my (incomplete) spider index, the index file for the word "the" (it
>> indexes no other words) is 17MB.  This seems rather large.  It might
>> make sense to have the spider not even bother creating an index on a
>> handful of very common words (the, be, to, of, and, a, in, I, etc).
>> Of course, this presents the occasional difficulty:
>> http://bash.org/?514353  I think I'm in favor of not indexing common
>> words even so.
>
> Yes, it should ignore common words.
> This is called a "stopword" in search engine terminology.
>
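
A minimal sketch of stopword filtering, not the spider's actual code.
The word list holds only the examples named above, and the names
STOPWORDS and indexable_words are hypothetical:

```python
# Hypothetical stopword list: only the words named in this thread.
# A real list would be longer and probably language-specific.
STOPWORDS = frozenset({"the", "be", "to", "of", "and", "a", "in", "i"})


def indexable_words(text):
    """Return the lowercased, whitespace-separated words of `text`
    that are not stopwords, preserving their order."""
    return [word for word in text.lower().split() if word not in STOPWORDS]
```

The comparison is done in lowercase so that "The" and "I" are skipped
as well.  Punctuation handling is left out of this sketch.
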
>>
>> Also, on a related note, the index splitting policy should be a bit
>> more sophisticated: in an attempt to fit within the max index size as
>> configured, it split all the way down to index_8fc42.xml.  As a
>> result, the file index_8fc4b.xml sits all by itself at 3KiB.  It
>> contains the two words "vergessene" and "txjmnsm".  I suspect it would
>> have reliability issues should anyone actually want to search either
>> of those.  It would make more sense to have all of index_8fc4 in one
>> file, since it would be only trivially larger.  (I have a patch that I
>> thought did that, but it has a bug; I'll test once my indexwriter is
>> finished writing, since I don't want to interrupt it by reloading the
>> plugin.)
>
> "trivially larger" ...
> ugh... how trivial is trivial?
>
> XMLLibrarian can handle index_8fc42.xml on its own, with all other
> 8fc4 entries in index_8fc4.xml.
> However, as I have stated on IRC, that makes index generation even slower.


8fc42 is 17382 KiB.  All other 8fc4 are 79 KiB combined.
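
Put as arithmetic: one combined file would be 17382 + 79 = 17461 KiB,
about 0.45% larger than 8fc42 alone.  A rough sketch of a splitting
rule that takes this into account follows.  It is not the spider's
actual policy: the function name, the max_size_kib parameter and the
10% min_saving threshold are all assumptions made for illustration.

```python
def worth_splitting(child_sizes_kib, max_size_kib, min_saving=0.10):
    """Decide whether to split a parent index into its children.

    child_sizes_kib maps each child index name to its size in KiB.
    Splitting is considered worthwhile only if the parent is over the
    limit AND the largest child would be at least `min_saving`
    (as a fraction) smaller than the parent.
    """
    parent = sum(child_sizes_kib.values())
    if parent <= max_size_kib:
        return False  # already fits, nothing to gain
    largest = max(child_sizes_kib.values())
    return (parent - largest) / parent >= min_saving


# Sizes reported in this message; the key names are just labels.
sizes = {"8fc42": 17382, "other 8fc4": 79}
```

With these sizes the rule returns False for any size limit, because
splitting shrinks the largest file by under half a percent.
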

Also, not splitting would make index generation faster, not slower.
The spider first does all the work of creating 8fc4, then discards it
and recreates the sub-indexes.  The vast majority of that work is in
8fc42, which therefore gets created twice.  Leaving 8fc4 unsplit would
nearly halve the time to create the 8fc4 set of indexes.

Of course, a more efficient algorithm for creating the indexes in the
first place would both make generation far faster and make the split
and unsplit cases take approximately the same time.

Evan Daniel
_______________________________________________
Devl mailing list
Devl@freenetproject.org
http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl
