[freenet-dev] Should the spider ignore common words?

Mike Bush Thu, 11 Jun 2009 13:16:50 +0100

2009/6/10 Daniel Cheng <j16sdiz+freenet at gmail.com>:
> On 10/6/2009 20:42, Mike Bush wrote:
>> 2009/6/10 Evan Daniel<evanbd at gmail.com>:
>>> On Wed, Jun 10, 2009 at 6:49 AM, Mike Bush<mpbush at gmail.com> wrote:
>>>> XMLLibrarian doesn't currently support searching for phrases or rating
>>>> relevance of results based on proximity so I don't think common words
>>>> could be of any use in searches now.
>>>>
>>>> Also, I'm not sure but I think the current index doesn't include words
>>>> under 4 letters at all.
>>> If you read my previous mails, you'll see that the spider is in
>>> fact indexing the word "the".
>>>
>>
>> Yes, sorry. I've since searched for 'who' on wanna and it is there; it
>> gave me an OutOfMemoryException while trying to generate the results page.
>>
>
> You've got it :)
>
> This is yet another reason to split the <site> part out.


I've also built 2 indexes to measure the space saved by separating keys
from words, for an index of > 16000 keys with 256 subindices:

The normal index, with keys integrated in the files: > 400MB
With keys in a separate key index (3MB): 160MB in total

Of course the difference wouldn't be so large if the index wasn't
separated into so many pieces.
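
To illustrate the kind of split being measured, here is a minimal sketch. It
is not the XMLLibrarian on-disk format; the function names and the in-memory
layout are hypothetical. The inline variant repeats the full key string under
every word, while the split variant stores each key once in a key table and
lets the word entries hold small integer ids.

```python
# Hypothetical sketch only: not the XMLLibrarian file format.

def build_inline_index(postings):
    """Map each word to the full key strings of the files containing it."""
    return {word: sorted(keys) for word, keys in postings.items()}


def build_split_index(postings):
    """Store each key once in a key table; word entries hold integer ids."""
    key_ids = {}
    word_index = {}
    for word, keys in postings.items():
        ids = []
        for key in sorted(keys):
            # Assign the next free id the first time a key is seen.
            ids.append(key_ids.setdefault(key, len(key_ids)))
        word_index[word] = ids
    # Invert the id assignment so that key_table[id] gives the key.
    key_table = [None] * len(key_ids)
    for key, key_id in key_ids.items():
        key_table[key_id] = key
    return key_table, word_index


def lookup(key_table, word_index, word):
    """Resolve a word's ids back to full keys."""
    return [key_table[i] for i in word_index.get(word, [])]
```

With many subindices, the same long key string is repeated in many files in
the inline variant, which is where most of the saving would come from.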

One thing I worried about was that the key index would get very
large, but for the key index to be bigger than even one of wanna's
subindexes it would have to contain > 320000 keys. How many keys do very
large indexes have?
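
The estimate above can be reproduced with a rough back-of-the-envelope
calculation. This sketch assumes the key index grows linearly with the number
of keys and that MB means decimal megabytes; both are assumptions, and the
resulting size is only what the figures above imply, not a measured value.

```python
MB = 1_000_000  # assumption: decimal megabytes


def bytes_per_key(key_index_bytes, key_count):
    """Average key index size per key, assuming linear growth."""
    return key_index_bytes / key_count


def projected_key_index_bytes(per_key, key_count):
    """Projected key index size for a given number of keys."""
    return per_key * key_count


# Figures from the test build: a 3MB key index for 16000 keys.
per_key = bytes_per_key(3 * MB, 16_000)
projected = projected_key_index_bytes(per_key, 320_000)
print(per_key, projected / MB)  # prints "187.5 60.0"
```

So under these assumptions a key index of 320000 keys would be about 60MB.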


MikeB


> With that split we could keep only the siteId in memory, not the whole URI,
> before the union.
>
> Even so, I doubt searching for words like "the who" will ever work without
> on-disk temp files.
>
>>> Evan Daniel
>>>
> _______________________________________________
> Devl mailing list
> Devl at freenetproject.org
> http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl
>
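
The suggestion quoted above, keeping only the siteId in memory before the
union, could look roughly like the following sketch. The function names, the
site table and the example URIs are hypothetical and only illustrate the
idea; this is not how XMLLibrarian is implemented.

```python
# Hypothetical sketch: union over integer site ids, resolving URIs last.

def union_site_ids(postings_by_word, words):
    """Union the site ids of all query words; only small integers are held."""
    result = set()
    for word in words:
        result |= postings_by_word.get(word, set())
    return result


def resolve(site_table, site_ids):
    """Turn site ids into full URIs only once the union is finished."""
    return [site_table[i] for i in sorted(site_ids)]


# Example data (made up for illustration).
site_table = ["USK@example-a/site/1/", "USK@example-b/site/1/", "USK@example-c/site/1/"]
postings_by_word = {"the": {0, 1, 2}, "who": {1}}

ids = union_site_ids(postings_by_word, ["the", "who"])
uris = resolve(site_table, ids)
```

For very common words the id sets themselves can still be large, so this
reduces memory use per result but does not remove the need for on-disk
temp files.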
