Re: [freenet-dev] Should the spider ignore common words?

Daniel Cheng Wed, 10 Jun 2009 00:50:03 -0700

On Wed, Jun 10, 2009 at 3:18 PM, Evan Daniel<[email protected]> wrote:
> On Wed, Jun 10, 2009 at 2:56 AM, Daniel Cheng<[email protected]> wrote:
>> On Wed, Jun 10, 2009 at 2:06 PM, Evan Daniel<[email protected]> wrote:
>>> On Wed, Jun 10, 2009 at 1:54 AM, Daniel Cheng<[email protected]> wrote:
>>>> On Wed, Jun 10, 2009 at 12:02 PM, Evan Daniel<[email protected]> wrote:
>>>>> On my (incomplete) spider index, the index file for the word "the" (it
>>>>> indexes no other words) is 17MB.  This seems rather large.  It might
>>>>> make sense to have the spider not even bother creating an index on a
>>>>> handful of very common words (the, be, to, of, and, a, in, I, etc).
>>>>> Of course, this presents the occasional difficulty:
>>>>> http://bash.org/?514353  I think I'm in favor of not indexing common
>>>>> words even so.
>>>>
>>>> Yes, it should ignore common words.
>>>> These are called "stopwords" in search engine terminology.
>>>>
>>>>>
>>>>> Also, on a related note, the index splitting policy should be a bit
>>>>> more sophisticated: in an attempt to fit within the max index size as
>>>>> configured, it split all the way down to index_8fc42.xml.  As a
>>>>> result, the file index_8fc4b.xml sits all by itself at 3KiB.  It
>>>>> contains the two words "vergessene" and "txjmnsm".  I suspect it would
>>>>> have reliability issues should anyone actually want to search either
>>>>> of those.  It would make more sense to have all of index_8fc4 in one
>>>>> file, since it would be only trivially larger.  (I have a patch that I
>>>>> thought did that, but it has a bug; I'll test once my indexwriter is
>>>>> finished writing, since I don't want to interrupt it by reloading the
>>>>> plugin.)
>>>>
>>>> "trivially larger" ...
>>>> ugh... how trivial is trivial?
>>>>
>>>> XMLLibrarian can handle index_8fc42.xml on its own, with all other
>>>> 8fc4 entries in index_8fc4.xml.
>>>> However, as I have stated on IRC, that makes index generation even slower.
>>>
>>> 8fc42 is 17382 KiB.  All other 8fc4 are 79 KiB combined.
>>>
>>> Also, it would make index generation faster.  The spider first does
>>> all the work of creating 8fc4, then discards it to recreate the
>>> sub-indexes.  The vast majority of this work is in 8fc42, which gets
>>> created twice.  Not splitting the index would nearly halve the time to
>>
>> It doesn't get created twice; it shortcuts early.
>> See the estimateSize variable in IndexWriter.
>
> Unless I'm mistaken, the slow part of the index creation is the
> term.getPages() call.  That call is where all the disk io hides, no?


no :)
getPages() returns an IPersistentSet (ScalableSet), which is lazily evaluated.

Internally, it is a linked set when small and a B-tree when large.
The result of .size() is always cached.
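
To illustrate the point, here is a rough sketch of how a cached size lets a size estimate run without loading any elements. It is not the actual ScalableSet or IndexWriter code: LazyPageSet, SizeShortcutDemo, exceedsLimit and the bytesPerPage figure are hypothetical names and numbers made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for a lazily evaluated persistent set (not the real ScalableSet). */
final class LazyPageSet {
    private final int cachedSize;                 // kept with the set, readable without loading
    private final Supplier<List<String>> loader;  // the expensive part (disk I/O in a real store)
    private List<String> pages;                   // stays null until the elements are first needed
    int loads = 0;                                // counts how many times the loader actually ran

    LazyPageSet(int cachedSize, Supplier<List<String>> loader) {
        this.cachedSize = cachedSize;
        this.loader = loader;
    }

    /** Answers from the cached value; never triggers the loader. */
    int size() {
        return cachedSize;
    }

    /** Loads the elements on first access only. */
    List<String> pages() {
        if (pages == null) {
            pages = loader.get();
            loads++;
        }
        return pages;
    }
}

final class SizeShortcutDemo {
    /** Returns true as soon as the running estimate passes maxBytes, using only size(). */
    static boolean exceedsLimit(List<LazyPageSet> terms, long bytesPerPage, long maxBytes) {
        long estimate = 0;
        for (LazyPageSet term : terms) {
            estimate += (long) term.size() * bytesPerPage;
            if (estimate > maxBytes) {
                return true; // shortcut: no element data was touched
            }
        }
        return false;
    }

    public static void main(String[] args) {
        LazyPageSet the = new LazyPageSet(1000, () -> Arrays.asList("page1", "page2"));
        boolean tooBig = exceedsLimit(Arrays.asList(the), 100L, 50_000L);
        System.out.println("too big: " + tooBig + ", loads so far: " + the.loads);
    }
}
```

In this sketch the decision to split is made from size() alone, so the costly load only happens when the pages are actually written out.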

> The "shortcut" doesn't occur until after that call returns.  As
> discussed above, "the" accounts for about 99.5% of the whole index,
> and therefore (I'm assuming) 99.5% of the disk io.  And that 99.5%
> happens twice.
>
> The shortcut only functions properly when the largest term accounts
> for a modest fraction of the total work, which is exactly what isn't
> happening here.
>
> Evan Daniel
>
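
On the original stopword question, a minimal sketch of what skipping common words at indexing time could look like. This is only an illustration: StopwordFilter and indexableTerms are hypothetical names, not part of the spider, and the word list is just the handful mentioned at the start of the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper; the real spider may tokenize and filter differently. */
final class StopwordFilter {
    // Illustrative list only, taken from the examples in this thread.
    private static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("the", "be", "to", "of", "and", "a", "in", "i"));

    /** Case-insensitive membership test against the stopword list. */
    static boolean isStopword(String word) {
        return STOPWORDS.contains(word.toLowerCase(Locale.ROOT));
    }

    /** Splits text on non-word characters and drops stopwords; result is lowercased. */
    static List<String> indexableTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.split("\\W+")) {
            if (!word.isEmpty() && !isStopword(word)) {
                terms.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(indexableTerms("The spider is in the index"));
    }
}
```

With a filter like this, no index file would be created for "the" at all, at the cost of queries that consist only of stopwords returning nothing.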
_______________________________________________
Devl mailing list
[email protected]
http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl
