Re: [freenet-dev] Should the spider ignore common words?

Daniel Cheng Wed, 10 Jun 2009 01:27:55 -0700

On Wed, Jun 10, 2009 at 4:06 PM, Evan Daniel<[email protected]> wrote:
> On Wed, Jun 10, 2009 at 3:49 AM, Daniel Cheng<[email protected]> wrote:
>> On Wed, Jun 10, 2009 at 3:18 PM, Evan Daniel<[email protected]> wrote:
>>> On Wed, Jun 10, 2009 at 2:56 AM, Daniel Cheng<[email protected]> wrote:
>>>> On Wed, Jun 10, 2009 at 2:06 PM, Evan Daniel<[email protected]> wrote:
>>>>> On Wed, Jun 10, 2009 at 1:54 AM, Daniel Cheng<[email protected]> wrote:
>>>>>> On Wed, Jun 10, 2009 at 12:02 PM, Evan Daniel<[email protected]> wrote:
>>>>>>> On my (incomplete) spider index, the index file for the word "the" (it
>>>>>>> indexes no other words) is 17MB.  This seems rather large.  It might
>>>>>>> make sense to have the spider not even bother creating an index on a
>>>>>>> handful of very common words (the, be, to, of, and, a, in, I, etc).
>>>>>>> Of course, this presents the occasional difficulty:
>>>>>>> http://bash.org/?514353  I think I'm in favor of not indexing common
>>>>>>> words even so.
>>>>>>
>>>>>> Yes, it should ignore common words.
>>>>>> This is called a "stopword" in search engine terminology.
>>>>>>
>>>>>>>
>>>>>>> Also, on a related note, the index splitting policy should be a bit
>>>>>>> more sophisticated: in an attempt to fit within the max index size as
>>>>>>> configured, it split all the way down to index_8fc42.xml.  As a
>>>>>>> result, the file index_8fc4b.xml sits all by itself at 3KiB.  It
>>>>>>> contains the two words "vergessene" and "txjmnsm".  I suspect it would
>>>>>>> have reliability issues should anyone actually want to search either
>>>>>>> of those.  It would make more sense to have all of index_8fc4 in one
>>>>>>> file, since it would be only trivially larger.  (I have a patch that I
>>>>>>> thought did that, but it has a bug; I'll test once my indexwriter is
>>>>>>> finished writing, since I don't want to interrupt it by reloading the
>>>>>>> plugin.)
>>>>>>
>>>>>> "trivially larger" ...
>>>>>> ugh... how trivial is trivial?
>>>>>>
>>>>>> The xmllibrarian can handle index_8fc42.xml on its own, with all the
>>>>>> other 8fc4 entries in index_8fc4.xml.
>>>>>> However, as I have stated on IRC, that makes index generation even slower.
>>>>>
>>>>> 8fc42 is 17382 KiB.  All other 8fc4 are 79 KiB combined.
>>>>>
>>>>> Also, it would make index generation faster.  The spider first does
>>>>> all the work of creating 8fc4, then discards it to recreate the
>>>>> sub-indexes.  The vast majority of this work is in 8fc42, which gets
>>>>> created twice.  Not splitting the index would nearly halve the time to
>>>>
>>>> It doesn't get created twice; it shortcuts early.
>>>> See the estimateSize variable in IndexWriter.
>>>
>>> Unless I'm mistaken, the slow part of the index creation is the
>>> term.getPages() call.  That call is where all the disk io hides, no?
>>
>> no :)
>> getPages() returns an IPersistentSet (ScalableSet), which is lazily evaluated.
>>
>> Internally, it is a linked set when small and a b-tree when large.
>> The result of .size() is always cached.
>
> In this case, I don't think it helps.  13 bytes is a gross
> underestimate of the size adding a page adds to the file.
> estimateSize isn't checked again until all the pages have been added.


Let's see if this patch makes it faster:
  
http://github.com/j16sdiz/plugin-XMLSpider/commit/d6104814d41521d519f3452fcf3e3d90f795b0be


> Furthermore, that leaves the timing unexplained.

Lots of time is spent in the higher-level loops:
    private boolean generateXML(PerstRoot perstRoot, String prefix) throws IOException {
        [...]
        IterableIterator<Term> termIterator = perstRoot.getTermIterator(prefix, prefix + "g");
        for (Term term : termIterator) {
            [...]
        }

getTermIterator() walks the b-tree and fetches a few pages from it.
estimatedSize is calculated as we iterate through the termIterator.
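
To make the loop above concrete, here is a minimal sketch of the same
pattern: a range scan over hashed term keys that keeps a running size
estimate and bails out as soon as the limit is passed. It is not the
XMLSpider code. The class name, the TreeMap standing in for the Perst
b-tree, and the 50-byte per-term cost are all assumptions made for
illustration; the 13 bytes per page is the figure discussed above.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; a TreeMap stands in for the Perst term index. */
final class PrefixEstimateSketch {
    // Assumed fixed cost per term, in bytes (illustrative guess).
    static final long BYTES_PER_TERM = 50;
    // Per-page cost from the discussion above (said to be an underestimate).
    static final long BYTES_PER_PAGE = 13;

    /**
     * Returns true if all terms whose hash starts with prefix are estimated
     * to fit in maxBytes. Keys are lowercase hex hashes; values are the
     * cached page counts, so no page data is read.
     */
    static boolean fitsInOneFile(TreeMap<String, Integer> pageCountByHash,
                                 String prefix, long maxBytes) {
        long estimate = 0;
        // 'g' is the character right after 'f', so the half-open range
        // [prefix, prefix + "g") covers every lowercase-hex key that
        // starts with prefix.
        for (Map.Entry<String, Integer> e
                : pageCountByHash.subMap(prefix, prefix + "g").entrySet()) {
            estimate += BYTES_PER_TERM + BYTES_PER_PAGE * e.getValue();
            if (estimate > maxBytes) {
                return false; // shortcut: stop at the first term over the limit
            }
        }
        return true;
    }
}
```

Because the check sits inside the loop, one huge term such as "the" ends
the scan immediately instead of after every page has been counted.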

> It takes as long to generate b70 as all the rest of b7* combined.  This is
> fairly consistent across the whole set of files (obviously some variation
> is present).
>
> 2009-06-10 02:59 index_b6e.xml
> 2009-06-10 03:00 index_b6f.xml
> 2009-06-10 03:16 index_b70.xml
_______________________________________________
Devl mailing list
[email protected]
http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl
