Sander Bokhorst wrote:
> 
> So if words change in a page which is being reindexed, shouldn't the
> positions of words in the word->url table be changed as well?
> If I read method 2 correctly, this implies this isn't done....

That was, well, a deliberately simple description that skipped
all the gory details. Yes, the positions are certainly changed too.
In fact, what is stored is a list of positions for each URL.
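
To make this concrete, here is a minimal sketch in Python of a delta-based
reindex that also updates positions. It is an illustration only, not
aspseek's actual code: the names positions_by_word and reindex, the
in-memory dict layout, and the whitespace tokenization are all assumptions
made for the example.

```python
from collections import defaultdict


def positions_by_word(text):
    """Map each word of a page to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        positions[word].append(pos)
    return dict(positions)


def reindex(wordurl, url_id, cached_text, new_text):
    """Update wordurl (word -> {url_id: [positions]}) for one changed page.

    Only words whose position lists differ between the cached copy and
    the new text are touched; the rest of the index is never scanned.
    """
    old = positions_by_word(cached_text)
    new = positions_by_word(new_text)

    # Words that disappeared from the page: drop this URL from their record.
    for word in old.keys() - new.keys():
        urls = wordurl.get(word, {})
        urls.pop(url_id, None)
        if not urls:
            wordurl.pop(word, None)

    # New words, and words that stayed but moved: store the new positions.
    for word, pos_list in new.items():
        if old.get(word) != pos_list:
            wordurl.setdefault(word, {})[url_id] = pos_list

    return wordurl


# Hypothetical toy index: page 101 was "memory penny", page 102 is "penny".
wordurl = {"memory": {101: [0]}, "penny": {101: [1], 102: [0]}}
reindex(wordurl, 101, "memory penny", "penny wise")
```

In this toy run, "memory" disappears from page 101 and its record is
dropped, "penny" moves from position 1 to 0 and gets its list rewritten,
and "wise" is added. The entry for page 102 is never touched, and the old
word list comes from the cached copy rather than from a scan of the index.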

> 
> 2-7-02 16:29:16, Kir Kolyshkin <[EMAIL PROTECTED]> wrote:
> 
> >yayaivan wrote:
> >>
> >> Hi,
> >>
> >> I don't use citations. Because they take a lot of disk space, I deleted everything
> >> from the citation table and set "IncrementalCitations no" in aspseek.conf and
> >> searchd.conf
> >
> >I wonder who told you that you can do so?
> >
> >> But now the indexer is running in a strange manner. After it finishes indexing
> >> sites, I notice that some processes still work, and they are inserting
> >> data into the citation table :(
> >> I look in conf files, and notice only now this : "You MUST NOT
> >> change value of this parameter for not empty database". but I already did it:(
> >> How can I now correctly stop indexer work with citations?
> >
> >No way. A cached copy of each file is needed for correct reindexing
> >of the page. Let's assume that you have a page with two words in
> >it: "memory" and "penny". Upon the first indexing, its compressed
> >cached copy is saved in the database, and then a URL_ID is assigned
> >to the page (let's assume it is 101).
> >
> >Next, words are saved into the inverted index: word -> urls. So, we have two
> >records in the wordurl table:
> >
> >....
> >memory -> 101
> >....
> >penny -> 101
> >
> >(Actually the word position and some other info are saved together with
> >the URL_ID, but I will skip that here for clarity).
> >
> >Now note that the words "memory" and "penny" can appear not only
> >in this page, but on many other pages as well. And there are
> >countless words. So we actually do end up with a very
> >big table.
> >
> >During the next reindexing, if the document has changed, we need to
> >clear the words that are no longer in the document, and add new words.
> >This can be done in two ways:
> >
> >1. Remove URL_ID 101 from all tables, and add all words.
> >   This is very inefficient because finding all occurrences of 101
> >   in all wordurls can take several minutes.
> >
> >2. Find out what words have disappeared from the page and are
> >   to be deleted, and what new words are found in the page and
> >   are to be inserted.
> >
> >Method number 2 is more practical, but we need to know what words
> >were in the document when it was indexed the previous time. Again,
> >scanning all wordurl records takes way too long.
> >
> >That's why aspseek saves a copy of the page indexed, and uses
> >it upon reindexing to create a "delta" (changes) between
> >two versions of the page. If you have deleted this copy,
> >index is just not able to work any more.
> >
> >And last, but not least. The option "IncrementalCitations" does not
> >control whether a cached copy of the document is saved. It just turns on
> >a special enhanced mode of reindexing which is faster and requires
> >less memory, but is not compatible with the aspseek-1.0 format.
> >So it is here just for backward compatibility, and probably
> >will be removed in aspseek-1.3.
> >
> >-- [EMAIL PROTECTED] ICQ UIN 7551596 Phone +7 903 6722750 --
> >   Guinness a Day Keeps a Doctor Away (people's wisdom)
> >

-- [EMAIL PROTECTED] ICQ UIN 7551596 Phone +7 903 6722750 --
   Guinness a Day Keeps a Doctor Away (people's wisdom)
