Redirecting to a more appropriate lucene-user list. Some answers are inlined.
--- Rob McCool <[EMAIL PROTECTED]> wrote:

> Hello, I'm trying to have Lucene index a cache of HTML pages. The URL of
> each page is its primary identifier. We would also like to record link
> text for each page; that is, for all other pages that link to a page,
> record a Field called "linktext" which contains a stream of all the words
> that are used to link to that page. We can then change scoring priorities
> based on whether the page contains a word, and also on whether links to
> that page contain a word.

Nutch is probably not a solution for you, but it does what you described
above and much more. See nutch.org.

> I'm having trouble due to the model that the IndexWriter and Document
> interfaces use. My current algorithm is to create a new Document each time
> we put a page into the cache, or each time we encounter link text for a
> page. Any prior Documents found in the index corresponding to that URL
> have their Fields copied to the new one. The old Documents are then
> deleted, and the new Document is added to the index.
>
> The problem I have is that this is terribly slow. Because the delete()
> method is on IndexReader, I have to continually close and re-open
> IndexWriters and IndexReaders to avoid locking issues. This also results
> in a large number of deleted Documents.

Deleting and adding are best done in batches: first delete everything you
need to delete, then add everything you need to add (a sketch follows at
the end of this message).

> I also have to mark "contents", the Field for the body of each HTML
> document, as a storedTermVector Field. This results in larger indices.
> The impact on indexing speed is noticeable: I clocked it at about 200%
> slower on a relatively small set of 5000 pages.

Larger indices make sense, and slower indexing makes sense too, but 200%
seems too high to me unless your HTML pages are huge.

> My original strategy was to perform a two-pass indexing, with the first
> pass just recording the outgoing hyperlinks for each document, and the
> second pass scanning all link text into memory and then copying each
> Document into a new index with all of the information together. But this
> means that if we do an incremental crawl of a fast-turnover site, the
> entire index needs to be recomputed even if we only added a handful of
> pages.
>
> What I'd like to do is to simply add a set of Terms to the postings for
> an existing document every time I see new link text for it. I could then
> use placeholders for documents whose contents I haven't seen yet, and
> pull in link text from the placeholders when writing the document's
> contents. However, I can't figure out how to just add some entries to the
> posting table for a particular Term with the current indexing system.

To update a Field or Document you have to delete and re-add the Document,
unfortunately.

Otis
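
For reference, here is a minimal sketch of the batched delete-then-add
pattern described above, together with building a Document that carries both
body text and accumulated anchor text. It assumes a later Lucene 3.x-style
API (this thread predates it, so the exact classes and constructors differ in
older releases); the field names "url", "contents", and "linktext" are taken
from the thread, and everything else is illustrative.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.util.Map;

public class BatchedReindexer {

    /** Delete all stale documents in one pass, then add the rebuilt ones in one pass. */
    public static void reindex(File indexDir, Map<String, Document> updatedByUrl) throws Exception {
        Directory dir = FSDirectory.open(indexDir);

        // Pass 1: batch all deletes on a single IndexReader.
        IndexReader reader = IndexReader.open(dir, false); // false = writable, allows deletes
        for (String url : updatedByUrl.keySet()) {
            reader.deleteDocuments(new Term("url", url));
        }
        reader.close();

        // Pass 2: batch all adds on a single IndexWriter.
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        for (Document doc : updatedByUrl.values()) {
            writer.addDocument(doc);
        }
        writer.close();
    }

    /** Build a page document holding body text plus the accumulated anchor ("link") text. */
    public static Document makeDocument(String url, String bodyText, String linkText) {
        Document doc = new Document();
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", bodyText, Field.Store.NO, Field.Index.ANALYZED,
                Field.TermVector.YES)); // term vector kept, as described in the thread
        doc.add(new Field("linktext", linkText, Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }
}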
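
And a sketch of the scoring idea from the original question: match a word in
either the page body or its incoming anchor text, weighting anchor-text
matches more heavily. This again assumes the same Lucene 3.x-style API and
field names as above; the boost factor of 3 is an arbitrary illustration, not
a recommendation.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class LinkTextQuery {
    public static BooleanQuery build(String word) {
        TermQuery inBody = new TermQuery(new Term("contents", word));
        TermQuery inLinks = new TermQuery(new Term("linktext", word));
        inLinks.setBoost(3.0f); // assumption: weight anchor-text hits 3x the body

        BooleanQuery q = new BooleanQuery();
        q.add(inBody, BooleanClause.Occur.SHOULD);
        q.add(inLinks, BooleanClause.Occur.SHOULD);
        return q;
    }
}

With something like this, pages whose incoming anchor text contains the query
word score above pages that only mention it in the body.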