Redirecting to a more appropriate lucene-user list. Some answers are inlined.
--- Rob McCool <[EMAIL PROTECTED]> wrote:

> Hello, I'm trying to have Lucene index a cache of HTML pages. The URL of
> each page is its primary identifier. We would also like to record link
> text for each page; that is, for all other pages that link to a page,
> record a Field called "linktext" which contains a stream of all the words
> that are used to link to that page. We can then change scoring priorities
> based on whether the page contains a word, and also on whether links to
> that page contain a word.

Nutch is probably not a solution for you, but it does what you described
above and much more. See nutch.org.

> I'm having trouble due to the model that the IndexWriter and Document
> interfaces use. My current algorithm is to create a new Document each time
> we put a page into the cache, or each time we encounter link text for a
> page. Any prior Documents found in the index corresponding to that URL
> have their Fields copied to the new one. The old Documents are then
> deleted, and the new Document is added to the index.
>
> The problem I have is that this is terribly slow. Because the delete()
> method is on IndexReader, I have to continually close and re-open
> IndexWriters and IndexReaders to avoid locking issues. This also results
> in a large number of deleted Documents.

Deleting and adding are best done in batches: first delete everything you
need to delete, then add everything you need to add (a sketch follows at
the end of this message).

> I also have to mark "contents", the Field for the body of each HTML
> document, as a storedTermVector Field. This results in larger indices.
> The impact on indexing speed is noticeable: I clocked it at about 200%
> slower on a relatively small set of 5000 pages.

Larger indices make sense, and slower indexing makes sense too, but 200%
seems too high to me unless your HTML pages are huge.

> My original strategy was to perform a two-pass indexing, with the first
> pass just recording the outgoing hyperlinks for each document, and the
> second pass scanning all link text into memory and then copying each
> Document into a new index with all of the information together. But this
> means that if we do an incremental crawl of a fast-turnover site, the
> entire index needs to be recomputed even if we only added a handful of
> pages.
>
> What I'd like to do is to simply add a set of Terms to the postings for
> an existing document every time I see new link text for it. I could then
> use placeholders for documents whose contents I haven't seen yet, and
> pull in link text from the placeholders when writing the document's
> contents. However, I can't figure out how to just add some entries to the
> posting table for a particular Term with the current indexing system.

To update a Field or Document you have to delete and re-add the Document,
unfortunately.

Otis
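
For reference, here is a minimal sketch of the batched delete-then-add
pattern described above, together with building a Document that carries both
body text and accumulated anchor text. It assumes a later Lucene 3.x-style
API (this thread predates it, so the exact classes and constructors differ in
older releases); the field names "url", "contents", and "linktext" are taken
from the thread, and everything else is illustrative.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.util.Map;

public class BatchedReindexer {

    /** Delete all stale documents in one pass, then add the rebuilt ones in one pass. */
    public static void reindex(File indexDir, Map<String, Document> updatedByUrl) throws Exception {
        Directory dir = FSDirectory.open(indexDir);

        // Pass 1: batch all deletes on a single IndexReader.
        IndexReader reader = IndexReader.open(dir, false); // false = writable, allows deletes
        for (String url : updatedByUrl.keySet()) {
            reader.deleteDocuments(new Term("url", url));
        }
        reader.close();

        // Pass 2: batch all adds on a single IndexWriter.
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        for (Document doc : updatedByUrl.values()) {
            writer.addDocument(doc);
        }
        writer.close();
    }

    /** Build a page document holding body text plus the accumulated anchor ("link") text. */
    public static Document makeDocument(String url, String bodyText, String linkText) {
        Document doc = new Document();
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", bodyText, Field.Store.NO, Field.Index.ANALYZED,
                Field.TermVector.YES)); // term vector kept, as described in the thread
        doc.add(new Field("linktext", linkText, Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }
}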
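
And a sketch of the scoring idea from the original question: match a word in
either the page body or its incoming anchor text, weighting anchor-text
matches more heavily. This again assumes the same Lucene 3.x-style API and
field names as above; the boost factor of 3 is an arbitrary illustration, not
a recommendation.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class LinkTextQuery {
    public static BooleanQuery build(String word) {
        TermQuery inBody = new TermQuery(new Term("contents", word));
        TermQuery inLinks = new TermQuery(new Term("linktext", word));
        inLinks.setBoost(3.0f); // assumption: weight anchor-text hits 3x the body

        BooleanQuery q = new BooleanQuery();
        q.add(inBody, BooleanClause.Occur.SHOULD);
        q.add(inLinks, BooleanClause.Occur.SHOULD);
        return q;
    }
}

With something like this, pages whose incoming anchor text contains the query
word score above pages that only mention it in the body.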