Hello, I'm trying to have Lucene index a cache of HTML pages. The URL of each page is its primary identifier. We would also like to record link text for each page, that is, for all other pages that link to a page, record a Field called "linktext" which contains a stream of all the words that are used to link to that page. We can then change scoring priorities based on whether the page contains a word, and also on whether links to that page contain a word.
I'm having trouble due to the model that the IndexWriter and Document interfaces use. My current algorithm is to create a new Document each time we put a page into the cache, or each time we encounter link text for a page. Any prior Documents found in the index corresponding to that URL have their Fields copied to the new one. The old documents are then deleted, and the new Document is added to the index. The problem I have is that this is terribly slow. Because the delete() method is on IndexReader, I have to continually close and re-open IndexWriters and IndexReaders to avoid locking issues. This also results in a large number of deleted Documents. I also have to mark "contents", the Field for the body of each HTML document, as a storedTermVector Field. This results in larger indices. The impact on indexing speed is noticable: I clocked it at about 200% slower on a relatively small set of 5000 pages. My original strategy was to perform a two-pass indexing, with the first pass just recording the outgoing hyperlinks for each document, and the second pass scanning all link text into memory and then copying each Document into a new index with all of the information together. But this means that if we do an incremental crawl of a fast-turnover site, then the entire index needs to be recomputed even if we only added a handful of pages. What I'd like to do is to simply add a set of Terms to the postings for an existing document every time I see new link text for it. I could then use placeholders for documents I hadn't seen the contents of yet, and pull in link text from the placeholders when writing the document's contents. However, I can't figure out how to just add some entries to the posting table for a particular Term with the current indexing system. Can you give me some suggestions about how to accomplish this? Thanks, Rob --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]