Doug Cutting wrote:

>How awkward is it to open a reader, delete a document, close it, open a
>writer, add a document, and then close the writer? If that's really too
>much work, we could add a utility method to encapsulate it. However, if
>you're updating more than a single document, it's much more efficient to
>first do all the deletions, then do all the additions.

That's just it - while you are busy re-crawling a web site (which can take
a substantial amount of time), there will be a window during which the user
will not find any documents from that web site - neither old nor new. Maybe
the answer is to re-crawl into a different index directory and then move
the files...
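
A minimal sketch of the "different index directory" idea, assuming the new index has already been built in a sibling directory on the same file system. The class and method names (IndexSwap, publish) are hypothetical and not part of Lucene; only standard java.nio.file calls are used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Lucene: build the new index elsewhere,
// then rename it into place once the crawl has finished.
class IndexSwap {

    // Publishes freshDir as liveDir. Both are assumed to sit on the same
    // file system, so each move is a cheap rename rather than a copy.
    static void publish(Path freshDir, Path liveDir) throws IOException {
        Path retired = liveDir.resolveSibling(liveDir.getFileName() + ".old");
        deleteRecursively(retired);        // leftovers from an earlier swap
        if (Files.exists(liveDir)) {
            Files.move(liveDir, retired);  // step 1: set the old index aside
        }
        Files.move(freshDir, liveDir);     // step 2: the new index goes live
        deleteRecursively(retired);
    }

    // Deletes a directory tree, children before parents.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

There is still a brief gap between the two renames, but it should be far shorter than the time a re-crawl takes.
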
>So adding that
>utility method might then encourage folks to write inefficient code.
>
>Perhaps the utility method to add is something like:
>  void updateDocs(Document[] docs, String idField);
>This would delete any documents currently in an index that have the same
>value for 'idField' as a document in 'docs', then add all the documents in
>docs. This API would encourage batching. Its implementation would be to
>open a reader, do the deletions, close the reader, open a writer, do the
>additions, then close the writer.
>
>
>>Document ids are, of course, segment-specific and change during merge.
>>This makes searches fast, but it makes it impossible to identify a
>>document. But what if we add a "special" field, or add a unique document
>>id in some other way? The searches will still use the segment-specific
>>ids and remain fast, but there would be a unique id assigned to each
>>document that applications could use if needed and also the replace
>>operation could use in the IndexWriter. Obviously, we would have to make
>>sure that these ids can be created quickly by multiple writers without a
>>possibility of duplicate ids.
>>
>>Would this work?
>>
>
>Sure, it *could* work. But we'd need to add a new special dictionary for
>document ids that is written to disk. This would be smaller and hence
>faster to access than the term dictionary that is now used for document ids.
>All of the indexing code (creating, merging, reading) would have to be
>modified to support this id dictionary. And still, batched deletions would
>be faster than intermingled insertion/deletion, just not as much. Is it
>worth it? The current use of document fields for unique ids builds on
>existing code, which is nice.

Hm. I'm afraid I don't follow. I'm not sure what the extra dictionary will
be needed for (I think I know, but I'm not sure). Also, your proposal above
seems almost the same as mine (with the updateDocs(Document[], String)
method).

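
The batched update quoted above (all deletions first, then all additions) can be sketched with a toy in-memory index. This is an illustration only: ToyIndex is a hypothetical stand-in, documents are modelled as plain maps, and no Lucene reader or writer is involved.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy in-memory stand-in for an index; none of this is Lucene API.
class ToyIndex {
    final List<Map<String, String>> docs = new ArrayList<>();

    // Mirrors the shape of the proposed updateDocs(Document[] docs, String idField):
    // one deletion pass for the whole batch, then one addition pass.
    void updateDocs(List<Map<String, String>> batch, String idField) {
        // Collect the ids that are about to be replaced.
        Set<String> ids = new HashSet<>();
        for (Map<String, String> d : batch) {
            String id = d.get(idField);
            if (id != null) {
                ids.add(id);
            }
        }
        // "Reader" phase: delete every existing document with one of those ids.
        docs.removeIf(d -> ids.contains(d.get(idField)));
        // "Writer" phase: add all the replacement documents.
        docs.addAll(batch);
    }
}
```

Note that between the two phases the replaced documents are absent from the index, which is exactly the window discussed in this thread.
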
What's missing is an ability to record deletions in the segment containing
the replacing document rather than the segment containing the document
being replaced. If we can do that, the deletions will become atomic with
the additions of the replacing documents, which I think would be great.
What I have in mind is a way to record pending deletions, which become
effective during a merge. Does this still require an extra dictionary and
extra work?

Dmitry
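
P.S. To make the pending-deletions idea concrete, here is a toy model, assuming every document carries a unique id. The names (ToySegment, ToySegmentedIndex, pendingDeletes) are hypothetical and do not correspond to anything in the Lucene code; search-time behaviour is not modelled, only what a merge would do.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only; the class and field names are hypothetical, not Lucene's.
class ToySegment {
    final Map<String, String> docs = new LinkedHashMap<>();  // unique id -> content
    final Set<String> pendingDeletes = new HashSet<>();      // ids superseded in older segments
}

class ToySegmentedIndex {
    final List<ToySegment> segments = new ArrayList<>();     // oldest first

    // Writes replacement documents as one new segment. The deletions of the
    // old versions are recorded in that same segment, so both arrive together.
    void addReplacements(Map<String, String> replacements) {
        ToySegment s = new ToySegment();
        s.docs.putAll(replacements);
        s.pendingDeletes.addAll(replacements.keySet());
        segments.add(s);
    }

    // Merges all segments into one, walking from newest to oldest and
    // applying each segment's pending deletions to every older segment.
    void merge() {
        Set<String> deleted = new HashSet<>();
        ToySegment merged = new ToySegment();
        for (int i = segments.size() - 1; i >= 0; i--) {
            ToySegment s = segments.get(i);
            for (Map.Entry<String, String> e : s.docs.entrySet()) {
                if (!deleted.contains(e.getKey())) {
                    merged.docs.put(e.getKey(), e.getValue());
                }
            }
            deleted.addAll(s.pendingDeletes);
        }
        segments.clear();
        segments.add(merged);
    }
}
```
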
