Any chance I could get my hands on the code to "de-dup"? My current method is, I think, quite sub-optimal, as I am searching the index for a duplicate on every insert.
Or any other suggestions on good ways to prevent duplicates? I am indexing with a field that has a unique ID, so it should be fairly straightforward.

Thanks.

On Mon, 2006-05-22 at 19:01 -0700, Eugene Tuan wrote:
> I have created a method that can delete duplicate docs. Basically,
> during indexing, a doc is associated with an id (a term field defined by
> you) that is indexed. Then, call the method to delete duplicates
> whenever you update the index.
>
> I haven't contributed it back to the Lucene community yet because our code is
> in beta testing now.
>
> My former colleague, Chris, has received agreement from Doug Cutting
> since last August that this feature is nice to have.
>
> Eugene
>
>
> -----Original Message-----
> From: Omar Didi [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 22, 2006 6:47 PM
> To: java-user@lucene.apache.org
> Subject: RE: Checking for duplicates inside index
>
> you have two choices that I can think of:
> 1- before adding a document, check if it doesn't exist in the index. you
> can do this by querying on a unique field if you have it.
> 2- you can index all your documents, and once the indexing is done you
> can dedupe. (Lucene has built in methods that can help with this)
>
>
> if your index doesn't have a unique key, you need to add one like the
> one you suggested.
>
> -----Original Message-----
> From: karl wettin [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 22, 2006 6:05 PM
> To: java-user@lucene.apache.org
> Subject: Re: Checking for duplicates inside index
>
>
> On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:
> >
> > I'm indexing ~10000 documents per day but since I'm getting a lot of
> > real duplicates (100% the same document content) I want to check the
> > content before indexing...
> >
> > My idea is to create a checksum of the document's content and store it
> > within the document inside the index; before indexing a new document I
> > will compare the new document's checksum with the ones in the index.
> >
> > Is that a good idea? does someone have experience with that method?
> > any tools available?
>
> That could work.
>
> You will need a big sum though. MD5?
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
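
A minimal sketch of the checksum idea discussed in this thread, in plain Java with no Lucene calls: it hashes each document's content with MD5 and skips any content whose digest has already been seen. The class name ChecksumDedup, the methods md5Hex and addIfNew, and the in-memory HashSet standing in for a lookup on a checksum field in the index are all assumptions made for illustration, not code posted by anyone in the thread.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: skip documents whose content checksum has been seen before. */
final class ChecksumDedup {

    // Hypothetical stand-in for "checksums already in the index". In a real
    // setup this would be a lookup on an indexed checksum field (assumption).
    private final Set<String> seen = new HashSet<>();

    /** Returns the MD5 digest of the text as 32 lowercase hex characters. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the content is new (and records it), false if it is a duplicate. */
    boolean addIfNew(String content) {
        return seen.add(md5Hex(content));
    }

    public static void main(String[] args) {
        ChecksumDedup dedup = new ChecksumDedup();
        List<String> accepted = new ArrayList<>();
        String[] docs = {"first document", "second document", "first document"};
        for (String doc : docs) {
            if (dedup.addIfNew(doc)) {
                accepted.add(doc); // here the document would be handed to the indexer
            }
        }
        System.out.println(accepted.size() + " of " + docs.length + " documents accepted");
    }
}
```

In a real index the set lookup would presumably become a query on the stored checksum field. MD5 should be adequate for catching accidental exact duplicates, though it is not collision-resistant against deliberately crafted input.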