[GitHub] [lucenenet] l1x opened a new issue #487: Not sure how to avoid duplicates

GitBox Sat, 08 May 2021 13:35:23 -0700


l1x opened a new issue #487:
URL: https://github.com/apache/lucenenet/issues/487



   I have a very simple setup: I would like to index database rows. Each row 
has a unique unsigned 32-bit integer id and some text fields.
   
   ```md
   | id | brand | name                       |
   |----|-------|----------------------------|
   | 1  | Nikon | Nikon AF-S DX Nikkor 35 mm |
   | 2  | Canon | Canon EF-S 55-250 mm       |
   ```
   
   When I index these rows, Lucene produces duplicates: the whole dataset 
is added again each time.
   
   I read that IndexWriter.UpdateDocument() has to be used to avoid this 
behaviour, but I am not sure how to use the Term properly.
   
   ```Fsharp
         // id is the id from the database (uint32)
   
         let id = doc.GetField("id").GetStringValue()
         let term = Term("id", id)
         writer.UpdateDocument(term, doc)
   
   ```
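
For comparison, a minimal sketch of a Term-keyed upsert. It rests on one assumption that differs from the code in this report: the id is added as a `StringField`, which is indexed as a single untokenized term, rather than as a `StoredField`, which is stored but not indexed and so cannot be matched by a `Term`. The helper names `makeDoc` and `upsert` are hypothetical.

```Fsharp
open Lucene.Net.Documents
open Lucene.Net.Index

// Hypothetical helper: builds a document whose id is indexed as one exact term.
// Assumption: StringField replaces the StoredField used elsewhere in this report.
let makeDoc (id: uint32) (name: string) =
    let doc = Document()
    doc.Add(StringField("id", string id, Field.Store.YES))
    doc.Add(TextField("name", name, Field.Store.YES))
    doc

// Hypothetical helper: deletes any document whose indexed "id" term equals
// this id, then adds the new document.
let upsert (writer: IndexWriter) (id: uint32) (name: string) =
    writer.UpdateDocument(Term("id", string id), makeDoc id name)
```

With this shape, calling `upsert` twice with the same id should leave a single document for that id once the writer commits, because `UpdateDocument` deletes every document containing the term before adding the new one.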
   Document code:
   
   ```Fsharp
   let getDocument (inputDocument:Lens) =
       let id = StoredField("id", inputDocument.Id)
       let name  = TextField("name", inputDocument.Name, Field.Store.YES)
       let doc = Document()
       doc.Add(id)
       doc.Add(name)
       // return
       doc
   ```
   
   IndexWriter.update:
   
   ```Fsharp
     let addDocumentToIndex (writer:IndexWriter) (doc:Document) =
       try
         let id = doc.GetField("id").GetStringValue()
         let term = Term("id", id)
         writer.UpdateDocument(term, doc)
         writer.Flush(triggerMerge = false, applyAllDeletes = false)
         Ok "Ok"
       with ex ->
         logger <| sprintf "Exception : %s" ex.Message
         logger <| sprintf "Exception : %A" ex.StackTrace
         Error ex.Message
   ```
   
   I am not sure what exactly the problem is, but this reliably produces 
duplicates. I have created a complete workflow to reproduce it here:
   
   https://gist.github.com/l1x/91c36b867acc70e8486a6bce7899332a
   
   


