Happy New Year Everyone :)

I am trying to automatically generate document IDs when indexing a CSV file that contains multiple lines, one document per line. The desired case: if the CSV file contains 2 lines, the index should contain 2 documents.
What I observed: if the CSV file contains 2 lines, the index contains 3 documents, because the 1st document is repeated once. Example output:

    <doc>
      <str name="col1">doc1</str>
      <str name="col2">rank1</str>
      <str name="id">randomlyGeneratedId1</str>
    </doc>
    <doc>
      <str name="col1">doc1</str>
      <str name="col2">rank1</str>
      <str name="id">randomlyGeneratedId2</str>
    </doc>
    <doc>
      <str name="col1">doc2</str>
      <str name="col2">rank2</str>
      <str name="id">randomlyGeneratedId3</str>
    </doc>

And if the CSV file contains 3 lines, the index contains 6 documents, because document 1 appears 3 times and document 2 appears twice, as follows:

    <doc>
      <str name="col1">doc1</str>
      <str name="col2">rank1</str>
      <str name="id">randomlyGeneratedId1</str>
    </doc>
    <doc>
      <str name="col1">doc1</str>
      <str name="col2">rank1</str>
      <str name="id">randomlyGeneratedId2</str>
    </doc>
    <doc>
      <str name="col1">doc2</str>
      <str name="col2">rank2</str>
      <str name="id">randomlyGeneratedId3</str>
    </doc>
    <doc>
      <str name="col1">doc1</str>
      <str name="col2">rank1</str>
      <str name="id">randomlyGeneratedId4</str>
    </doc>
    <doc>
      <str name="col1">doc2</str>
      <str name="col2">rank2</str>
      <str name="id">randomlyGeneratedId5</str>
    </doc>
    <doc>
      <str name="col1">doc3</str>
      <str name="col2">rank3</str>
      <str name="id">randomlyGeneratedId6</str>
    </doc>

Here's what I have done:

1. In my solrconfig.xml:

    <updateRequestProcessorChain name="autoGenId">
      <processor class="solr.UUIDUpdateProcessorFactory">
        <str name="fieldName">doc_key</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

    <requestHandler name="/update" class="solr.UpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">autoGenId</str>
      </lst>
    </requestHandler>

2. In schema.xml:

    <field name="doc_key" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="col1" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="col2" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <uniqueKey>id</uniqueKey>

This problem doesn't occur when I assign an ID field myself instead of using the UUIDUpdateProcessorFactory, so I assume the problem lies there?

It looks like the CSV file is processed one line at a time and the index records every intermediate step, so each previous line is repeated in the output. Is there a way to keep only the final result rather than the appended copies of previous lines, so that the total number of indexed documents matches the number of documents in the CSV file?

Many thanks,
Jia
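
P.S. To make the pattern concrete: the counts above (2 lines give 3 documents, 3 lines give 6) and the order of the output are what you would get if every line were re-added together with all the lines before it. The sketch below is only a model of that guess, not Solr code. The function names `observed_doc_count` and `replayed_docs` are made up for illustration, and the assumption that the pattern continues beyond 3 lines is mine, not something I have verified.

```python
from typing import List


def observed_doc_count(num_csv_lines: int) -> int:
    """Number of indexed documents, ASSUMING line i is re-added with lines 1..i-1.

    Under that assumption the total is 1 + 2 + ... + n.
    """
    return sum(range(1, num_csv_lines + 1))


def replayed_docs(lines: List[str]) -> List[str]:
    """Documents in index order under the same assumption (hypothetical model)."""
    out: List[str] = []
    for i in range(1, len(lines) + 1):
        # After reading line i, lines 1..i are all added again.
        out.extend(lines[:i])
    return out


if __name__ == "__main__":
    for n in (2, 3):
        print(n, "CSV lines ->", observed_doc_count(n), "documents")
    print(replayed_docs(["doc1", "doc2", "doc3"]))
```

If this model is right, a 4-line file would produce 10 documents, which would be an easy way to check it.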