Happy New Year Everyone :)
I am trying to generate document ids automatically when indexing a csv
file that contains multiple lines of documents. The desired case: if the
csv file contains 2 lines (each line is a document), then the index
should contain 2 documents.
What I observed: if the csv file contains 2 lines, then the index
contains 3 documents, because the 1st document is repeated once. An
example output:
<doc>
<str name="col1"> doc1 </str>
<str name="col2"> rank1 </str>
<str name="id"> randomlyGeneratedId1</str>
</doc>
<doc>
<str name="col1"> doc1 </str>
<str name="col2"> rank1 </str>
<str name="id"> randomlyGeneratedId2</str>
</doc>
<doc>
<str name="col1"> doc2 </str>
<str name="col2"> rank2 </str>
<str name="id"> randomlyGeneratedId3</str>
</doc>
And if the csv file contains 3 lines, then the index contains 6 documents,
because document 1 appears 3 times and document 2 appears twice,
as follows:
<doc>
<str name="col1"> doc1 </str>
<str name="col2"> rank1 </str>
<str name="id"> randomlyGeneratedId1</str>
</doc>
<doc>
<str name="col1"> doc1 </str>
<str name="col2"> rank1 </str>
<str name="id"> randomlyGeneratedId2</str>
</doc>
<doc>
<str name="col1"> doc2 </str>
<str name="col2"> rank2 </str>
<str name="id"> randomlyGeneratedId3</str>
</doc>
<doc>
<str name="col1"> doc1 </str>
<str name="col2"> rank1 </str>
<str name="id"> randomlyGeneratedId4</str>
</doc>
<doc>
<str name="col1"> doc2 </str>
<str name="col2"> rank2 </str>
<str name="id"> randomlyGeneratedId5</str>
</doc>
<doc>
<str name="col1"> doc3 </str>
<str name="col2"> rank3 </str>
<str name="id"> randomlyGeneratedId6</str>
</doc>
Here's what I have done:
1. In my solrconfig.xml:
<updateRequestProcessorChain name="autoGenId">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">doc_key</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.UpdateRequestHandler">
<lst name="defaults">
<str name="update.chain">autoGenId</str>
</lst>
</requestHandler>
2. In schema.xml:
<field name="doc_key" type="string" indexed="true" stored="true"
required="true" multiValued="false"/>
<field name="col1" type="string" indexed="true" stored="true"
required="true" multiValued="false"/>
<field name="col2" type="string" indexed="true" stored="true"
required="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>
This problem doesn't occur when I supply the id field myself instead of
using the UUIDUpdateProcessorFactory, so I assume the problem lies there.
It looks as if the csv file is processed one line at a time and the index
records every intermediate step, so each previous line shows up again in
the output.
Is there a way to avoid this 'appending of previous lines' and keep
just the 'final results', so that the total number of indexed
documents matches the number of documents in the csv file?
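To make the pattern concrete, here is a small Python sketch of what I suspect is going on. It is not Solr code: the function name simulate_index and the use_uuid flag are made up for illustration, and the mechanism itself (every line seen so far being added again after each new line) is only my assumption. It does reproduce the counts I see, 3 documents for 2 lines and 6 for 3 lines, and it also matches the fact that the duplicates vanish when each document has a fixed id, since re-adding under the same key simply overwrites.

```python
import uuid


def simulate_index(rows, use_uuid=True):
    """Mimic the suspected behaviour: after each CSV line is read, every
    line seen so far is added to the index again."""
    index = {}
    seen = []
    for row in rows:
        seen.append(row)
        for col1, col2 in seen:
            # With a fresh random id, earlier copies are never overwritten.
            # With a fixed id (here col1), re-adding replaces the old copy.
            key = str(uuid.uuid4()) if use_uuid else col1
            index[key] = {"col1": col1, "col2": col2}
    return index


rows2 = [("doc1", "rank1"), ("doc2", "rank2")]
rows3 = rows2 + [("doc3", "rank3")]

print(len(simulate_index(rows2)))                  # 3
print(len(simulate_index(rows3)))                  # 6
print(len(simulate_index(rows3, use_uuid=False)))  # 3
```

If this guess is right, n lines would give n(n+1)/2 documents with generated ids, but exactly n documents with fixed ids.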
Many thanks,
Jia