Dear Lucene/Solr developers,
I am developing a Solr plugin to extract the main keywords from each article. Since Solr already does the heavy lifting of calculating tf-idf scores, I decided to reuse those scores for better performance. I know that an UpdateRequestProcessor is the usual extension point for adding a keyword value to documents, but I found out that tf-idf scores are not accessible inside an UpdateRequestProcessor, because the processor chain runs before the scores are calculated. Hence, after consulting with Solr/Lucene developers, I decided to use a SearchComponent instead and calculate the keywords from tf-idf (Lucene's MoreLikeThis "interesting terms") on commit/optimize.
Unfortunately, with this approach I observe strange core behavior: for example, faceting sometimes does not work on the keyword field, or the index becomes unstable in search results.
I would really appreciate it if someone could help me make it stable. The relevant code follows:
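For context, here is a simplified, self-contained illustration of the tf-idf weighting that MoreLikeThis ranks "interesting terms" by (this sketch mirrors Lucene's classic similarity, where tf = sqrt(freq) and idf = 1 + ln(N / (df + 1)); it is not the actual Lucene code):

```java
// Simplified tf-idf sketch, assuming the classic Lucene formulas:
//   tf(t, d)  = sqrt(term frequency in the document)
//   idf(t)    = 1 + ln(numDocs / (docFreq + 1))
// MoreLikeThis ranks candidate keywords by a weight of this shape.
public class TfIdfSketch {
    public static double tfIdf(int termFreq, int docFreq, int numDocs) {
        double tf = Math.sqrt(termFreq);
        double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term appearing in 10 of 100 docs vs. one appearing in 51 of 100:
        // the rarer term gets the higher weight.
        System.out.println(tfIdf(4, 9, 100));
        System.out.println(tfIdf(4, 50, 100));
    }
}
```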
NamedList<Object> response = new SimpleOrderedMap<>();
keyword.init(searcher, params);

// Select documents whose keyword field still holds the "noval" placeholder
// and whose source fields do not.
BooleanQuery query = new BooleanQuery();
for (String fieldName : keywordSourceFields) {
  query.add(new TermQuery(new Term(fieldName, "noval")), Occur.MUST_NOT);
}
query.add(new TermQuery(new Term(keywordField, "noval")), Occur.MUST);

RefCounted<IndexWriter> iw = null;
try {
  TopDocs results = searcher.search(query, maxNumDocs);
  ScoreDoc[] hits = results.scoreDocs;
  iw = solrCoreState.getIndexWriter(core);
  IndexWriter writer = iw.get();
  FieldType type = new FieldType(StringField.TYPE_STORED);
  String uniqueKey = searcher.getSchema().getUniqueKeyField().getName();
  for (int i = 0; i < hits.length; i++) {
    Document document = searcher.doc(hits[i].doc);
    List<String> keywords = keyword.getKeywords(hits[i].doc);
    if (!keywords.isEmpty()) {
      document.removeFields(keywordField);
    }
    for (String word : keywords) {
      document.add(new Field(keywordField, word, type));
    }
    writer.updateDocument(new Term(uniqueKey, document.get(uniqueKey)),
        document);
  }
  response.add("Number of Selected Docs", results.totalHits);
  writer.commit();
} catch (IOException | SyntaxError e) {
  throw new RuntimeException(e); // keep the cause instead of swallowing it
} finally {
  if (iw != null) {
    iw.decref();
  }
}
public List<String> getKeywords(int docId) throws SyntaxError {
  // Configure MoreLikeThis to extract the top tf-idf terms of this document.
  String[] fields = keywordSourceFields.toArray(
      new String[keywordSourceFields.size()]);
  mlt.setFieldNames(fields);
  mlt.setAnalyzer(indexSearcher.getSchema().getIndexAnalyzer());
  mlt.setMinTermFreq(minTermFreq);
  mlt.setMinDocFreq(minDocFreq);
  mlt.setMinWordLen(minWordLen);
  mlt.setMaxQueryTerms(maxNumKeywords);
  mlt.setMaxNumTokensParsed(maxTokensParsed);
  try {
    return Arrays.asList(mlt.retrieveInterestingTerms(docId));
  } catch (IOException e) {
    LOGGER.error("Failed to retrieve interesting terms", e);
    throw new RuntimeException(e);
  }
}
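For reference, the code above is triggered on commit/optimize through an update event listener registered in solrconfig.xml, along these lines (the class name below is a placeholder for the actual plugin class, not something from my config):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Placeholder class name; substitute the real keyword-extraction plugin. -->
  <listener event="postCommit" class="com.example.KeywordExtractorListener"/>
  <listener event="postOptimize" class="com.example.KeywordExtractorListener"/>
</updateHandler>
```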
Best regards.
--
A.Nazemian