Dear Lucene/Solr developers,
I am developing a Solr plugin to extract the main keywords from each article. Since Solr already does the heavy lifting of calculating tf-idf scores, I decided to reuse those scores for better performance. I know that an UpdateRequestProcessor is the usual extension point for adding a keyword value to documents, but I found out that tf-idf scores are not accessible inside an UpdateRequestProcessor, because the processor chain runs before the scores are calculated. Hence, after consulting with Solr/Lucene developers, I decided to use a SearchComponent instead and calculate the keywords from tf-idf (Lucene's MoreLikeThis "interesting terms") on commit/optimize.
Unfortunately, with this approach I observe strange core behavior: for example, faceting sometimes does not work on the keyword field, or the index becomes unstable in search results.
I would really appreciate it if someone could help me make it stable. The relevant code follows:
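For context, here is a simplified, self-contained illustration of the tf-idf weighting that MoreLikeThis ranks "interesting terms" by (this sketch mirrors Lucene's classic similarity, where tf = sqrt(freq) and idf = 1 + ln(N / (df + 1)); it is not the actual Lucene code):

```java
// Simplified tf-idf sketch, assuming the classic Lucene formulas:
//   tf(t, d)  = sqrt(term frequency in the document)
//   idf(t)    = 1 + ln(numDocs / (docFreq + 1))
// MoreLikeThis ranks candidate keywords by a weight of this shape.
public class TfIdfSketch {
    public static double tfIdf(int termFreq, int docFreq, int numDocs) {
        double tf = Math.sqrt(termFreq);
        double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term appearing in 10 of 100 docs vs. one appearing in 51 of 100:
        // the rarer term gets the higher weight.
        System.out.println(tfIdf(4, 9, 100));
        System.out.println(tfIdf(4, 50, 100));
    }
}
```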
NamedList<Object> response = new SimpleOrderedMap<>();
keyword.init(searcher, params);

// Select documents whose keyword field still holds the "noval" placeholder
// and whose source fields do not.
BooleanQuery query = new BooleanQuery();
for (String fieldName : keywordSourceFields) {
  query.add(new TermQuery(new Term(fieldName, "noval")), Occur.MUST_NOT);
}
query.add(new TermQuery(new Term(keywordField, "noval")), Occur.MUST);

RefCounted<IndexWriter> iw = null;
try {
  TopDocs results = searcher.search(query, maxNumDocs);
  ScoreDoc[] hits = results.scoreDocs;
  iw = solrCoreState.getIndexWriter(core);
  IndexWriter writer = iw.get();
  FieldType type = new FieldType(StringField.TYPE_STORED);
  String uniqueKey = searcher.getSchema().getUniqueKeyField().getName();
  for (int i = 0; i < hits.length; i++) {
    Document document = searcher.doc(hits[i].doc);
    List<String> keywords = keyword.getKeywords(hits[i].doc);
    if (!keywords.isEmpty()) {
      document.removeFields(keywordField);
    }
    for (String word : keywords) {
      document.add(new Field(keywordField, word, type));
    }
    writer.updateDocument(new Term(uniqueKey, document.get(uniqueKey)),
        document);
  }
  response.add("Number of Selected Docs", results.totalHits);
  writer.commit();
} catch (IOException | SyntaxError e) {
  throw new RuntimeException(e); // keep the cause instead of swallowing it
} finally {
  if (iw != null) {
    iw.decref();
  }
}
public List<String> getKeywords(int docId) throws SyntaxError {
  // Configure MoreLikeThis to extract the top tf-idf terms of this document.
  String[] fields = keywordSourceFields.toArray(
      new String[keywordSourceFields.size()]);
  mlt.setFieldNames(fields);
  mlt.setAnalyzer(indexSearcher.getSchema().getIndexAnalyzer());
  mlt.setMinTermFreq(minTermFreq);
  mlt.setMinDocFreq(minDocFreq);
  mlt.setMinWordLen(minWordLen);
  mlt.setMaxQueryTerms(maxNumKeywords);
  mlt.setMaxNumTokensParsed(maxTokensParsed);
  try {
    return Arrays.asList(mlt.retrieveInterestingTerms(docId));
  } catch (IOException e) {
    LOGGER.error("Failed to retrieve interesting terms", e);
    throw new RuntimeException(e);
  }
}
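For reference, the code above is triggered on commit/optimize through an update event listener registered in solrconfig.xml, along these lines (the class name below is a placeholder for the actual plugin class, not something from my config):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Placeholder class name; substitute the real keyword-extraction plugin. -->
  <listener event="postCommit" class="com.example.KeywordExtractorListener"/>
  <listener event="postOptimize" class="com.example.KeywordExtractorListener"/>
</updateHandler>
```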
Best regards.
--
A.Nazemian