/magnus
<%@ page import="org.apache.lucene.index.IndexReader, org.apache.lucene.document.Document, com.technohuman.search.language.SwedishAnalyzer, java.io.StringReader, org.apache.lucene.analysis.TokenStream, org.apache.lucene.analysis.Token, org.apache.lucene.index.Term, org.apache.lucene.index.TermEnum, java.util.*"%>
<%!
// A keyword candidate and its tf*idf score.
class Entry implements Comparable {
    public double score;
    public String termText;

    public Entry(double score, String termText) {
        this.score = score;
        this.termText = termText;
    }

    // Order entries by descending score so the best keywords come first.
    public int compareTo(Object o) {
        Entry e = (Entry) o;
        if (e.score < score) return -1;
        else if (e.score > score) return 1;
        else return 0;
    }
}
%>
<%
IndexReader reader = IndexReader.open(application.getRealPath("/WEB-INF/index"));
Document d = reader.document(Integer.parseInt(request.getParameter("docId")));

// Count all terms in the description field of the given document.
Map m = new HashMap();
String description = d.getField("Parser.DESCRIPTION").stringValue();
final java.io.Reader r = new StringReader(description);
final TokenStream in = new SwedishAnalyzer().tokenStream(r);
for (;;) {
    final Token token = in.next();
    if (token == null) {
        break;
    }
    if (m.containsKey(token.termText())) {
        int a = ((Integer) m.get(token.termText())).intValue();
        m.put(token.termText(), new Integer(a + 1));
    } else {
        m.put(token.termText(), new Integer(1));
    }
}
in.close();

// Calculate inverse document frequency * term frequency for each term.
// Note the cast to double: with integer division the idf would be truncated.
ArrayList tm = new ArrayList();
Iterator it = m.keySet().iterator();
while (it.hasNext()) {
    String termText = (String) it.next();
    TermEnum te = reader.terms(new Term("Parser.DESCRIPTION", termText));
    double idf = Math.log((double) reader.numDocs() / (te.docFreq() + 1)) + 1;
    te.close();
    double tf = Math.sqrt(((Integer) m.get(termText)).intValue());
    tm.add(new Entry(idf * tf, termText));
}
Collections.sort(tm);

// Print the keywords and their scores, highest score first.
Iterator it2 = tm.iterator();
while (it2.hasNext()) {
    Entry e = (Entry) it2.next();
    out.println(e.score + " " + e.termText + "<br />");
}
reader.close();
%>
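To actually use the keywords for a "find similar documents" search, the sorted entries can be fed straight back into a query. A rough, untested sketch against the same old Lucene API (it would need org.apache.lucene.search.* added to the page imports, the cutoff of ten keywords is an arbitrary choice, and it would slot in just before reader.close() above):

// Build an OR query from the ten highest-scoring keywords and search the
// same index for similar documents (the source document itself will also
// show up as a hit and may need to be skipped).
BooleanQuery similar = new BooleanQuery();
int max = Math.min(10, tm.size());
for (int i = 0; i < max; i++) {
    Entry e = (Entry) tm.get(i);
    // optional clause: not required, not prohibited
    similar.add(new TermQuery(new Term("Parser.DESCRIPTION", e.termText)), false, false);
}
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.search(similar);
for (int i = 0; i < hits.length(); i++) {
    out.println(hits.score(i) + " doc " + hits.id(i) + "<br />");
}
searcher.close();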
Rociel Buico wrote:
Hello Magnus,
can I ask for your sample script?
--buics
Hi Peter
If the original document is available, you could extract keywords from it at query time. That is, when someone asks for documents similar to document A, you re-analyze document A and, in combination with statistics from the Lucene index, extract keywords from it that can then be used as a query for finding similar documents.
I've got some sample code if anyone is interested.
/magnus
Peter Becker wrote:
Hi Terry,
we have been thinking about the same problem, and in the end we decided that most likely the only good solution is to keep a non-inverted index, i.e. a map from the documents to the terms. Then you can look up the terms for a document and query for other documents matching parts of them (which raises the usual question of what is actually interesting: high frequency, low frequency, or the mid range).
Indexing would probably be quite expensive, since Lucene doesn't seem to support in-place changes to the index and the term map would change all the time. We haven't implemented it yet, but it shouldn't be hard to code. I just wouldn't expect good performance when indexing large collections.
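A minimal sketch of the non-inverted map described above, just to make the idea concrete; class and method names are placeholders, and it uses the same pre-2.0 analyzer API as the JSP sample:

import java.io.*;
import java.util.*;
import org.apache.lucene.analysis.*;

// Non-inverted ("forward") index: document id -> (term -> frequency).
// Kept alongside the normal Lucene index; the inner map is what would be
// looked up later to build a "similar documents" query.
public class ForwardIndex {
    private final Map docTerms = new HashMap();   // Integer -> Map(String -> Integer)

    public void add(Integer docId, String text, Analyzer analyzer) throws IOException {
        Map counts = new HashMap();
        TokenStream ts = analyzer.tokenStream(new StringReader(text));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            Integer old = (Integer) counts.get(t.termText());
            counts.put(t.termText(), new Integer(old == null ? 1 : old.intValue() + 1));
        }
        ts.close();
        docTerms.put(docId, counts);
    }

    public Map termsFor(Integer docId) {
        return (Map) docTerms.get(docId);
    }
}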
Peter
Terry Steichen wrote:
Is it possible without extensive additional coding to use Lucene to conduct a search based on a document rather than a query? (One use of this would be to refine a search by selecting one of the hits returned from the initial query and subsequently retrieving other documents "like" the selected one.)
Regards,
Terry
