/magnus
<%@ page import="org.apache.lucene.index.IndexReader, org.apache.lucene.document.Document, com.technohuman.search.language.SwedishAnalyzer, java.io.StringReader, org.apache.lucene.analysis.TokenStream, org.apache.lucene.analysis.Token, org.apache.lucene.index.Term, org.apache.lucene.index.TermEnum, java.util.*"%>
<%!
// A keyword candidate and its tf*idf score.
class Entry implements Comparable {
    public double score;
    public String termText;

    public Entry(double score, String termText) {
        this.score = score;
        this.termText = termText;
    }

    // Order entries by descending score so the best keywords come first.
    public int compareTo(Object o) {
        Entry e = (Entry) o;
        if (e.score < score) return -1;
        else if (e.score > score) return 1;
        else return 0;
    }
}
%>
<%
IndexReader reader = IndexReader.open(application.getRealPath("/WEB-INF/index"));
Document d = reader.document(Integer.parseInt(request.getParameter("docId")));

// Count all terms in the description field of the given document.
Map m = new HashMap();
String description = d.getField("Parser.DESCRIPTION").stringValue();
final java.io.Reader r = new StringReader(description);
final TokenStream in = new SwedishAnalyzer().tokenStream(r);
for (;;) {
    final Token token = in.next();
    if (token == null) {
        break;
    }
    if (m.containsKey(token.termText())) {
        int a = ((Integer) m.get(token.termText())).intValue();
        m.put(token.termText(), new Integer(a + 1));
    } else {
        m.put(token.termText(), new Integer(1));
    }
}
in.close();

// Calculate inverse document frequency * term frequency for each term.
// Note the cast to double: with integer division the idf would be truncated.
ArrayList tm = new ArrayList();
Iterator it = m.keySet().iterator();
while (it.hasNext()) {
    String termText = (String) it.next();
    TermEnum te = reader.terms(new Term("Parser.DESCRIPTION", termText));
    double idf = Math.log((double) reader.numDocs() / (te.docFreq() + 1)) + 1;
    te.close();
    double tf = Math.sqrt(((Integer) m.get(termText)).intValue());
    tm.add(new Entry(idf * tf, termText));
}
Collections.sort(tm);

// Print the keywords and their scores, highest score first.
Iterator it2 = tm.iterator();
while (it2.hasNext()) {
    Entry e = (Entry) it2.next();
    out.println(e.score + " " + e.termText + "<br />");
}
reader.close();
%>
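To actually use the keywords for a "find similar documents" search, the sorted entries can be fed straight back into a query. A rough, untested sketch against the same old Lucene API (it would need org.apache.lucene.search.* added to the page imports, the cutoff of ten keywords is an arbitrary choice, and it would slot in just before reader.close() above):

// Build an OR query from the ten highest-scoring keywords and search the
// same index for similar documents (the source document itself will also
// show up as a hit and may need to be skipped).
BooleanQuery similar = new BooleanQuery();
int max = Math.min(10, tm.size());
for (int i = 0; i < max; i++) {
    Entry e = (Entry) tm.get(i);
    // optional clause: not required, not prohibited
    similar.add(new TermQuery(new Term("Parser.DESCRIPTION", e.termText)), false, false);
}
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.search(similar);
for (int i = 0; i < hits.length(); i++) {
    out.println(hits.score(i) + " doc " + hits.id(i) + "<br />");
}
searcher.close();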
Rociel Buico wrote:
Hello Magnus,
can I ask for your sample script?
--buics
Hi Peter
If the original document is available, you could extract keywords from it at query time. That is, when someone asks for documents similar to document A, you re-analyze document A and, in combination with statistics from the Lucene index, extract keywords from it that can then be used as a query for finding similar documents.
I've got some sample code if anyone is interested.
/magnus
Peter Becker wrote:
Hi Terry,
we have been thinking about the same problem, and in the end we decided that most likely the only good solution is to keep a non-inverted index, i.e. a map from the documents to the terms. Then you can look up the terms for a document and query for other documents matching parts of them (which raises the usual question of what is actually interesting: high frequency, low frequency, or the mid range).
Indexing would probably be quite expensive, since Lucene doesn't seem to support in-place changes to the index and the term map would change all the time. We haven't implemented it yet, but it shouldn't be hard to code. I just wouldn't expect good performance when indexing large collections.
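A minimal sketch of the non-inverted map described above, just to make the idea concrete; class and method names are placeholders, and it uses the same pre-2.0 analyzer API as the JSP sample:

import java.io.*;
import java.util.*;
import org.apache.lucene.analysis.*;

// Non-inverted ("forward") index: document id -> (term -> frequency).
// Kept alongside the normal Lucene index; the inner map is what would be
// looked up later to build a "similar documents" query.
public class ForwardIndex {
    private final Map docTerms = new HashMap();   // Integer -> Map(String -> Integer)

    public void add(Integer docId, String text, Analyzer analyzer) throws IOException {
        Map counts = new HashMap();
        TokenStream ts = analyzer.tokenStream(new StringReader(text));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            Integer old = (Integer) counts.get(t.termText());
            counts.put(t.termText(), new Integer(old == null ? 1 : old.intValue() + 1));
        }
        ts.close();
        docTerms.put(docId, counts);
    }

    public Map termsFor(Integer docId) {
        return (Map) docTerms.get(docId);
    }
}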
Peter
Terry Steichen wrote:
Is it possible without extensive additional coding to use Lucene to conduct a search based on a document rather than a query? (One use of this would be to refine a search by selecting one of the hits returned from the initial query and subsequently retrieving other documents "like" the selected one.)
Regards,
Terry
