Gerard Sychay wrote:
Hmm, the term vector does not have to consist of only term frequencies, does it? To give weight to rare terms, could you create a term vector of (TF*IDF) values for each term? Then, a distance function would measure how many terms two vectors have in common, giving weight to how many rare terms two vectors have in common.
Yeah, but if you're gonna do that why not just form a query with all words in the source document, and let the Lucene engine do the idf/tf calculations? I've done this and it seems to work fine.
Here's the code I've used. It could be done better by avoiding QueryParser, and odds are it will hit the exception for too many clauses in a boolean query unless you raise Lucene's default clause limit, but this is the idea. "srch" is the entire body of the source document.
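// Builds a query from every token in the source text, so that Lucene
// applies its own tf/idf weighting when scoring matches.
public static Query formSimilarQuery( String srch, Analyzer a)
    throws org.apache.lucene.queryParser.ParseException, IOException
{
    StringBuffer sb = new StringBuffer();
    // "foo" is just a placeholder field name handed to the analyzer.
    TokenStream ts = a.tokenStream( "foo", new StringReader( srch));
    org.apache.lucene.analysis.Token t;
    while ( (t = ts.next()) != null)
    {
        sb.append( t.termText() + " ");
    }
    // DFields.CONTENTS is the default field the parsed query searches.
    return QueryParser.parse( sb.toString(), DFields.CONTENTS, a);
}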
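On the too-many-clauses risk: one way to stay under the limit without reconfiguring Lucene is to trim the token list before building the query string. The sketch below is plain Java with no Lucene dependency. The class name QueryTermLimiter, the method distinctTerms, and the maxClauses parameter are all made up for illustration; the right value for maxClauses depends on how your Lucene instance is configured.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class QueryTermLimiter {
    // Returns the distinct tokens in first-seen order, keeping at most
    // maxClauses of them, so the resulting query has a bounded clause count.
    static List<String> distinctTerms(List<String> tokens, int maxClauses) {
        Set<String> seen = new LinkedHashSet<>();
        for (String t : tokens) {
            if (seen.size() >= maxClauses) {
                break; // limit reached; remaining tokens are ignored
            }
            seen.add(t); // no-op when the token was already collected
        }
        return new ArrayList<>(seen);
    }
}
```

You would collect the termText() values into a list, pass them through distinctTerms, and join the result with spaces before parsing. Two trade-offs to be aware of: dropping duplicates means a term repeated in the source no longer contributes repeated clauses, and keeping the first N terms is arbitrary. Keeping the rarest terms instead would probably give better matches.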
David Spencer <[EMAIL PROTECTED]> 06/01/04 08:25PM >>>
Erik Hatcher wrote:
On Jun 1, 2004, at 4:41 PM, uddam chukmol wrote:
Well, a question again, how does Lucene compute the score between a
document and a query?
And I might add, thus, this approach to similarity gives more weight to
rare terms that match, which one might want for this kind of similarity
measure.
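For comparison, Gerard's TF*IDF term-vector idea can be sketched outside Lucene in a few lines. This is only an illustration of how rare shared terms end up dominating the similarity, not how Lucene scores internally. The class and method names (TfIdfSketch, weigh, cosine) are invented, the idf smoothing used here is one common choice among several, and the document frequencies are assumed to be supplied by the caller.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Lucene code.
class TfIdfSketch {
    // Builds a term -> TF*IDF weight map for one document.
    // docFreq: number of documents containing each term; numDocs: corpus size.
    static Map<String, Double> weigh(List<String> tokens,
                                     Map<String, Integer> docFreq,
                                     int numDocs) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum); // count occurrences of each term
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            // Assumed smoothing: rarer terms (small df) get a larger idf.
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    // Cosine similarity between two sparse weight vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared terms contribute
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With a corpus of 100 documents where "the" appears in all of them and "lucene" in only two, a document that shares "lucene" with the source scores far higher than one that shares only "the", which is the rare-term weighting being discussed.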
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
