Proposal: extracting term-level stats from query process

markharw00d Thu, 11 Mar 2004 03:37:56 -0800

I think TermScorer could give useful feedback on how individual query terms 
performed if it gained a couple of new methods:
int getNumDocMatches();
float getAverageScore();
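
To make the bookkeeping behind these two methods concrete, here is a minimal standalone sketch. It is not Lucene code: the class name TermStatsAccumulator and its record() method are made up for illustration, and it assumes the scorer would call record() once for each document it scores.

```java
// Hypothetical helper, not part of Lucene: the per-term bookkeeping that
// getNumDocMatches() and getAverageScore() would need behind them.
final class TermStatsAccumulator {
    private int numDocMatches = 0;
    private double scoreSum = 0.0; // summed as double to limit rounding drift

    // Assumed to be called once per matching document, with that doc's score.
    void record(float score) {
        numDocMatches++;
        scoreSum += score;
    }

    int getNumDocMatches() {
        return numDocMatches;
    }

    // Returns 0 when the term matched no documents, avoiding a divide by zero.
    float getAverageScore() {
        return numDocMatches == 0 ? 0.0f : (float) (scoreSum / numDocMatches);
    }
}
```

The per-document cost is one increment and one addition, which is the sort of overhead I would expect to be negligible.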


These could be used in the following scenarios:
* selecting which terms to offer spelling correction on (when numDocMatches==0)
* influencing the highlighter selections (doc fragments scored based on contained 
term weights)
* for "more like this" natural-language queries, letting the highlighter highlight 
only "significantly" scored terms and ignore low-scoring noise words

The stats accumulation code that would need adding to TermScorer would add negligible 
overhead; the main issue would be how to expose the TermScorer object to users.
I had initially planned to do all of this with a new class that required no Lucene 
changes. That would have looked like this:

//wrap normal query in a new query
ProfilerQuery pq=new ProfilerQuery(anyLuceneQuery);
//run query as normal
searcher.search(pq...)
//analyze results
ProfiledTermStats[] ts=pq.getTermStats();
for(int i=0;i<ts.length;i++)
{
  System.out.println(ts[i].getTerm()+" in "+ts[i].getNumMatches()+
     " docs, ave score="+ts[i].getAverageScore() );
}

I quickly discovered this wasn't possible without requiring a change to the existing 
Lucene code.

Does anyone else find this a worthwhile change? I know it would be possible to derive 
all this information using existing APIs, but it would effectively involve another 
pass over the same index data.

