> From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]]
> 
> I see. This information is definitely available, but you'll have to
> extract it yourself. The key will be the TermPositions enumerations
> that you can get for each term in your phrase. Then you'd walk down
> each of these TermPositions to find documents where all of the terms
> in your phrase occur. Then you'd look at the positions in which these
> terms occur and decide if they form a phrase or not. If so, you count
> a hit and move on. This is essentially what the PhraseQuery does.
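
For reference, the walk described above can be sketched roughly as follows. This is only an illustration, not Lucene code: plain in-memory maps stand in for the TermPositions enumerations, and the class and method names (PhraseFreqSketch, phraseFrequencies) are made up for the example. It assumes an exact phrase with no slop, so term i must appear exactly i positions after the first term.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PhraseFreqSketch {
    /**
     * postings.get(i) maps a document id to the sorted positions of the
     * i-th phrase term in that document (a stand-in for TermPositions).
     * Returns document id -> number of exact phrase occurrences, listing
     * only documents in which the phrase occurs at least once.
     */
    public static Map<Integer, Integer> phraseFrequencies(List<Map<Integer, int[]>> postings) {
        Map<Integer, Integer> result = new TreeMap<>();
        if (postings.isEmpty()) {
            return result;
        }
        // Every phrase occurrence starts at some position of the first term.
        for (Map.Entry<Integer, int[]> entry : postings.get(0).entrySet()) {
            int doc = entry.getKey();
            int freq = 0;
            for (int start : entry.getValue()) {
                boolean match = true;
                // Term i must occur at position start + i in the same document.
                for (int i = 1; i < postings.size() && match; i++) {
                    int[] positions = postings.get(i).get(doc);
                    match = positions != null
                            && Arrays.binarySearch(positions, start + i) >= 0;
                }
                if (match) {
                    freq++;
                }
            }
            if (freq > 0) {
                result.put(doc, freq);
            }
        }
        return result;
    }
}
```

With a real index the position arrays would come from walking the TermPositions of each term in step rather than from maps held in memory.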

I think it would be easier to piggyback off the code in Lucene which already
does this.  The PhraseScorer class (package private) computes phrase
frequencies internally and uses them to compute the score for the phrase.
One could write a new PhraseScorer method much like PhraseScorer.score()
that directly computes frequencies, e.g.:
  public interface FreqCollector { void collect(int doc, int freq); }
  public void getFrequencies(FreqCollector collector);

The cleanest way to make this public would probably be to add something like
the following to PhraseQuery:
  public void getFrequencies(IndexReader reader, FreqCollector collector);
This could construct a PhraseScorer and call its getFrequencies() method.
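
To show how a caller might use such a callback, here is a rough standalone sketch. Nothing in it is existing Lucene API: FreqCollector is the interface proposed above, and the getFrequencies() below is a hypothetical stand-in that just replays precomputed (doc, freq) pairs where the real method would walk the PhraseScorer.

```java
public class FreqCollectorSketch {
    /** The proposed callback: receives one (doc, freq) pair per matching document. */
    public interface FreqCollector {
        void collect(int doc, int freq);
    }

    /**
     * Hypothetical stand-in for the proposed getFrequencies(). Each row of
     * docFreqPairs is {doc, freq}; documents with a zero frequency are
     * skipped, since they do not contain the phrase.
     */
    public static void getFrequencies(int[][] docFreqPairs, FreqCollector collector) {
        for (int[] pair : docFreqPairs) {
            if (pair[1] > 0) {
                collector.collect(pair[0], pair[1]);
            }
        }
    }

    /** Example collector: counts matching documents and sums phrase occurrences. */
    public static class TotalCollector implements FreqCollector {
        public int docs;
        public long total;

        @Override
        public void collect(int doc, int freq) {
            docs++;
            total += freq;
        }
    }
}
```

A caller would create a TotalCollector, pass it to getFrequencies(), and read its docs and total fields afterwards. The callback style keeps the caller from having to hold all the per-document frequencies in memory at once.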

Does that make sense?

Doug
