Re: term frequency

Enis Soztutar Tue, 26 Sep 2006 08:54:46 -0700

Chris K Wensel wrote:

Hi all


I'm interested in playing with term frequency values in a nutch index on a
per document and index wide scope.

for example, something similar to this lucene faq entry.
http://tinyurl.com/ra3ys

so what is the 'correct' way to inspect the nutch index for these values?
Particularly against the lucene IndexReader behind the nutch IndexSearcher.
Since I don't see anything on the Searcher interface, is there some other
hadoop-ified way to do this?

assuming there isn't, if I was to add the ability to get document and index
wide term frequencies, would this be exposed on the nutch.searcher.Searcher
interface? e.g.

Searcher.getTermVector( Hit hit ) // returns a nutch friendly TermVec obj
Searcher.getTermVector( Hit hit, String field )
Searcher.getTermVector( String field )

or is there a more relevant interface this should hang off of? Searcher
doesn't seem like a fit, neither does HitDetailer. Maybe HitTermVector and
IndexTermVector??

or is this just insane, it won't work like I think and I should just forget
trying to get corpus relevant info from the indexes during runtime?

cheers,
ckw

Hi,

For some statistical analysis, I also needed term frequencies across the whole collection. Since Lucene only gives the term frequency per document, I calculated the collection-wide frequency by summing the term's frequencies over all documents that contain it. The code fragment below does this:

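   /**
    * Returns the total number of occurrences of the given term across
    * all documents in the index.
    * @param term the term to count
    * @return number of occurrences of the term
    * @throws IOException if the index cannot be read
    */
   private int getCount(Term term) throws IOException {
       int count = 0;
       // reader is assumed to be the open Lucene IndexReader for the index
       TermDocs termDocs = reader.termDocs(term);
       try {
           while (termDocs.next()) {
               count += termDocs.freq();   // frequency within the current document
           }
       } finally {
           termDocs.close();
       }
       return count;
   }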

However, this method is inefficient, since it recalculates the value every time it is called, so a caching mechanism will prove useful. Alternatively, you may build a HashMap up front and store the <term, frequency> pairs in it.
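
The caching idea could look roughly like the sketch below. It is only an illustration, not code from Nutch or Lucene: TermCountCache and its Counter interface are hypothetical names, and the counter stands in for an expensive lookup such as getCount(Term). It also assumes the key type (for example Lucene's Term) implements equals and hashCode so it can serve as a map key.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: remembers per-term totals so each term is counted once. */
class TermCountCache<K> {

    /** Stand-in for an expensive lookup such as getCount(Term). */
    interface Counter<K> {
        int count(K key) throws IOException;
    }

    private final Map<K, Integer> cache = new HashMap<K, Integer>();
    private final Counter<K> counter;

    TermCountCache(Counter<K> counter) {
        this.counter = counter;
    }

    /** Returns the cached total, computing it only on the first request. */
    int get(K key) throws IOException {
        Integer cached = cache.get(key);
        if (cached == null) {
            cached = counter.count(key);   // the slow path runs once per key
            cache.put(key, cached);
        }
        return cached;
    }
}
```

This sketch is not thread-safe, and cached values go stale if the index changes, so the cache would have to be discarded whenever the reader is reopened.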
