Chris K Wensel wrote:
Hi all

I'm interested in playing with term frequency values in a nutch index on a
per document and index wide scope.

for example, something similar to this lucene faq entry.
http://tinyurl.com/ra3ys

so what is the 'correct' way to inspect the nutch index for these values?
Particularly against the lucene IndexReader behind the nutch IndexSearcher.
Since I don't see anything on the Searcher interface, is there some other
hadoop-ified way to do this?

assuming there isn't, if I was to add the ability to get document and index
wide term frequencies, would this be exposed on the nutch.searcher.Searcher
interface? e.g.
Searcher.getTermVector( Hit hit ) // returns a nutch friendly TermVec obj
Searcher.getTermVector( Hit hit, String field )
Searcher.getTermVector( String field )

or is there a more relevant interface this should hang off of? Searcher
doesn't seem like a fit, neither does HitDetailer. Maybe HitTermVector and
IndexTermVector??

or is this just insane, it won't work like I think and I should just forget
trying to get corpus relevant info from the indexes during runtime?

cheers,
ckw


Hi,

For some statistical analysis, I also needed term frequencies across the whole collection. Since Lucene only gives the term frequency per document, I calculated the collection-wide frequency by
summing the per-document frequencies of the term. The code fragment below does this:

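   /**
    * Returns the total number of occurrences of the given term,
    * summed over all documents in the index.
    * @param term the term to count
    * @return the number of occurrences of the term
    * @throws IOException if the index cannot be read
    */
   private int getCount(Term term) throws IOException {
       int count = 0;
       TermDocs termDocs = reader.termDocs(term);
       try {
           // Add up the within-document frequency for each matching document.
           while (termDocs.next()) {
               count += termDocs.freq();
           }
       } finally {
           termDocs.close(); // release the enumeration even if reading fails
       }
       return count;
   }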


But this method is inefficient, since it recalculates the value every time it is called, so a caching mechanism would be useful. Alternatively, you could build a HashMap up front and store the <term, frequency> pairs in it.
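
The caching idea could look roughly like the sketch below. It is deliberately independent of Lucene so it can be run on its own: TermCountCache and its Counter interface are made-up names for illustration, and Counter stands in for a method like getCount(Term) above, with the term passed as a (field, text) pair. This is an assumption-laden sketch, not tested against a real Nutch index.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper: remembers total term counts so that the postings
 * for a given term are scanned at most once.
 */
class TermCountCache {

    /** Stand-in for the expensive lookup, e.g. a wrapper around getCount(Term). */
    public interface Counter {
        int count(String field, String text) throws IOException;
    }

    private final Counter counter;
    private final Map<String, Integer> cache = new HashMap<String, Integer>();

    public TermCountCache(Counter counter) {
        this.counter = counter;
    }

    /** Returns the cached count, computing and storing it on first use. */
    public synchronized int getCount(String field, String text) throws IOException {
        // Join field and text with a NUL separator so that, assuming neither
        // contains a NUL character, ("a", "bc") and ("ab", "c") stay distinct.
        String key = field + '\u0000' + text;
        Integer cached = cache.get(key);
        if (cached == null) {
            cached = counter.count(field, text);
            cache.put(key, cached);
        }
        return cached;
    }

    /** Drops all entries; call this after the index is reopened or changed. */
    public synchronized void clear() {
        cache.clear();
    }
}
```

The cached values are only valid for the index state they were computed from, so the cache should be cleared whenever the underlying reader is reopened. For a large vocabulary the map can grow without bound, so a size limit may be worth adding.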



