Chris K Wensel wrote:
> Hi all
>
> I'm interested in playing with term frequency values in a nutch index on a
> per document and index wide scope.
>
> for example, something similar to this lucene faq entry.
> http://tinyurl.com/ra3ys
>
> so what is the 'correct' way to inspect the nutch index for these values.
> Particularly against the lucene IndexReader behind the nutch IndexSearcher.
> Since I don't see anything on the Searcher interface, is there some other
> hadoop-ified way to do this?
>
> assuming there isn't, if I was to add the ability to get document and index
> wide term frequencies, would this be exposed on the nutch.searcher.Searcher
> interface?
>
> e.g.
>
> Searcher.getTermVector( Hit hit ) // returns a nutch friendly TermVec obj
> Searcher.getTermVector( Hit hit, String field )
> Searcher.getTermVector( String field )
>
> or is there a more relevant interface this should hang off of? Searcher
> doesn't seem like a fit, neither does HitDetailer. Maybe HitTermVector and
> IndexTermVector??
>
> or is this just insane, it won't work like I think and I should just forget
> trying to get corpus relevant info from the indexes during runtime?
>
> cheers,
> ckw
>
>
>
Hi,

For some statistical analysis I also needed term frequencies across the
whole collection. Lucene reports the term frequency only per document
(IndexReader.docFreq() gives the number of documents containing a term,
not its total number of occurrences), so I computed the collection-wide
frequency by summing the term's per-document frequencies. The code
fragment below does this:

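    /**
     * Returns the total number of occurrences of the given term
     * across all documents in the index.
     *
     * @param term the term to count
     * @return number of occurrences of the term
     * @throws IOException if the index cannot be read
     */
    private int getCount(Term term) throws IOException {
        int count = 0;
        // 'reader' is the Lucene IndexReader opened on the index
        TermDocs termDocs = reader.termDocs(term);
        try {
            // visit every document containing the term
            while (termDocs.next()) {
                count += termDocs.freq();
            }
        } finally {
            termDocs.close();
        }
        return count;
    }
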
However, this method is inefficient because it recomputes the value
every time it is called, so a caching mechanism would be useful.
Alternatively, you could build a HashMap up front and store the
<term, frequency> pairs in it.
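
A minimal sketch of the caching idea, assuming nothing about Nutch
itself: TermCountCache and its Counter interface are hypothetical names
invented for this example, not part of Lucene or Nutch. Counter stands
in for the expensive scan (for instance the getCount(Term) method
above), and the key type K would be Lucene's Term in practice, on the
assumption that it is usable as a HashMap key.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: remembers counts so each key is scanned only once. */
public class TermCountCache<K> {

    /** Stand-in for the expensive per-term scan of the index. */
    public interface Counter<K> {
        long count(K key) throws IOException;
    }

    private final Counter<K> counter;
    private final Map<K, Long> cache = new HashMap<K, Long>();

    public TermCountCache(Counter<K> counter) {
        this.counter = counter;
    }

    /** Returns the cached count, computing and storing it on first use. */
    public synchronized long getCount(K key) throws IOException {
        Long cached = cache.get(key);
        if (cached == null) {
            cached = Long.valueOf(counter.count(key));
            cache.put(key, cached);
        }
        return cached.longValue();
    }

    /** Drops all cached values, e.g. after the index has been reopened. */
    public synchronized void clear() {
        cache.clear();
    }
}
```

The cache is only valid for as long as the underlying index does not
change, so it should be cleared whenever the reader is reopened.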
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general