hzhong wrote:
Hello,
This is what I want to do: given a document, find all of its terms and
their frequencies.
I understand that Nutch is built on top of Lucene. In Lucene, I can access
a document's terms and their frequencies via the IndexReader. However,
in Nutch, I am not sure if there's an equivalent. In Lucene, the IndexReader
needs to know where the inverted indexes are. In Nutch, I am not sure how
and where to locate the inverted indexes.
Is it possible to access the inverted index from Nutch?

What you need is called a "term vector". Nutch doesn't support this out of
the box, but it's relatively easy to add. You would have to modify
org.apache.nutch.searcher.Searcher to add a method that retrieves a
term vector, and implement this method in
org.apache.nutch.searcher.IndexSearcher using Lucene classes.
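
To make the idea concrete, here is a minimal plain-Java sketch of what a
term vector holds: the terms of one document mapped to their frequencies.
The class TermVectorSketch and its naive tokenizer are hypothetical
stand-ins, not Nutch or Lucene code. In Lucene releases of this era, a real
implementation would more likely delegate to
IndexReader.getTermFreqVector(int, String), which only returns data for
fields that were indexed with term vectors enabled.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not a Lucene or Nutch class.
final class TermVectorSketch {

    // Builds a term -> frequency map for one document's text, sorted by term.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new TreeMap<>();
        // Naive analysis (assumption): lowercase, then split on anything that
        // is not a letter or digit. A real Lucene analyzer would behave differently.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Integer> tv =
            termFrequencies("Nutch is built on top of Lucene. Lucene stores the index.");
        // Print one "term<TAB>frequency" pair per line.
        for (Map.Entry<String, Integer> e : tv.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

A method added to Searcher could return something of this shape (parallel
arrays of terms and counts would work equally well), with IndexSearcher
filling it from the Lucene index instead of re-tokenizing the text.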
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com