hzhong wrote:
> Hello,
>
> This is what I want to do. Given a document, find all its terms and
> frequencies.
>
> I understand that Nutch is built on top of Lucene. In Lucene, I can access
> the terms and their frequencies of a document via the indexreader. However,
> in nutch, I am not sure if there's an equivalent. In Lucene, indexreader
> needs to know where the inverted indexes are. In Nutch, I am not sure how
> and where to locate the inverted indexes.
>
> Is it possible to access the inverted index from Nutch?
>
What you need is called a "term vector". Nutch doesn't support this out of the box, but it's relatively easy to add. You would have to modify org.apache.nutch.searcher.Searcher to add a method that retrieves the term vector, and then implement this method in org.apache.nutch.searcher.IndexSearcher using Lucene classes.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
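
To make the term "term vector" concrete: for a single document it is essentially the list of distinct terms together with how often each one occurs. The sketch below only illustrates that data shape in plain Java. It is not Nutch or Lucene code; the class name TermVectorSketch and the method termFrequencies are hypothetical, and the naive lowercase/split tokenization is an assumption that will not match what a real Nutch analyzer produces. A real implementation would read the stored term vector through Lucene's IndexReader rather than re-tokenizing text, which in turn assumes the field was indexed with term vectors enabled.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: hypothetical class, not part of Nutch or Lucene.
final class TermVectorSketch {

    // Returns a map from each distinct term to its frequency in the text.
    // Tokenization is deliberately naive (assumption): lowercase the text,
    // then split on any run of characters that are not letters or digits.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new TreeMap<>(); // sorted by term
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Integer> tv =
            termFrequencies("Nutch is built on Lucene. Lucene stores term vectors.");
        // Print one "term<TAB>frequency" pair per line.
        for (Map.Entry<String, Integer> e : tv.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

A method added to Searcher along the lines suggested above could plausibly return the same kind of term-to-frequency mapping, but the actual signature is a design choice left to whoever makes the change.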
