Hi,

For our Nutch deployment, I need access to the cached versions of the crawled documents, so that I can run a post-processing task over all of them.
I was wondering whether this is the right way to go about it:

1. Use the getContent() method of the FetchedSegments class to get the content of text/html documents.
2. Use the getParseText() method of the FetchedSegments class to get the text of documents in other formats.

Since FetchedSegments is a class under Nutch.Searcher, would this only help in retrieving the documents returned by a search (i.e., HitResults), or is there a way to get all the indexed documents? Or is there a simpler or better way than this?

Regards,
Venkateshprasanna.

--
View this message in context: http://www.nabble.com/Recreating-crawled-documents-out-of-Nutch-indexes-segments-tp19605603p19605603.html
Sent from the Nutch - User mailing list archive at Nabble.com.
