Recreating crawled documents out of Nutch indexes/segments

Venkateshprasanna Mon, 22 Sep 2008 03:55:25 -0700

Hi,

For our Nutch deployment, I need access to the cached versions of the
crawled documents, so that I can run a post-processing task over all of
them.


I was wondering whether the following is the right approach:

1. Use the getContent() method of the FetchedSegments class to get the
content of text/html documents.
2. Use the getParseText() method of the FetchedSegments class to get the
text of documents in other formats.
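
To make the plan concrete, here is a rough sketch of the selection logic I have in mind. It does not use the real Nutch classes: SegmentStore is a hypothetical stand-in for the two FetchedSegments lookups, keyed by URL purely for illustration, and decoding the raw bytes as UTF-8 is an assumption as well.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class CachedDocSketch {

    // Hypothetical stand-in for the two FetchedSegments lookups named above.
    // The real methods take Nutch-specific arguments that are not modelled here.
    interface SegmentStore {
        byte[] getContent(String url);    // raw fetched bytes
        String getParseText(String url);  // text extracted by the parser
    }

    // Step 1 for text/html documents, step 2 for every other format.
    static String recreate(SegmentStore store, String url, String contentType) {
        if (contentType != null && contentType.startsWith("text/html")) {
            // Assumption: the page is UTF-8; a real crawl would need
            // per-document charset detection.
            return new String(store.getContent(url), StandardCharsets.UTF_8);
        }
        return store.getParseText(url);
    }

    // Tiny in-memory store so the sketch runs without Nutch on the classpath.
    static SegmentStore inMemory(Map<String, byte[]> raw, Map<String, String> parsed) {
        return new SegmentStore() {
            public byte[] getContent(String url) { return raw.get(url); }
            public String getParseText(String url) { return parsed.get(url); }
        };
    }

    public static void main(String[] args) {
        SegmentStore store = inMemory(
            Map.of("http://example.com/a.html",
                   "<p>hello</p>".getBytes(StandardCharsets.UTF_8)),
            Map.of("http://example.com/b.pdf", "text extracted from the PDF"));

        System.out.println(recreate(store, "http://example.com/a.html", "text/html"));
        System.out.println(recreate(store, "http://example.com/b.pdf", "application/pdf"));
    }
}
```

The open question is how to drive something like recreate() over every document in the segments, rather than only over search hits.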

Since this class lives in the org.apache.nutch.searcher package, would it
only help with the documents returned by a search (i.e., HitResults), or is
there a way to get all the indexed documents?

Or, is there a simpler or better way than this?

Regards,
Venkateshprasanna.

-- 
View this message in context: 
http://www.nabble.com/Recreating-crawled-documents-out-of-Nutch-indexes-segments-tp19605603p19605603.html
Sent from the Nutch - User mailing list archive at Nabble.com.
