Re: Dumping raw html and javascript

Doğacan Güney Wed, 01 Oct 2008 00:38:18 -0700

On Mon, Sep 29, 2008 at 9:19 PM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
> Once I have done a crawl I have a need to pass all of the raw HTML and
> javascript that has been fetched through a custom parser. During a fetch
> does nutch store all of the raw content including HTML tags on disk?


Yes, as long as fetcher.store.content is set to true (which is the default).

The raw content of each page is saved under the <segment>/content directory.
To retrieve the content of a particular page, you can try this:

bin/nutch readseg -get <segment> <url> -noparse -noparsedata -nofetch -nogenerate -noparsetext
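
If you need to do this for more than one URL, a small wrapper around the
command above may be handy. This is only a rough sketch: NUTCH_BIN and the
two function names are made up for illustration, and it assumes you run it
from the Nutch install directory so that bin/nutch resolves.

```shell
#!/bin/sh
# Hypothetical helper around the readseg command shown above.
# NUTCH_BIN is an assumed variable name; override it if nutch lives elsewhere.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

# Print the readseg command line that suppresses every record type
# except the raw content, without running it (useful as a dry run).
raw_content_cmd() {
    segment="$1"
    url="$2"
    printf '%s readseg -get %s %s -noparse -noparsedata -nofetch -nogenerate -noparsetext\n' \
        "$NUTCH_BIN" "$segment" "$url"
}

# Actually run the command for one segment and URL.
get_raw_content() {
    segment="$1"
    url="$2"
    "$NUTCH_BIN" readseg -get "$segment" "$url" \
        -noparse -noparsedata -nofetch -nogenerate -noparsetext
}
```

You would call get_raw_content with the segment path and the URL, in the same
order as in the command above, and hand the output to your own parser.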

> Thanks
>
> Kevin
>



-- 
Doğacan Güney
