On 4/25/07, Charlie Williams <[EMAIL PROTECTED]> wrote:
> I have an index of pages from the web, a bit over 1 million. The fetch took
> several weeks to complete, since it was mainly over a small set of domains.
> Once we had a completed fetch and index, we began trying to work with the
> retrieved text, and found that the cached text is just that, flat text. Is
> the original HTML cached anywhere that it can be accessed after the initial
> fetch? It would be a shame to have to recrawl all those pages. We are using
> Nutch 0.8.

If you have fetcher.store.content set to true, then Nutch has stored a
copy of every fetched page under <segment_dir>/content. You can extract
the content of an individual page with the command "./nutch readseg -get
<segment_dir> <url> -noparse -nofetch -nogenerate -noparsetext
-noparsedata".

>
> Thanks for any help.
>
> -Charlie
>


-- 
Doğacan Güney