On 4/25/07, Charlie Williams <[EMAIL PROTECTED]> wrote:
> I have an index of pages from the web, a bit over 1 million. The fetch took
> several weeks to complete, since it was mainly over a small set of domains.
> Once we had a completed fetch and index, we began trying to work with the
> retrieved text, and found that the cached text is just that, flat text. Is
> the original HTML cached anywhere that it can be accessed after the initial
> fetch? It would be a shame to have to recrawl all those pages. We are using
> Nutch 0.8.
If you have fetcher.store.content set to true, then Nutch has stored a copy
of all the pages in <segment_dir>/content. You can extract the content of an
individual page with the command:

  ./nutch readseg -get <segment_dir> <url> -noparse -nofetch -nogenerate -noparsetext -noparsedata

> Thanks for any help.
>
> -Charlie

--
Doğacan Güney
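P.S. If fetcher.store.content was not enabled for your crawl, it can be turned
on in conf/nutch-site.xml before the next fetch. A minimal sketch of the
property entry, assuming your nutch-site.xml already has the usual
<configuration> root element:

  <property>
    <name>fetcher.store.content</name>
    <value>true</value>
  </property>

And a concrete readseg invocation, with a made-up segment directory and URL as
placeholders:

  ./nutch readseg -get crawl/segments/20070425010203 http://www.example.com/index.html -nofetch -nogenerate -noparse -noparsedata -noparsetext

The -no* flags suppress the other segment parts, so this should print only the
stored raw content for that one URL.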