Hi, but how do I dump the content? I tried this command:
  ./bin/nutch readseg -dump crawl/segments/20090903121951/content/ toto

and it said:

  Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/nutch-1.0/crawl/segments/20091001120102/content/crawl_generate

but crawl_generate is in this path: /usr/local/nutch-1.0/crawl/segments/20091001120102, and not in this one: /usr/local/nutch-1.0/crawl/segments/20091001120102/content. Can you please just give me the correct command? Thanks.

> Date: Thu, 1 Oct 2009 18:16:43 +0200
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: R: Using Nutch for only retrieving HTML
>
> BELLINI ADAM wrote:
> > hi,
> > thanks for the advice,
> > but I guess when you run the readseg command it will not return the pages as-is (as if browsed).
> > I tried it and it returns information about the pages:
> >
> > Recno:: 0
> > URL:: http://blabla.com/blabla.jsp
> >
> > CrawlDatum::
> > Version: 7
> > Status: 67 (linked)
> > Fetch time: Mon Aug 31 16:11:26 EDT 2009
> > Modified time: Wed Dec 31 19:00:00 EST 1969
> > Retries since fetch: 0
> > Retry interval: 86400 seconds (1 days)
> > Score: 8.849112E-7
> > Signature: null
> > Metadata:
> >
> > Is there another way to get the source of the page as if it were browsed? I mean, as if we ran wget?
>
> The above record comes from the <segmentDir>/crawl_parse part of the segment.
> If you dump the /content part then you will get the original raw content.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
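
For what it's worth, the error itself points at the fix: readseg -dump expects the segment directory (it reads all the segment parts, including crawl_generate), not the content/ subdirectory. A likely correct invocation, assuming your segment really is crawl/segments/20090903121951:

  # point -dump at the segment directory, not at content/
  ./bin/nutch readseg -dump crawl/segments/20090903121951 toto

  # to dump only the raw content part, suppress the other parts
  # (SegmentReader in Nutch 1.0 accepts these -no* switches)
  ./bin/nutch readseg -dump crawl/segments/20090903121951 toto \
      -nofetch -nogenerate -noparse -noparsedata -noparsetext

The text dump should then land in toto/dump.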