Hi, but how do I dump the content? I tried this command:
  ./bin/nutch readseg -dump crawl/segments/20090903121951/content/ toto

and it said:

  Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/nutch-1.0/crawl/segments/20091001120102/content/crawl_generate

but crawl_generate is in this path: /usr/local/nutch-1.0/crawl/segments/20091001120102, and not in this one: /usr/local/nutch-1.0/crawl/segments/20091001120102/content. Can you please just give me the correct command? Thanks.

> Date: Thu, 1 Oct 2009 18:16:43 +0200
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: R: Using Nutch for only retrieving HTML
>
> BELLINI ADAM wrote:
> > hi,
> > thanks for the advice,
> > but I guess when you run the readseg command it will not return the pages as-is (as if browsed).
> > I tried it and it returns information about the pages:
> >
> > Recno:: 0
> > URL:: http://blabla.com/blabla.jsp
> >
> > CrawlDatum::
> > Version: 7
> > Status: 67 (linked)
> > Fetch time: Mon Aug 31 16:11:26 EDT 2009
> > Modified time: Wed Dec 31 19:00:00 EST 1969
> > Retries since fetch: 0
> > Retry interval: 86400 seconds (1 days)
> > Score: 8.849112E-7
> > Signature: null
> > Metadata:
> >
> > Is there another way to get the source of the page as if it were browsed? I mean, as if we ran wget?
>
> The above record comes from the <segmentDir>/crawl_parse part of the segment.
> If you dump the /content part then you will get the original raw content.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
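
For what it's worth, the error itself points at the fix: readseg -dump expects the segment directory (it reads all the segment parts, including crawl_generate), not the content/ subdirectory. A likely correct invocation, assuming your segment really is crawl/segments/20090903121951:

  # point -dump at the segment directory, not at content/
  ./bin/nutch readseg -dump crawl/segments/20090903121951 toto

  # to dump only the raw content part, suppress the other parts
  # (SegmentReader in Nutch 1.0 accepts these -no* switches)
  ./bin/nutch readseg -dump crawl/segments/20090903121951 toto \
      -nofetch -nogenerate -noparse -noparsedata -noparsetext

The text dump should then land in toto/dump.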