Re: hadoop

ogjunk-nutch Mon, 21 Apr 2008 18:23:55 -0700

Jason, that -put only copied the file into HDFS.  It did not inject the URLs
into the crawldb, which is what you still need to do with bin/nutch inject ....
After that, run generate, fetch2, updatedb...
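
Roughly, the whole round could look like the sketch below.  It reuses the
paths from your own commands (crawl/crawldb, crawl/segments, urls) and assumes
Nutch is configured to run on Hadoop, so relative paths resolve under your HDFS
home directory (/user/root here) rather than the local disk.  The helper names
latest_segment and crawl_round are made up for illustration, and the parsing
assumes each listing line contains the segment path; the exact -ls output
differs between Hadoop versions, so check it against yours.

```shell
#!/bin/sh
# Sketch only: paths follow the thread; the listing format is an assumption.

# Read `bin/hadoop dfs -ls` output on stdin and print the newest segment path.
# Segment directories are named with a 14-digit timestamp, so a plain sort
# puts the newest one last. This replaces the local `ls -d ... | tail -1`.
latest_segment() {
  grep -oE '[^[:space:]]*segments/[0-9]{14}' | sort | tail -1
}

# One inject + generate/fetch/update round (hypothetical helper).
# It is only defined here; call crawl_round yourself to run it.
crawl_round() {
  # Load the seed list that was copied into HDFS with -put into the crawldb.
  bin/nutch inject crawl/crawldb urls || return 1
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000 || return 1

  # The segments live in HDFS now, so list them through hadoop, not ls.
  s2=$(bin/hadoop dfs -ls crawl/segments | latest_segment)
  [ -n "$s2" ] || return 1
  echo "$s2"

  # fetch2 as suggested above; use plain `fetch` if your release lacks it.
  bin/nutch fetch2 "$s2" || return 1
  bin/nutch updatedb crawl/crawldb "$s2"
}
```

The only real change from the local-filesystem recipe is the extra inject step
up front and finding the segment via bin/hadoop dfs -ls instead of ls.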


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Jason Boss <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, April 21, 2008 8:36:55 PM
> Subject: Re: hadoop
> 
> >  Whole web?  What for, if not a secret?
> 
> Tinkering and perhaps more.  I used nutch back in the day but dang you
> guys have come a long ways!
> 
> >  Suggestion: don't run things as root.
> 
> I know :)
> 
> >  Have you formatted the filesystem?
> 
> Yes, I formatted the file system as per a tutorial I found online:
> bin/hadoop namenode -format
> 
> >  Can you run bin/hadoop fs -ls /user/root/crawl ?
> >
> [EMAIL PROTECTED] search]# bin/hadoop fs -ls /usr/root/crawl
> Found 0 items
> 
> Doesn't look so good...
> 
> >  Oh, if you have not injected any URLs, there is nothing to crawl in your
> >  crawldb.
> >  Run bin/nutch and you will see "inject" as one of the options.
> >
> bin/hadoop dfs -put urls urls
> 
> I did a dfs -ls and it appears there.  For whole web indexing I was used to:
> 
> bin/nutch generate crawl/crawldb crawl/segments -topN 1000
> s2=`ls -d crawl/segments/2* | tail -1`
> echo $s2
> bin/nutch fetch $s2
> bin/nutch updatedb crawl/crawldb $s2
> 
> With hadoop what changes?  Do I just point to the virtual file system?
> 
> Thanks a ton!
> 
> Jason
