On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Have you checked your crawl-urlfilter.txt file?
> Make sure you have replaced your accepted domain.
>
I have this in my crawl-urlfilter.txt:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

But let's say I have yahoo, cnn, amazon, msn, google in my 'urls' file. What should my accepted domain be?

> ----- Original Message -----
> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Saturday, April 07, 2007 8:54 AM
> Subject: Re: Trying to setup Nutch
>
>
> On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> After setup, you should put the urls you want to crawl into the HDFS
> >> with the command:
> >> $ bin/hadoop dfs -put urls urls
> >>
> >> Maybe that's something you forgot to do and I hope it helps :)
> >>
> >
> > I tried your command, but I get this error:
> > $ bin/hadoop dfs -put urls urls
> > put: Target urls already exists
> >
> >
> > I just have 1 line in my file 'urls':
> > $ more urls
> > http://www.yahoo.com
> >
> > Thanks for any help.
> >
> >
> >> ----- Original Message -----
> >> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
> >> To: <[email protected]>
> >> Sent: Saturday, April 07, 2007 3:08 AM
> >> Subject: Trying to setup Nutch
> >>
> >> > Hi,
> >> >
> >> > I am trying to set up Nutch.
> >> > I set up 1 site in my urls file:
> >> > http://www.yahoo.com
> >> >
> >> > And then I start the crawl using this command:
> >> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >> >
> >> > But I get this "No URLs to fetch". Can you please tell me what I am
> >> > missing?
> >> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >> > crawl started in: crawl
> >> > rootUrlDir = urls
> >> > threads = 10
> >> > depth = 1
> >> > topN = 5
> >> > Injector: starting
> >> > Injector: crawlDb: crawl/crawldb
> >> > Injector: urlDir: urls
> >> > Injector: Converting injected urls to crawl db entries.
> >> > Injector: Merging injected urls into crawl db.
> >> > Injector: done
> >> > Generator: Selecting best-scoring urls due for fetch.
> >> > Generator: starting
> >> > Generator: segment: crawl/segments/20070406140513
> >> > Generator: filtering: false
> >> > Generator: topN: 5
> >> > Generator: jobtracker is 'local', generating exactly one partition.
> >> > Generator: 0 records selected for fetching, exiting ...
> >> > Stopping at depth=0 - no more URLs to fetch.
> >> > No URLs to fetch - check your seed list and URL filters.
> >> > crawl finished: crawl
> >> >
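
On the accepted-domain question, here is a minimal Python sketch (not Nutch code) of how the rules in crawl-urlfilter.txt appear to be evaluated. It assumes the rules are tried top to bottom, that the first pattern found in the URL decides ('+' accepts, '-' rejects), and that a URL matching no rule is dropped. The multi-domain pattern, the trailing "-." catch-all, and the names parse_rules and accepts are illustrative assumptions, not taken from the thread. The sample URLs carry a trailing slash because the pattern requires one.

```python
import re

# The rule quoted in the message, with the placeholder domain left untouched.
PLACEHOLDER_RULES = r"""
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
-.
"""

# Hypothetical replacement covering the five seed sites named in the message.
# This assumes every seed lives under a .com domain.
MULTI_DOMAIN_RULES = r"""
# accept hosts in any of the seed domains
+^http://([a-z0-9]*\.)*(yahoo|cnn|amazon|msn|google)\.com/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn filter text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is rejected


if __name__ == "__main__":
    for name, text in (("placeholder", PLACEHOLDER_RULES),
                       ("multi-domain", MULTI_DOMAIN_RULES)):
        rules = parse_rules(text)
        for url in ("http://www.yahoo.com/", "http://www.cnn.com/"):
            print(name, url, accepts(rules, url))
```

Under these assumptions the untouched MY.DOMAIN.NAME rule rejects every seed, which would be consistent with "0 records selected for fetching". One alternation group keeps the file short; one '+' line per domain should behave the same way and may be easier to maintain.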
