Actually that command is for a distributed configuration across multiple
machines. The tutorial you referred to is aimed at entry-level users, who
typically don't need the distributed setup.

From your description, I guess you're running Nutch on a single machine,
which makes that command unnecessary for you.

But when you decide to deploy Nutch across multiple machines to do something
big, you'll have much more to do than that tutorial covers, including that
command :)
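
For a purely local run, the whole workflow is just the seed list and the
crawl command, roughly like this (a sketch assuming the default local
configuration; the directory and file names are only examples, and a flat
seed file like yours appears to work for the injector as well):

$ mkdir urls
$ echo 'http://www.yahoo.com/' > urls/seeds.txt
$ bin/nutch crawl urls -dir crawl -depth 1 -topN 5

The dfs -put step only matters once the seed list has to live in HDFS for a
multi-machine crawl.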

----- Original Message -----
From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Saturday, April 07, 2007 9:12 AM
Subject: Re: Trying to setup Nutch

> On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> So that's the problem: you have to replace MY.DOMAIN.NAME with the domains
>> you want to crawl.
>> For your situation, that line should read:
>> +^http://([a-z0-9]*\.)*(yahoo.com|cnn.com|amazon.com|msn.com|google.com)/
>> Check it out.
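>>
>> In context, the relevant part of conf/crawl-urlfilter.txt would then look
>> roughly like this (a sketch based on the stock file: dots escaped so they
>> only match literal dots, with the stock catch-all rule kept at the end):
>>
>> # accept hosts in the seed domains
>> +^http://([a-z0-9]*\.)*(yahoo\.com|cnn\.com|amazon\.com|msn\.com|google\.com)/
>>
>> # skip everything else
>> -.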
>>
>
> Thanks for your help.
> But according to the documentation at
> http://lucene.apache.org/nutch/tutorial8.html, I don't need to do
> this:
> $bin/hadoop dfs -put urls urls
>
> but I should do this for crawling:
>
> $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>
> Why do I need to do this, and what is that for?
> $bin/hadoop dfs -put urls urls
>
>> ----- Original Message -----
>> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
>> To: <[email protected]>
>> Sent: Saturday, April 07, 2007 9:02 AM
>> Subject: Re: Trying to setup Nutch
>>
>> > On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> >> Have you checked your crawl-urlfilter.txt file?
>> >> Make sure you have replaced MY.DOMAIN.NAME with your accepted domain.
>> >>
>> >
>> > I have this in my crawl-urlfilter.txt
>> >
>> > # accept hosts in MY.DOMAIN.NAME
>> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>> >
>> >
>> > But let's say I have
>> > yahoo, cnn, amazon, msn, and google
>> > in my 'urls' file; what should my accepted domain be?
>> >
>> >
>> >> ----- Original Message -----
>> >> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
>> >> To: <[email protected]>
>> >> Sent: Saturday, April 07, 2007 8:54 AM
>> >> Subject: Re: Trying to setup Nutch
>> >>
>> >> > On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> >> >> After setup, you should put the urls you want to crawl into HDFS with
>> >> >> the command:
>> >> >> $bin/hadoop dfs -put urls urls
>> >> >>
>> >> >> Maybe that's something you forgot to do and I hope it helps :)
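>> >> >>
>> >> >> If an old copy of the seed list is already sitting in DFS from an
>> >> >> earlier attempt, remove it first (a sketch, using the dfs shell
>> >> >> commands of that Hadoop version; use -rmr instead of -rm if the
>> >> >> target is a directory):
>> >> >>
>> >> >> $ bin/hadoop dfs -ls
>> >> >> $ bin/hadoop dfs -rm urls
>> >> >> $ bin/hadoop dfs -put urls urls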
>> >> >>
>> >> >
>> >> > I tried your command, but I get this error:
>> >> > $ bin/hadoop dfs -put urls urls
>> >> > put: Target urls already exists
>> >> >
>> >> >
>> >> > I just have 1 line in my file 'urls':
>> >> > $ more urls
>> >> > http://www.yahoo.com
>> >> >
>> >> > Thanks for any help.
>> >> >
>> >> >
>> >> >> ----- Original Message -----
>> >> >> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
>> >> >> To: <[email protected]>
>> >> >> Sent: Saturday, April 07, 2007 3:08 AM
>> >> >> Subject: Trying to setup Nutch
>> >> >>
>> >> >> > Hi,
>> >> >> >
>> >> >> > I am trying to set up Nutch.
>> >> >> > I set up one site in my urls file:
>> >> >> > http://www.yahoo.com
>> >> >> >
>> >> >> > And then I start a crawl using this command:
>> >> >> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> >> >> >
>> >> >> > But I get "No URLs to fetch"; can you please tell me what I am
>> >> >> > missing?
>> >> >> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> >> >> > crawl started in: crawl
>> >> >> > rootUrlDir = urls
>> >> >> > threads = 10
>> >> >> > depth = 1
>> >> >> > topN = 5
>> >> >> > Injector: starting
>> >> >> > Injector: crawlDb: crawl/crawldb
>> >> >> > Injector: urlDir: urls
>> >> >> > Injector: Converting injected urls to crawl db entries.
>> >> >> > Injector: Merging injected urls into crawl db.
>> >> >> > Injector: done
>> >> >> > Generator: Selecting best-scoring urls due for fetch.
>> >> >> > Generator: starting
>> >> >> > Generator: segment: crawl/segments/20070406140513
>> >> >> > Generator: filtering: false
>> >> >> > Generator: topN: 5
>> >> >> > Generator: jobtracker is 'local', generating exactly one partition.
>> >> >> > Generator: 0 records selected for fetching, exiting ...
>> >> >> > Stopping at depth=0 - no more URLs to fetch.
>> >> >> > No URLs to fetch - check your seed list and URL filters.
>> >> >> > crawl finished: crawl
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >
>>
> 
