On 5/25/06, Jacob Brunson <[EMAIL PROTECTED]> wrote:
I looked at the referenced message at
http://www.mail-archive.com/[email protected]/msg03990.html
but I am still having problems.
I am running the latest checkout from subversion.
These are the commands which I've run:
bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 10000
bin/nutch crawl is a one-shot command to generate/fetch/index a
Nutch index. I would NOT recommend using this one-shot command.
Please take the long route, which will give you more control over your
tasks. The long route means: inject, generate, fetch, updatedb,
index, dedup, merge (sketched below). Please see the following -
Whole web crawling...
http://lucene.apache.org/nutch/tutorial8.html#Whole-web+Crawling
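A minimal sketch of one round of that cycle (untested; it assumes your
crawl data lives under crawl/ and your seed URLs are in myurls/, so
adjust the names to your setup):

bin/nutch inject crawl/crawldb myurls
bin/nutch generate crawl/crawldb crawl/segments -topN 10000
segment=`ls -d crawl/segments/2* | tail -1`    # newest segment
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment
bin/nutch invertlinks crawl/linkdb $segment
# index each round into a directory that does not exist yet
bin/nutch index crawl/indexes1 crawl/crawldb crawl/linkdb $segment
bin/nutch dedup crawl/indexes1
# after a second round indexed into crawl/indexes2, merge the indexes:
bin/nutch merge crawl/merged-index crawl/indexes1 crawl/indexes2

That is also why your index step failed: the jobs refuse to write into
an output directory that already exists, so each index run needs a
fresh directory, merged afterwards.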
Cheers
bin/nutch generate crawl/crawldb crawl/segments -topN 500
lastsegment=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $lastsegment
bin/nutch updatedb crawl/crawldb $lastsegment
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $lastsegment
This last command fails with a java.io.IOException saying: "Output
directory /home/nutch/nutch/crawl/indexes already exists"
So I'm confused because it seems like I did exactly what was described
in the referenced email, but it didn't work for me. Can someone help
me figure out what I'm doing wrong or what I need to do instead?
Thanks,
Jacob
On 5/22/06, sudhendra seshachala <[EMAIL PROTECTED]> wrote:
> Please do follow the link below:
> http://www.mail-archive.com/[email protected]/msg03990.html
>
> I have been able to follow the threads as explained and merge
> multiple crawls. It works like a champ.
>
> Thanks
> Sudhi
>
> zzcgiacomini <[EMAIL PROTECTED]> wrote:
> I am currently using the latest nightly nutch-0.8-dev build and
> I am really confused about how to proceed after I have done two
> different "whole web" incremental crawls.
>
> The tutorial is not clear to me on how to merge the results of the
> two crawls in order to be able to search them.
>
> Could someone please give me a hint on the right procedure?
> Here is what I am doing:
>
> 1. create an initial urls file /tmp/dmoz/urls.txt
> 2. hadoop dfs -put /tmp/urls/ url
> 3. nutch inject test/crawldb dmoz
> 4. nutch generate test/crawldb test/segments
> 5. nutch fetch test/segments/20060522144050
> 6. nutch updatedb test/crawldb test/segments/20060522144050
> 7. nutch invertlinks linkdb test/segments/20060522144050
> 8. nutch index test/indexes test/crawldb linkdb
> test/segments/20060522144050
>
> ..and now I am able to search...
>
> Now I run
>
> 9. nutch generate test/crawldb test/segments -topN 1000
>
> and I end up with a new segment: test/segments/20060522151957
>
> 10. nutch fetch test/segments/20060522151957
> 11. nutch updatedb test/crawldb test/segments/20060522151957
>
>
> From this point on I cannot make much progress.
>
> A) I have tried to merge the two segments into a new one, with the
> idea of rerunning invertlinks and index on it, but:
>
> nutch mergesegs test/segments -dir test/segments
>
> Whatever I specify as the output dir or output segment, I get errors.
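> In other words, the pipeline I am after is roughly this (the name of
> the merged segment directory is just made up):
>
> nutch mergesegs test/segments_merged -dir test/segments
> nutch invertlinks test/linkdb -dir test/segments_merged
> nutch index test/indexes_new test/crawldb test/linkdb test/segments_merged/*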
>
> B) I have also tried to run invertlinks on all of test/segments, with
> the goal of running the nutch index command to produce a second
> indexes directory, say test/indexes1, and finally running the index
> merge to produce index2:
>
> nutch invertlinks test/linkdb -dir test/segments
>
> This created a new linkdb directory *NOT* under test as specified, but
> at /linkdb-1108390519
>
> nutch index test/indexes1 test/crawldb linkdb test/segments/20060522144050
> nutch merge index2 test/indexes test/indexes1
>
> Now I am not sure what to do; if I rename test/index2 to be
> test/indexes after having removed test/indexes, I will not be able to
> search anymore.
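> By "rename" I mean something along these lines (everything is in DFS,
> per step 2 above):
>
> hadoop dfs -rm test/indexes
> hadoop dfs -mv test/index2 test/indexes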
>
>
> -Corrado
>
> Sudhi Seshachala
> http://sudhilogs.blogspot.com/
>
--
http://JacobBrunson.com