Yes, I see what you mean about re-indexing over all the segments again. However, indexing takes a lot of time, and I was hoping that merging many smaller indexes would be a much more efficient method. Besides, deleting the index and re-indexing just doesn't seem like *The Right Thing(tm)*.
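To make that concrete, what I was hoping for is essentially your attempt (B) below: index only the new segment into its own directory, then merge that index with the existing one instead of rebuilding everything. Translated to my crawl/ layout it would be something like the following (the directory names are just illustrative, the command usage is borrowed from the messages quoted below, and I haven't verified that the result is actually searchable, which is really my question):

bin/nutch invertlinks crawl/linkdb $lastsegment
bin/nutch index crawl/indexes-new crawl/crawldb crawl/linkdb $lastsegment
bin/nutch merge crawl/indexes-merged crawl/indexes crawl/indexes-new
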
On 5/26/06, zzcgiacomini <[EMAIL PROTECTED]> wrote:
I am not at all a Nutch expert, I am just experimenting a little bit, but as
far as I understood it you can remove the indexes directory and re-index the
segments. In my case, after step 8 (see below) I have only one segment:
test/segments/20060522144050; after step 9 I have a second segment:
test/segments/20060522151957. Now what we can do is remove the test/indexes
directory and re-index the two segments. This is what I did:

hadoop dfs -rm test/indexes
nutch index test/indexes test/crawldb linkdb test/segments/20060522144050 test/segments/20060522151957

Hope it helps
-Corrado

Jacob Brunson wrote:
> I looked at the referenced message at
> http://www.mail-archive.com/[email protected]/msg03990.html
> but I am still having problems.
>
> I am running the latest checkout from subversion.
>
> These are the commands which I've run:
> bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 10000
> bin/nutch generate crawl/crawldb crawl/segments -topN 500
> lastsegment=`ls -d crawl/segments/2* | tail -1`
> bin/nutch fetch $lastsegment
> bin/nutch updatedb crawl/crawldb $lastsegment
> bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $lastsegment
>
> This last command fails with a java.io.IOException saying: "Output
> directory /home/nutch/nutch/crawl/indexes already exists"
>
> So I'm confused because it seems like I did exactly what was described
> in the referenced email, but it didn't work for me. Can someone help
> me figure out what I'm doing wrong or what I need to do instead?
> Thanks,
> Jacob
>
> On 5/22/06, sudhendra seshachala <[EMAIL PROTECTED]> wrote:
>> Please do follow the link below:
>> http://www.mail-archive.com/[email protected]/msg03990.html
>>
>> I have been able to follow the thread as explained and merge
>> multiple crawls. It works like a champ.
>>
>> Thanks
>> Sudhi
>>
>> zzcgiacomini <[EMAIL PROTECTED]> wrote:
>> I am currently using the latest nightly nutch-0.8-dev build and
>> I am really confused about how to proceed after I have done two
>> different "whole web" incremental crawls.
>>
>> The tutorial is not clear to me on how to merge the results of the
>> two crawls so that I can search over both of them.
>>
>> Could someone please give me a hint on what the right procedure is?
>> Here is what I am doing:
>>
>> 1. create an initial urls file /tmp/dmoz/urls.txt
>> 2. hadoop dfs -put /tmp/urls/ url
>> 3. nutch inject test/crawldb dmoz
>> 4. nutch generate test/crawldb test/segments
>> 5. nutch fetch test/segments/20060522144050
>> 6. nutch updatedb test/crawldb test/segments/20060522144050
>> 7. nutch invertlinks linkdb test/segments/20060522144050
>> 8. nutch index test/indexes test/crawldb linkdb test/segments/20060522144050
>>
>> ...and now I am able to search.
>>
>> Now I run
>>
>> 9. nutch generate test/crawldb test/segments -topN 1000
>>
>> and I end up with a new segment: test/segments/20060522151957
>>
>> 10. nutch fetch test/segments/20060522151957
>> 11. nutch updatedb test/crawldb test/segments/20060522151957
>>
>> From this point on I cannot make much progress.
>>
>> A) I have tried to merge the two segments into a new one, with the
>> idea of rerunning invertlinks and index on it:
>>
>> nutch mergesegs test/segments -dir test/segments
>>
>> but whatever I specify as the output dir or output segment I get errors.
>>
>> B) I have also tried to run invertlinks on all of test/segments, with
>> the goal of running the nutch index command to produce a second
>> indexes directory, say test/indexes1, and finally merging the two
>> indexes into index2:
>>
>> nutch invertlinks test/linkdb -dir test/segments
>>
>> This created a new linkdb directory *NOT* under test as specified,
>> but as /linkdb-1108390519.
>>
>> nutch index test/indexes1 test/crawldb linkdb test/segments/20060522144050
>> nutch merge index2 test/indexes test/indexes1
>>
>> Now I am not sure what to do; if I rename test/index2 to
>> test/indexes after having removed test/indexes, I am no longer
>> able to search.
>>
>> -Corrado
>>
>> Sudhi Seshachala
>> http://sudhilogs.blogspot.com/
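
P.S. On your attempt (A): my guess, and it is only a guess since I have not gotten mergesegs working myself, is that it wants an output directory that is different from the input segments directory, along these lines:

nutch mergesegs test/segments-merged -dir test/segments
nutch invertlinks test/linkdb test/segments-merged/<merged-segment>
nutch index test/indexes-new test/crawldb test/linkdb test/segments-merged/<merged-segment>

Here <merged-segment> stands for whatever single segment mergesegs writes into the output directory, and the -merged/-new names are only placeholders of mine. If someone who knows the 0.8-dev SegmentMerger can confirm or correct this, I'd appreciate it.
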
--
http://JacobBrunson.com
