I haven't tried this yet - but could you maybe (rough sketch below):
- move the new segments somewhere independent of the existing ones
- create a separate linkdb for it (to my understanding the linkdb is
only needed when indexing)
- create a separate index on that
- then move segment into segments-dir and new index into indexes-dir as
"part-XXXX"
- just merge indexes (should work relatively fast)
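
Untested, and the "fresh/" paths below are only placeholders for
wherever the new segment gets fetched, but roughly something like this
is what I mean:

nutch invertlinks fresh/linkdb fresh/segments/<newsegment>
nutch index fresh/indexes crawl/crawldb fresh/linkdb fresh/segments/<newsegment>
hadoop dfs -mv fresh/segments/<newsegment> crawl/segments/<newsegment>
hadoop dfs -mv fresh/indexes/part-00000 crawl/indexes/part-00001
nutch merge crawl/index crawl/indexes

(The moved index part gets renamed so it does not collide with the
existing part-00000, and the merge writes one merged index to
crawl/index - singular - which, as far as I know, the searcher picks
up before falling back to the part indexes under crawl/indexes.)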

In the long term your segments, indexes etc. add up - so at some point
you'd also need to think about merging segments; see the sketch below.
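
For the segments there is the mergesegs command; I would expect
something like

nutch mergesegs crawl/merged_segments -dir crawl/segments

to write the merged segment into a fresh output directory (not the
segments dir itself), after which the linkdb and the index would need
to be rebuilt.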

Also, this is "only" my current understanding of the topic. It would be
nice to get feedback and maybe easier solutions from others as well.



Regards,
 Stefan

Jacob Brunson wrote:
> Yes, I see what you mean about re-indexing again over all the
> segments.  However, indexing takes a lot of time and I was hoping that
> merging many smaller indexes would be a much more efficient method.
> Besides, deleting the index and re-indexing just doesn't seem like
> *The Right Thing(tm)*.
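>
> That is, index each new segment into its own directory and then
> combine everything with something like
>
> nutch merge merged-index indexes1 indexes2
>
> (if I am reading the IndexMerger usage correctly).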
> 
> On 5/26/06, zzcgiacomini <[EMAIL PROTECTED]> wrote:
>> I am not at all a Nutch expert, I am just experimenting a little bit,
>> but as far as I understand it,
>> you can remove the indexes directory and re-index the segments again.
>> In my case, after step 8 (see below) I have only one segment:
>> test/segments/20060522144050
>> After step 9 I have a second segment:
>> test/segments/20060522151957
>> Now what we can do is remove the test/indexes directory and
>> re-index the two segments.
>> This is what I did:
>>
>> hadoop dfs -rm test/indexes
>> nutch index test/indexes test/crawldb linkdb
>> test/segments/20060522144050 test/segments/20060522151957
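>>
>> (If duplicate pages across the two segments ever become a problem, I
>> believe there is also a dedup step that can be run over the rebuilt
>> indexes, something like
>>
>> nutch dedup test/indexes
>>
>> but I have not tried that myself.)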
>>
>> Hope it helps
>>
>> -Corrado
>>
>>
>>
>> Jacob Brunson wrote:
>> > I looked at the referenced message at
>> > http://www.mail-archive.com/[email protected]/msg03990.html
>> > but I am still having problems.
>> >
>> > I am running the latest checkout from subversion.
>> >
>> > These are the commands which I've run:
>> > bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 10000
>> > bin/nutch generate crawl/crawldb crawl/segments -topN 500
>> > lastsegment=`ls -d crawl/segments/2* | tail -1`
>> > bin/nutch fetch $lastsegment
>> > bin/nutch updatedb crawl/crawldb $lastsegment
>> > bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $lastsegment
>> >
>> > This last command fails with a java.io.IOException saying: "Output
>> > directory /home/nutch/nutch/crawl/indexes already exists"
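>> >
>> > My guess is that the indexer, being a MapReduce job, simply refuses
>> > to write into an output directory that already exists, so I would
>> > first have to do something like
>> >
>> > bin/hadoop dfs -rm crawl/indexes
>> >
>> > and then index everything again - but that means re-indexing all
>> > segments, which is what I am trying to avoid.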
>> >
>> > So I'm confused because it seems like I did exactly what was described
>> > in the referenced email, but it didn't work for me.  Can someone help
>> > me figure out what I'm doing wrong or what I need to do instead?
>> > Thanks,
>> > Jacob
>> >
>> >
>> > On 5/22/06, sudhendra seshachala <[EMAIL PROTECTED]> wrote:
>> >> Please do follow the link below..
>> >> http://www.mail-archive.com/[email protected]/msg03990.html
>> >>
>> >> I have been able to follow the thread as explained and merge
>> >> multiple crawls. It works like a champ.
>> >>
>> >>   Thanks
>> >>   Sudhi
>> >>
>> >> zzcgiacomini <[EMAIL PROTECTED]> wrote:
>> >> I am currently using the latest nightly nutch-0.8-dev build and
>> >> I am really confused about how to proceed after I have done two
>> >> different "whole web" incremental crawls.
>> >>
>> >> The tutorial is not clear to me on how to merge the results of the
>> >> two crawls so that I can search over both.
>> >>
>> >> Could someone please give me a hint on what the right procedure is?
>> >> here is what I am doing:
>> >>
>> >> 1. create an initial urls file /tmp/dmoz/urls.txt
>> >> 2. hadoop dfs -put /tmp/dmoz dmoz
>> >> 3. nutch inject test/crawldb dmoz
>> >> 4. nutch generate test/crawldb test/segments
>> >> 5. nutch fetch test/segments/20060522144050
>> >> 6. nutch updatedb test/crawldb test/segments/20060522144050
>> >> 7. nutch invertlinks linkdb test/segments/20060522144050
>> >> 8. nutch index test/indexes test/crawldb linkdb
>> >> test/segments/20060522144050
>> >>
>> >> ...and now I am able to search...
>> >>
>> >> Now I run
>> >>
>> >> 9. nutch generate test/crawldb test/segments -topN 1000
>> >>
>> >> and I end up with a new segment: test/segments/20060522151957
>> >>
>> >> 10. nutch fetch test/segments/20060522151957
>> >> 11. nutch updatedb test/crawldb test/segments/20060522151957
>> >>
>> >>
>> >> From this point on I cannot make much progress.
>> >>
>> >> A) I have tried to merge the two segments into a new one, with the
>> >> idea of rerunning invertlinks and index on it, but:
>> >>
>> >> nutch mergesegs test/segments -dir test/segments
>> >>
>> >> whatever I specify as the output dir or output segment, I get errors.
>> >>
>> >> B) I have also tried to run invertlinks on all of test/segments,
>> >> with the goal of then running the nutch index command to produce a
>> >> second indexes directory, say test/indexes1, and finally merging
>> >> the indexes into index2:
>> >>
>> >> nutch invertlinks test/linkdb -dir test/segments
>> >>
>> >> This has created a new linkdb directory, *NOT* under test as
>> >> specified, but at /linkdb-1108390519
>> >>
>> >> nutch index test/indexes1 test/crawldb linkdb
>> >> test/segments/20060522144050
>> >> nutch merge index2 test/indexes test/indexes1
>> >>
>> >> Now I am not sure what to do; if I rename test/index2 to be
>> >> test/indexes after having removed test/indexes,
>> >> I will not be able to search anymore.
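>> >>
>> >> (Or does the searcher perhaps expect a single merged index under
>> >> test/index - singular - and only the part-NNNNN indexes under
>> >> test/indexes? In that case something like
>> >>
>> >> hadoop dfs -mv index2 test/index
>> >>
>> >> might be the thing to try instead of renaming it to test/indexes.)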
