Yes, I see what you mean about re-indexing over all the segments again. However, indexing takes a lot of time, and I was hoping that merging many smaller indexes would be a much more efficient method. Besides, deleting the index and re-indexing just doesn't seem like *The Right Thing(tm)*.
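To make that concrete, what I was hoping for is essentially your attempt (B) below: index only the new segment into its own directory, then merge that index with the existing one instead of rebuilding everything. Translated to my crawl/ layout it would be something like the following (the directory names are just illustrative, the command usage is borrowed from the messages quoted below, and I haven't verified that the result is actually searchable, which is really my question):

bin/nutch invertlinks crawl/linkdb $lastsegment
bin/nutch index crawl/indexes-new crawl/crawldb crawl/linkdb $lastsegment
bin/nutch merge crawl/indexes-merged crawl/indexes crawl/indexes-new
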
On 5/26/06, zzcgiacomini <[EMAIL PROTECTED]> wrote:
I am not at all a Nutch expert, I am just experimenting a little bit, but as
far as I understood it you can remove the indexes directory and re-index the
segments. In my case, after step 8 (see below) I have only one segment:
test/segments/20060522144050; after step 9 I have a second segment:
test/segments/20060522151957. Now what we can do is remove the test/indexes
directory and re-index the two segments. This is what I did:

hadoop dfs -rm test/indexes
nutch index test/indexes test/crawldb linkdb test/segments/20060522144050 test/segments/20060522151957

Hope it helps
-Corrado

Jacob Brunson wrote:
> I looked at the referenced message at
> http://www.mail-archive.com/[email protected]/msg03990.html
> but I am still having problems.
>
> I am running the latest checkout from subversion.
>
> These are the commands which I've run:
> bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 10000
> bin/nutch generate crawl/crawldb crawl/segments -topN 500
> lastsegment=`ls -d crawl/segments/2* | tail -1`
> bin/nutch fetch $lastsegment
> bin/nutch updatedb crawl/crawldb $lastsegment
> bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $lastsegment
>
> This last command fails with a java.io.IOException saying: "Output
> directory /home/nutch/nutch/crawl/indexes already exists"
>
> So I'm confused because it seems like I did exactly what was described
> in the referenced email, but it didn't work for me. Can someone help
> me figure out what I'm doing wrong or what I need to do instead?
> Thanks,
> Jacob
>
> On 5/22/06, sudhendra seshachala <[EMAIL PROTECTED]> wrote:
>> Please do follow the link below:
>> http://www.mail-archive.com/[email protected]/msg03990.html
>>
>> I have been able to follow the thread as explained and merge
>> multiple crawls. It works like a champ.
>>
>> Thanks
>> Sudhi
>>
>> zzcgiacomini <[EMAIL PROTECTED]> wrote:
>> I am currently using the latest nightly nutch-0.8-dev build and
>> I am really confused about how to proceed after I have done two
>> different "whole web" incremental crawls.
>>
>> The tutorial is not clear to me on how to merge the results of the
>> two crawls so that I can search over both of them.
>>
>> Could someone please give me a hint on what the right procedure is?
>> Here is what I am doing:
>>
>> 1. create an initial urls file /tmp/dmoz/urls.txt
>> 2. hadoop dfs -put /tmp/urls/ url
>> 3. nutch inject test/crawldb dmoz
>> 4. nutch generate test/crawldb test/segments
>> 5. nutch fetch test/segments/20060522144050
>> 6. nutch updatedb test/crawldb test/segments/20060522144050
>> 7. nutch invertlinks linkdb test/segments/20060522144050
>> 8. nutch index test/indexes test/crawldb linkdb test/segments/20060522144050
>>
>> ...and now I am able to search.
>>
>> Now I run
>>
>> 9. nutch generate test/crawldb test/segments -topN 1000
>>
>> and I end up with a new segment: test/segments/20060522151957
>>
>> 10. nutch fetch test/segments/20060522151957
>> 11. nutch updatedb test/crawldb test/segments/20060522151957
>>
>> From this point on I cannot make much progress.
>>
>> A) I have tried to merge the two segments into a new one, with the
>> idea of rerunning invertlinks and index on it:
>>
>> nutch mergesegs test/segments -dir test/segments
>>
>> but whatever I specify as the output dir or output segment I get errors.
>>
>> B) I have also tried to run invertlinks on all of test/segments, with
>> the goal of running the nutch index command to produce a second
>> indexes directory, say test/indexes1, and finally merging the two
>> indexes into index2:
>>
>> nutch invertlinks test/linkdb -dir test/segments
>>
>> This created a new linkdb directory *NOT* under test as specified,
>> but as /linkdb-1108390519.
>>
>> nutch index test/indexes1 test/crawldb linkdb test/segments/20060522144050
>> nutch merge index2 test/indexes test/indexes1
>>
>> Now I am not sure what to do; if I rename test/index2 to
>> test/indexes after having removed test/indexes, I am no longer
>> able to search.
>>
>> -Corrado
>>
>> Sudhi Seshachala
>> http://sudhilogs.blogspot.com/
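
P.S. On your attempt (A): my guess, and it is only a guess since I have not gotten mergesegs working myself, is that it wants an output directory that is different from the input segments directory, along these lines:

nutch mergesegs test/segments-merged -dir test/segments
nutch invertlinks test/linkdb test/segments-merged/<merged-segment>
nutch index test/indexes-new test/crawldb test/linkdb test/segments-merged/<merged-segment>

Here <merged-segment> stands for whatever single segment mergesegs writes into the output directory, and the -merged/-new names are only placeholders of mine. If someone who knows the 0.8-dev SegmentMerger can confirm or correct this, I'd appreciate it.
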
--
http://JacobBrunson.com
