I looked at the referenced message at http://www.mail-archive.com/[email protected]/msg03990.html but I am still having problems.
I am running the latest checkout from Subversion. These are the commands I've run:

  bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 10000
  bin/nutch generate crawl/crawldb crawl/segments -topN 500
  lastsegment=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $lastsegment
  bin/nutch updatedb crawl/crawldb $lastsegment
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $lastsegment

The last command fails with a java.io.IOException saying:

  Output directory /home/nutch/nutch/crawl/indexes already exists

So I'm confused, because it seems like I did exactly what was described in the referenced email, but it didn't work for me. Can someone help me figure out what I'm doing wrong, or what I need to do instead?

Thanks,
Jacob

On 5/22/06, sudhendra seshachala <[EMAIL PROTECTED]> wrote:
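Two things in the sequence above can be exercised without Nutch itself. First, the lastsegment line relies on segment names being timestamps, so the newest one sorts last. Second, the IOException comes from the Hadoop job behind "nutch index" refusing to write into an output directory that already exists; one workaround (a sketch, not necessarily the intended fix) is to index into a fresh directory and swap it in afterwards. Directory names below are the ones from the message; the mkdir only stands in for the real index job:

```shell
# Segment names are timestamps, so lexicographic sort order is
# chronological order and tail -1 picks the newest segment.
mkdir -p crawl/segments/20060522144050 crawl/segments/20060522151957
lastsegment=`ls -d crawl/segments/2* | tail -1`
echo "$lastsegment"   # -> crawl/segments/20060522151957

# Workaround sketch for "Output directory ... already exists":
# build the new index beside the old one, then swap.
mkdir -p crawl/indexes.new   # stand-in for: bin/nutch index crawl/indexes.new crawl/crawldb crawl/linkdb $lastsegment
rm -rf crawl/indexes.old
if [ -d crawl/indexes ]; then mv crawl/indexes crawl/indexes.old; fi
mv crawl/indexes.new crawl/indexes
```

Keeping the previous index around as crawl/indexes.old also gives you something to roll back to if the new index turns out to be bad.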
Please do follow the link below:
http://www.mail-archive.com/[email protected]/msg03990.html

I have been able to follow the thread as explained and merge multiple crawls. It works like a champ.

Thanks,
Sudhi

zzcgiacomini <[EMAIL PROTECTED]> wrote:

I am currently using the latest nightly nutch-0.8-dev build, and I am really confused about how to proceed after I have done two different "whole web" incremental crawls. The tutorial is not clear to me on how to merge the results of the two crawls so that I can search over both. Could someone please give me a hint on the right procedure? Here is what I am doing:

1. create an initial urls file /tmp/dmoz/urls.txt
2. hadoop dfs -put /tmp/urls/ url
3. nutch inject test/crawldb dmoz
4. nutch generate test/crawldb test/segments
5. nutch fetch test/segments/20060522144050
6. nutch updatedb test/crawldb test/segments/20060522144050
7. nutch invertlinks linkdb test/segments/20060522144050
8. nutch index test/indexes test/crawldb linkdb test/segments/20060522144050

...and now I am able to search. Next I run:

9. nutch generate test/crawldb test/segments -topN 1000

which leaves me with a new segment, test/segments/20060522151957.

10. nutch fetch test/segments/20060522151957
11.
nutch updatedb test/crawldb test/segments/20060522151957

From this point on I cannot make any progress.

A) I have tried to merge the two segments into a new one, with the idea of rerunning invertlinks and index on it:

  nutch mergesegs test/segments -dir test/segments

but whatever I specify as the output dir or output segment, I get errors.

B) I have also tried to run invertlinks on all of test/segments, with the goal of then running the index command to produce a second indexes directory, say test/indexes1, and finally merging the indexes into index2:

  nutch invertlinks test/linkdb -dir test/segments

This created a new linkdb directory *NOT* under test as specified, but at /linkdb-1108390519.

  nutch index test/indexes1 test/crawldb linkdb test/segments/20060522144050
  nutch merge index2 test/indexes test/indexes1

Now I am not sure what to do; if I rename test/index2 to test/indexes after having removed test/indexes, I am no longer able to search.

-Corrado

Sudhi Seshachala
http://sudhilogs.blogspot.com/
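For what it is worth, the usual stumbling block with option (A) above is that mergesegs needs an output directory distinct from its input: "mergesegs test/segments -dir test/segments" asks it to write into the very directory it is reading. A sketch of the full merge-then-reindex sequence follows. Since Nutch cannot run here, a stub function stands in for bin/nutch (drop the stub and call bin/nutch directly on a real install), the directory names test/segments_merged and test/indexes2 are made up for illustration, and the exact argument order should be checked against the usage string each tool prints when run without arguments:

```shell
# Stub so the sketch can run anywhere; it just records each invocation.
# On a real install, delete this function and use bin/nutch instead.
nutch() { echo "+ nutch $*"; echo "$1" >> nutch-calls.log; }

# Merge every segment into a fresh directory OUTSIDE test/segments,
# then rebuild the link database and the index from the merged segment.
nutch mergesegs test/segments_merged -dir test/segments
nutch invertlinks test/linkdb -dir test/segments_merged
nutch index test/indexes2 test/crawldb test/linkdb test/segments_merged/*
# Only after the new index is complete, swap it into place:
#   rm -rf test/indexes && mv test/indexes2 test/indexes
```

The swap is kept for last so that a half-built index never replaces a working one; the search front end should only ever see a complete indexes directory.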
-- http://JacobBrunson.com
