[Nutch-general] Incremental crawl again ... (Please explain)

zzcgiacomini Mon, 22 May 2006 08:48:53 -0700

I am currently using the latest nightly nutch-0.8-dev build, and I am really confused about how to proceed after I have done two different "whole web" incremental crawls.

The tutorial is not clear to me on how to merge the results of the two crawls so that I can run a search.

Could someone please give me a hint on what the right procedure is? Here is what I am doing:


1. create an initial urls file  /tmp/dmoz/urls.txt
2. hadoop dfs -put /tmp/urls/ url
3. nutch inject test/crawldb dmoz
4. nutch generate test/crawldb test/segments
5. nutch fetch test/segments/20060522144050
6. nutch updatedb test/crawldb   test/segments/20060522144050
7. nutch invertlinks linkdb test/segments/20060522144050

8. nutch index test/indexes test/crawldb linkdb test/segments/20060522144050

..and now I am able to search...
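
For reference, steps 5 to 8 of the first pass can be collected into one small shell function. This is only a sketch: the function names `run_nutch` and `first_pass` and the `DRY_RUN` and `LINKDB` variables are made up for illustration, the paths are copied from the steps above, and it assumes `nutch` is on the PATH. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Dry-run by default: print each nutch command instead of executing it.
# Set DRY_RUN=0 to really run them (assumes `nutch` is on the PATH).
DRY_RUN=${DRY_RUN:-1}
LINKDB=${LINKDB:-linkdb}   # linkdb path exactly as used in steps 7 and 8

run_nutch() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "nutch $*"
    else
        nutch "$@"
    fi
}

# Steps 5-8 for one freshly generated segment,
# e.g. test/segments/20060522144050
first_pass() {
    segment=$1
    run_nutch fetch "$segment"
    run_nutch updatedb test/crawldb "$segment"
    run_nutch invertlinks "$LINKDB" "$segment"
    run_nutch index test/indexes test/crawldb "$LINKDB" "$segment"
}
```

Calling `first_pass test/segments/20060522144050` prints the four commands in order, which makes it easy to check the paths before running anything.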

Now I run

9. nutch generate test/crawldb test/segments -topN 1000

and I end up with a new segment: test/segments/20060522151957

10. nutch fetch test/segments/20060522151957
11. nutch updatedb test/crawldb test/segments/20060522151957
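
Since each generate run creates a new segment directory whose name looks like a timestamp (20060522144050, 20060522151957), the newest segment can be picked up by sorting the names rather than copying it by hand. The helper below is a sketch under that assumption; `latest_segment` is a made-up name, and it only works when the segments live on the local filesystem, not in DFS.

```shell
#!/bin/sh
# Print the newest segment directory under the given segments directory.
# Assumes segment names are timestamps (yyyyMMddHHmmss), so that
# lexical order equals chronological order.
# Prints nothing if the directory is missing or has no subdirectories.
latest_segment() {
    ls -d "$1"/*/ 2>/dev/null | sort | tail -n 1 | sed 's:/*$::'
}
```

With the two segments above in place, `latest_segment test/segments` would print test/segments/20060522151957, which can then be passed to the fetch and updatedb steps.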

From this point on I cannot make much progress.


A) I have tried to merge the two segments into a new one, with the idea of rerunning invertlinks and index on it, but:

  nutch mergesegs test/segments -dir test/segments

  Whatever I specify as outputdir or outputsegment, I get errors.

B) I have also tried to run invertlinks on all of test/segments, with the goal of running the nutch index command to produce a second indexes directory, let's say test/indexes1, and finally running the index merge into index2:


  nutch invertlinks  test/linkdb  -dir test/segments

  This has created a new linkdb directory *NOT* under test as specified, but as <user>/linkdb-1108390519

  nutch index  test/indexes1 test/crawldb linkdb test/segments/20060522144050
  nutch merge index2 test/indexes test/indexes1

  Now I am not sure what to do. If I remove test/indexes and then rename test/index2 to test/indexes, I am not able to search anymore.
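
To make approach B easier to compare between runs, the three commands can be written as one function that uses a single linkdb path throughout and indexes the segment passed to it. This is not a confirmed fix, only a sketch: `index_new_segment`, `run_nutch` and `DRY_RUN` are made-up names, using test/linkdb for both invertlinks and index is an assumption (the commands above mix test/linkdb and linkdb), and passing the second segment to index is a guess at the intent. It prints the commands unless DRY_RUN=0.

```shell
#!/bin/sh
# Dry-run by default: print each nutch command instead of executing it.
# Set DRY_RUN=0 to really run them (assumes `nutch` is on the PATH).
DRY_RUN=${DRY_RUN:-1}

run_nutch() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "nutch $*"
    else
        nutch "$@"
    fi
}

# Approach B for one not-yet-indexed segment.
#   $1  new segment, e.g. test/segments/20060522151957
#   $2  linkdb path (assumed default: test/linkdb, used for both steps)
index_new_segment() {
    new_segment=$1
    linkdb=${2:-test/linkdb}
    run_nutch invertlinks "$linkdb" -dir test/segments
    run_nutch index test/indexes1 test/crawldb "$linkdb" "$new_segment"
    run_nutch merge index2 test/indexes test/indexes1
}
```

For example, `index_new_segment test/segments/20060522151957` prints the invertlinks, index and merge commands with the same linkdb path in each.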


-Corrado



